NumaConnect - the Technology

Cache Coherent Shared Memory

The big differentiator for NumaChip compared with other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single-system-image machines, which may contain thousands of processors. The architecture is commonly classified as ccNUMA or NUMA, but the interconnect can alternatively be used as a low-latency clustering interconnect.

 

Shared memory machines have a number of advantages that lead experts to regard the architecture as the holy grail of computing compared with clusters:

 

These features are all available in high-cost mainframe systems from IBM, Sun, HP and Silicon Graphics. The only catch is that these systems carry a price tag that is 30 times higher per CPU core than commodity servers. At the low end, multiprocessor machines from Intel and AMD have proven multiprocessing to be extremely popular at commodity price levels: dual-socket machines sell in by far the highest volumes.

 

Integration in new semiconductor technology

The NumaChip project is first and foremost a project for integrating existing intellectual property in modern semiconductor technology. The two chips in the figure below contain the logic for the cache coherency functionality and the logic to handle transactions between servers with multiple processors, memory and caches. The cache coherency is managed by a specialized processor, called a transaction engine. The transaction engine is in turn controlled by a microprogram that is loaded into the chip during the power-up sequence. This means that functions can be field-upgraded by software.

 

The previous version of this chip set also required two chips to interface to the processor bus of the Intel P6 family (Pentium Pro was the first generation). The new chip integrates the functionality of the cache controller, six link controllers, the equivalent of the two bus interface chips in the form of a HyperTransport interface, and six 4-lane serial interfaces. This saves space, power and cost.

 

Designed for Scalability and Robustness

The original design was aimed at scaling to very large numbers of processors. To accommodate this, design decisions were made that are still valid.

These design decisions cover areas such as a 64-bit global address space, in which 16 bits address up to 65,536 physical nodes, each of which can hold multiple processors. The remaining 48 bits give each node up to 256 TeraBytes of memory, adding up to a maximum system addressing capacity of 16 ExaBytes (2**64 bytes).

 

NumaChip implements 12 bits for the physical node address, limiting the number of nodes in a single image system to 4,096. Each node can have up to 16 processor cores.
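The addressing figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a description of the actual hardware address map: it assumes the node identifier occupies the top bits of the 64-bit address, which the text does not state, and the names capacity and split_address are made up for illustration.

```python
# Sketch of the address arithmetic described in the text.
# ASSUMPTION: the node ID sits in the most significant bits of the address.
# The text gives only the bit counts, not the layout.

ADDRESS_BITS = 64        # global address space (from the text)
ARCH_NODE_BITS = 16      # node bits in the original design (from the text)
NUMACHIP_NODE_BITS = 12  # node bits implemented by NumaChip (from the text)

TIB = 1 << 40  # "TeraBytes" in the text, taken as 2**40 bytes
EIB = 1 << 60  # "ExaBytes" in the text, taken as 2**60 bytes


def capacity(node_bits, address_bits=ADDRESS_BITS):
    """Return (max_nodes, bytes_per_node) for a given node-ID width."""
    offset_bits = address_bits - node_bits
    return 1 << node_bits, 1 << offset_bits


def split_address(addr, node_bits=ARCH_NODE_BITS, address_bits=ADDRESS_BITS):
    """Split a global address into (node_id, offset_within_node)."""
    offset_bits = address_bits - node_bits
    return addr >> offset_bits, addr & ((1 << offset_bits) - 1)


nodes, bytes_per_node = capacity(ARCH_NODE_BITS)
print(nodes, bytes_per_node // TIB, nodes * bytes_per_node // EIB)  # 65536 256 16
print(1 << NUMACHIP_NODE_BITS)                                      # 4096
print(split_address(0x0003_0000_0000_1000))                         # (3, 4096)
```

With 16 node bits the numbers match the text: 65,536 nodes, 256 TeraBytes per node and 16 ExaBytes in total. With 12 node bits the node count drops to 4,096.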

Functionality was included to manage the robustness issues associated with high node counts and to meet extremely high requirements for data integrity, providing high availability for systems that manage critical data in transaction processing and real-time control.

 

A directory-based cache coherence protocol was developed to handle scaling when a significant number of nodes share data. It avoids overloading the interconnect between nodes with coherency traffic, which would seriously reduce real data throughput.
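To illustrate why a directory reduces coherency traffic, here is a toy model. It is a generic textbook-style sketch and not the NumaChip protocol: the class ToyDirectory, its methods and the message counting are all hypothetical. The directory records which nodes hold a copy of each cache line, so a write only has to invalidate those nodes instead of broadcasting to every node in the system.

```python
# Toy directory-based coherence model (illustrative only, NOT the NumaChip
# protocol). All names here are hypothetical.
from collections import defaultdict


class ToyDirectory:
    """Tracks, per cache line, the set of nodes that hold a copy."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> set of node IDs

    def read(self, node, line):
        """A node reads a line and is recorded as a sharer."""
        self.sharers[line].add(node)

    def write(self, node, line):
        """A node writes a line.

        Returns the nodes that must receive an invalidation; afterwards
        the writer is the only recorded holder of the line.
        """
        to_invalidate = sorted(self.sharers[line] - {node})
        self.sharers[line] = {node}
        return to_invalidate


def broadcast_invalidations(total_nodes):
    """Messages needed if every other node had to be notified instead."""
    return total_nodes - 1


directory = ToyDirectory()
for reader in (1, 2, 7):
    directory.read(reader, 0x40)

targets = directory.write(7, 0x40)
print(targets)                        # [1, 2]
print(broadcast_invalidations(4096))  # 4095
```

In this example a write in a 4,096-node system triggers two invalidation messages with the directory, compared with 4,095 if every other node had to be notified.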

 

The basic ring topology with distributed switching allows a number of different interconnect configurations that are more scalable than most other interconnect switch fabrics. This also eliminates the need for a centralized switch and includes inherent redundancy for multidimensional topologies.

 

Integrated, distributed switching

NumaChip contains an on-chip switch that connects to other nodes in a NumaChip-based system, eliminating the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions. Small systems can use one dimension, medium-sized systems two, and large systems all three to provide efficient and scalable connectivity between processors.

 

The two- and three-dimensional topologies (called Torus) also have the advantage of built-in redundancy as opposed to systems based on centralized switches, where the switch represents a single point of failure.
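The ring and torus topologies described above can be sketched in a few lines. This is an illustration under stated assumptions, not the NumaChip routing algorithm: the functions torus_neighbors and torus_hops are hypothetical, and the hop count assumes traffic takes the shorter way around each ring. Every node sits on one ring per dimension, so in three dimensions it has six direct neighbours, and each ring offers two directions to reach any other node on it.

```python
# Sketch of neighbour and hop-count calculations in a 1D/2D/3D torus.
# Illustrative only; function names and routing assumptions are hypothetical.


def torus_neighbors(coord, dims):
    """Return the directly connected nodes of `coord` in a torus of size `dims`."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            candidate = list(coord)
            candidate[axis] = (coord[axis] + step) % size  # wrap around the ring
            candidate = tuple(candidate)
            # Skip self-links (ring of size 1) and duplicates (ring of size 2).
            if candidate != tuple(coord) and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


def torus_hops(src, dst, dims):
    """Minimum hop count, taking the shorter way around each ring."""
    return sum(
        min((a - b) % size, (b - a) % size)
        for a, b, size in zip(src, dst, dims)
    )


print(len(torus_neighbors((0,), (8,))))             # 2
print(len(torus_neighbors((0, 0), (4, 4))))         # 4
print(len(torus_neighbors((0, 0, 0), (4, 4, 4))))   # 6
print(torus_hops((0, 0, 0), (3, 3, 3), (4, 4, 4)))  # 3
```

Because the rings wrap around, the far corner of a 4 x 4 x 4 torus is only three hops away, and if one direction on a ring is unavailable the other direction still reaches the same node.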

 

The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.

The Change is on

 

From Cluster

To Scalable ccNUMA SMP