NumaConnect - the Technology
Cache Coherent Shared Memory
The big differentiator for NumaChip compared to other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single-system-image machines, which may contain thousands of processors. The architecture is commonly classified as ccNUMA or NUMA, but the interconnect can alternatively be used as a low-latency clustering interconnect.
Shared memory machines have a number of advantages that lead experts to regard the architecture as the holy grail of computing compared to clusters:
- Any processor can access any data location through direct load and store operations - easier programming, less code to write and debug
- Compilers can automatically exploit loop level parallelism - higher efficiency with less human effort
- System administration relates to a unified system as opposed to a large number of separate images in a cluster - less effort to maintain
- Resources can be mapped and used by any processor in the system - optimal use of resources in a virtualized environment
- Process scheduling is synchronized through a single, real-time clock - avoids serialization of scheduling associated with asynchronous operating systems in a cluster and the corresponding loss of efficiency
These features are all available in high-cost mainframe systems from IBM, Sun, HP and Silicon Graphics. The only catch is that these systems carry a price tag that is 30 times higher per CPU core than that of commodity servers. At the low end, the multiprocessor machines from Intel and AMD have proven multiprocessing to be extremely popular at commodity price levels: dual-socket machines sell in by far the highest volumes.
Integration in new semiconductor technology
The NumaChip project is first and foremost a project for integrating existing intellectual property in modern semiconductor technology. The two chips in the figure below contain the logic for the cache coherency functionality and the logic to handle transactions between servers with multiple processors, memory and caches. The cache coherency is managed by a specialized processor, called a transaction engine. The transaction engine is in turn controlled by a microprogram that is loaded into the chip during the power-up sequence. This means that functions can be field-upgraded by software.
The previous version of this chip set also required two chips to interface to the processor bus of the Intel P6 family (Pentium Pro was the first generation). The new chip integrates the cache controller, six link controllers, the equivalent of the two bus interface chips in the form of a HyperTransport interface, and six 4-lane serial interfaces. This saves space, power and cost.
Designed for Scalability and Robustness
The original design was aimed at scaling to very large numbers of processors. To accommodate this, design decisions were made that are still valid.
These design decisions include a 64-bit global address space in which 16 bits address up to 65,536 physical nodes, each of which can hold multiple processors. Each node can have 256 terabytes of memory, adding up to a maximum system addressing capacity of 16 exabytes (2**64 bytes).
NumaChip implements 12 bits for the physical node address, limiting the number of nodes in a single image system to 4,096. Each node can have up to 16 processor cores.
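The addressing arithmetic above can be checked with a short sketch. The bit layout used here (node ID in the upper 16 bits, per-node offset in the lower 48 bits) is an assumption made for illustration; the text only gives the field widths, not their positions. The names `split_address` and `make_address` are hypothetical helpers.

```python
# Sketch of the 64-bit global address arithmetic described above.
# ASSUMPTION: the node ID sits in the upper 16 bits and the per-node
# offset in the lower 48 bits. The source states the widths only.
ADDRESS_BITS = 64
NODE_BITS = 16                            # architectural node field
OFFSET_BITS = ADDRESS_BITS - NODE_BITS    # 48 bits of offset per node
IMPLEMENTED_NODE_BITS = 12                # what NumaChip implements

OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_address(node, offset):
    """Combine a node ID and a per-node offset into a global address."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("node ID out of range")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset out of range")
    return (node << OFFSET_BITS) | offset


def split_address(addr):
    """Return (node, offset) for a 64-bit global address."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    return addr >> OFFSET_BITS, addr & OFFSET_MASK


max_nodes = 1 << NODE_BITS                        # 65,536 nodes
bytes_per_node = 1 << OFFSET_BITS                 # 2**48 bytes = 256 TB
total_bytes = max_nodes * bytes_per_node          # 2**64 bytes = 16 EB
implemented_nodes = 1 << IMPLEMENTED_NODE_BITS    # 4,096 nodes
```

With 16 cores per node, the 4,096 implemented nodes give an upper bound of 65,536 cores in a single system image.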
Functionality was included to address the robustness issues associated with high node counts and to meet extremely high requirements for data integrity, providing high availability for systems that manage critical data in transaction processing and real-time control.
A directory-based cache coherence protocol was developed to handle scaling to a significant number of nodes sharing data. It avoids overloading the interconnect between nodes with coherency traffic, which would seriously reduce real data throughput.
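To illustrate why a directory reduces coherency traffic, here is a minimal sketch of a generic textbook directory scheme. It is not NumaChip's actual protocol, which the text does not describe in detail; the class `Directory`, its methods and the function `broadcast_invalidations` are hypothetical names chosen for this example. The directory records which nodes hold a copy of each cache line, so a write only needs to invalidate those nodes instead of every node in the system.

```python
# Toy model of a directory-based invalidation scheme (illustrative only;
# NOT the NumaChip protocol). The directory tracks the sharers of each
# cache line so invalidations go only to nodes that hold a copy.
class Directory:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}              # line address -> set of node IDs
        self.invalidations_sent = 0    # running count of messages

    def read(self, node, line):
        """Record that `node` now holds a shared copy of `line`."""
        self.sharers.setdefault(line, set()).add(node)

    def write(self, node, line):
        """Give `node` exclusive ownership; return the nodes invalidated."""
        holders = self.sharers.get(line, set())
        targets = holders - {node}
        self.invalidations_sent += len(targets)
        self.sharers[line] = {node}
        return sorted(targets)


def broadcast_invalidations(num_nodes):
    """Messages a broadcast scheme would send for one write."""
    return num_nodes - 1


directory = Directory(num_nodes=4096)
directory.read(1, 0x40)
directory.read(7, 0x40)
invalidated = directory.write(1, 0x40)   # only node 7 holds another copy
```

In this example the write triggers a single invalidation message, whereas a broadcast scheme on 4,096 nodes would send 4,095.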
The basic ring topology with distributed switching allows a number of different interconnect configurations that are more scalable than most other interconnect switch fabrics. This also eliminates the need for a centralized switch and includes inherent redundancy for multidimensional topologies.
Integrated, distributed switching
NumaChip contains an on-chip switch to connect to other nodes in a NumaChip-based system, eliminating the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions. Small systems can use one dimension, medium-sized systems two, and large systems all three to provide efficient and scalable connectivity between processors.
The two- and three-dimensional topologies (called Torus) also have the advantage of built-in redundancy as opposed to systems based on centralized switches, where the switch represents a single point of failure.
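The connectivity of a torus can be sketched in a few lines. The functions below are hypothetical helpers written for illustration, not part of any NumaChip software. They model a torus as a grid with wraparound links in each dimension, list a node's direct neighbours, and compute the minimum hop count between two nodes, assuming traffic can travel in either direction around each ring.

```python
# Sketch of torus connectivity: each dimension is a ring, so every
# coordinate wraps around. Helper names are illustrative only.
def torus_neighbors(coord, dims):
    """Return the distinct direct neighbours of `coord` in a torus."""
    result = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            neighbor = list(coord)
            neighbor[axis] = (coord[axis] + step) % size
            result.add(tuple(neighbor))
    result.discard(tuple(coord))   # rings of size 1 wrap onto themselves
    return sorted(result)


def torus_hops(a, b, dims):
    """Minimum number of hops between nodes `a` and `b`."""
    total = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y)
        total += min(direct, size - direct)   # go whichever way is shorter
    return total


ring = torus_neighbors((0,), (4,))                  # 1-D ring of 4 nodes
cube = torus_neighbors((0, 0, 0), (4, 4, 4))        # 3-D torus, 64 nodes
corner_to_corner = torus_hops((0, 0), (3, 3), (4, 4))
```

In a three-dimensional torus with more than two nodes per dimension, each node has six direct neighbours, so there are several alternative paths between any pair of nodes. This is the built-in redundancy referred to above.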
The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.

