NumaConnect - the Technology
Cache Coherent Shared Memory
The big differentiator for NumaConnect compared to other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single system image machines that may contain thousands of processors. The architecture is commonly classified as ccNUMA or NUMA, but the interconnect can alternatively be used as a low-latency clustering interconnect.
Shared memory machines offer a number of advantages over clusters that lead many experts to regard the architecture as the holy grail of computing:
- Any processor can access any data location through direct load and store operations - easier programming, less code to write and debug
- Compilers can automatically exploit loop level parallelism - higher efficiency with less human effort
- System administration relates to a unified system as opposed to a large number of separate images in a cluster - less effort to maintain
- Resources can be mapped and used by any processor in the system - optimal use of resources in a single image operating system environment
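The first bullet, direct load and store access to shared data, can be sketched with a small example. This is an illustrative sketch of the shared-memory programming model using ordinary threads, not NumaConnect hardware: one execution context stores to a location, and another observes the value with a plain load, with no explicit message passing in between.

```python
import threading

data = bytearray(16)          # one shared "memory" region, visible to all threads

def worker() -> None:
    data[0] = 42              # a plain store -- no send/receive calls needed

def main() -> int:
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return data[0]            # a plain load observes the worker's store

if __name__ == "__main__":
    print(main())             # prints 42
```

In a cluster, the same exchange would require explicit send and receive calls plus buffer management on both sides, which is the extra code the bullet list refers to.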
These features are all available in high-cost mainframe systems. The main catch is that such systems carry a price tag more than 10 times higher per CPU core than commodity servers based on the x86-64 architecture.
The architecture heritage of NumaConnect dates back to the development of the IEEE standard 1596, Scalable Coherent Interface (SCI). SCI was architected upon three major pillars: scalability, a global shared address space, and cache/memory coherence.
These principles led to a definition of the packet format with support for a global address space of 64 bits, with 16 bits to address 65,536 physical nodes, where each node can hold multiple processors. Each node can then have 256 terabytes of memory, adding up to a maximum system addressing capacity of 16 exabytes (2^64 bytes). In that respect, the architects had the foresight to envision systems in the exascale range.
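The address-space partition described above can be checked with simple arithmetic: 64 address bits split into a 16-bit node ID and a 48-bit per-node offset.

```python
NODE_ID_BITS = 16
OFFSET_BITS = 64 - NODE_ID_BITS          # 48 bits of memory address per node

max_nodes = 2 ** NODE_ID_BITS            # 65,536 physical nodes
bytes_per_node = 2 ** OFFSET_BITS        # 256 TB of memory per node
total_bytes = max_nodes * bytes_per_node # full 64-bit address space

assert max_nodes == 65_536
assert bytes_per_node == 256 * 2 ** 40   # 256 terabytes (2^40 bytes per TB)
assert total_bytes == 2 ** 64            # 16 exabytes

# NumaChip version 1 implements 12 of the 16 node-ID bits,
# giving 2^12 = 4,096 nodes in a single image system.
assert 2 ** 12 == 4_096
```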
Numascale calls the technology and board-level products NumaConnect™, and the integrated circuit that contains the logic NumaChip™.
Version 1 of NumaChip is implemented in a standard cell ASIC manufactured by IBM Semiconductor. It contains the cache coherence logic, the interface to AMD HyperTransport™ (HT), two DRAM controllers for the NumaCache™ and tags, a mapping layer between the AMD HT snooping protocol and the directory-based SCI protocol, a 7-way crossbar switch, 6 link controllers, and 6 high-speed serial ports with 4 lanes each.
NumaChip version 1 implements 12 bits for the physical node address, limiting the number of nodes in a single image system to 4,096. Each node supports multiple multicore processors.
Functionality was included to manage the robustness issues associated with high node counts and to meet stringent data-integrity requirements, enabling high availability for systems managing critical data in transaction processing and real-time control.
The directory-based cache coherence protocol was developed to scale to a significant number of nodes sharing data without overloading the interconnect with coherency traffic, which would otherwise seriously reduce real data throughput.
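Why a directory scales better than broadcast snooping can be shown with a minimal sketch. This is not NumaChip's actual protocol, just a generic MSI-style directory for illustration: on a write, invalidations go only to the nodes the directory lists as sharers of that line, not to every node in the system.

```python
class DirectoryEntry:
    """Per-cache-line directory state: MSI state plus the set of sharers."""
    def __init__(self) -> None:
        self.state = "I"                  # I(nvalid), S(hared), or M(odified)
        self.sharers: set[int] = set()

class Directory:
    def __init__(self) -> None:
        self.lines: dict[int, DirectoryEntry] = {}
        self.messages = 0                 # coherency messages on the interconnect

    def _entry(self, addr: int) -> DirectoryEntry:
        return self.lines.setdefault(addr, DirectoryEntry())

    def read(self, node: int, addr: int) -> None:
        e = self._entry(addr)
        if e.state == "M":
            self.messages += 1            # fetch the dirty copy from the owner
        e.sharers.add(node)
        e.state = "S"

    def write(self, node: int, addr: int) -> None:
        e = self._entry(addr)
        others = e.sharers - {node}
        self.messages += len(others)      # invalidate only the listed sharers
        e.sharers = {node}
        e.state = "M"
```

With two nodes sharing a line, a write by a third sends exactly two invalidations; a broadcast protocol would have to message every node in the system regardless of how many actually hold a copy.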
The basic ring topology with distributed switching allows a number of different interconnect configurations that are more scalable than most other switch fabrics. It also eliminates the need for a centralized switch and provides inherent redundancy in multidimensional topologies.
Integrated, distributed switching
NumaChip contains an on-chip switch to connect to other nodes in a NumaConnect system, eliminating the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions. Small systems can use one dimension, medium-sized systems two, and large systems all three to provide efficient and scalable connectivity between processors. The fabric routing is controlled through routing tables that are initialized by system software at the BIOS level. This software is called the NumaConnect Bootloader.
The two- and three-dimensional torus topologies also have the advantage of built-in redundancy, as opposed to systems based on centralized switches, where the switch represents a single point of failure.
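The effect of the wrap-around links in a torus can be sketched with a small hop-count calculation. This is an illustrative model, not NumaConnect's actual routing tables: in each dimension a packet can travel either direction around the ring, so the shortest distance is the minimum of the direct path and the wrap-around path.

```python
def torus_hops(src: tuple[int, ...], dst: tuple[int, ...],
               dims: tuple[int, ...]) -> int:
    """Minimal hop count between two nodes in a 1D, 2D, or 3D torus."""
    hops = 0
    for s, d, size in zip(src, dst, dims):
        straight = abs(d - s)
        hops += min(straight, size - straight)  # wrap-around may be shorter
    return hops

# In a 4x4 2D torus, node (0,0) reaches (3,3) in 2 hops via the
# wrap-around links instead of 6 hops the "long way".
```

The two directions around each ring are also what gives the fabric its redundancy: if one link fails, the routing tables can steer traffic the other way around that dimension.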
The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.