Node Controller


Scale-up Node Controller Technology

Numascale has developed a unique node controller hardware technology called NumaConnect™ to lower the barrier for system vendors to design scale-up servers. The architecture is based on a generic, global cache/memory coherency model and is inherently processor architecture agnostic. To date it has been implemented for AMD Opteron with HyperTransport (HT1, HT2, HT3) through NumaChip-1 (ASIC) and NumaChip-2 (FPGA), for Intel UPI 1.0 with NumaChip-3, and now for Intel’s Ultra Path Interconnect™ (UPI 2.0) technology on the Eagle Stream and Birch Stream platforms with NumaChip-5 (ASIC).

Architecture

Innovative and groundbreaking coherent shared memory technology

The architecture heritage of NumaConnect™ dates back to the development of the IEEE 1596 standard, Scalable Coherent Interface (SCI). SCI was architected upon three major pillars: scalability, a global shared address space, and cache/memory coherence.

These principles led to the definition of a packet format with support for a global address space of 64 bits, where 16 bits address 65,536 physical nodes and each node can hold multiple processors. Each node can then hold 256 TeraBytes of memory, adding up to a maximum system addressing capacity of 16 ExaBytes (2^64 bytes). In that respect, the architects had the foresight to envision systems in the exascale range.
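
As a minimal illustration of this addressing scheme, the C sketch below splits a 64-bit global address into the 16-bit node ID and the 48-bit in-node displacement described above. The field positions follow the text; the exact hardware encoding is an assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative decomposition of the 64-bit SCI-style global address:
 * the upper 16 bits select one of 65,536 nodes and the lower 48 bits
 * give the byte displacement (up to 256 TB) within that node.
 * Field positions follow the text; the real hardware encoding may
 * differ in detail. */

#define NODE_ID_SHIFT   48
#define OFFSET_MASK     ((UINT64_C(1) << NODE_ID_SHIFT) - 1)

static inline uint16_t global_node_id(uint64_t gaddr)
{
    return (uint16_t)(gaddr >> NODE_ID_SHIFT);
}

static inline uint64_t global_offset(uint64_t gaddr)
{
    return gaddr & OFFSET_MASK;
}

int main(void)
{
    /* Example: an address on node 0x002A at displacement 0x1234. */
    uint64_t gaddr = (UINT64_C(0x002A) << NODE_ID_SHIFT) | UINT64_C(0x1234);
    printf("node %u, offset 0x%llx\n",
           global_node_id(gaddr),
           (unsigned long long)global_offset(gaddr));
    return 0;
}
```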

The big differentiator for NumaConnect compared to other high-speed interconnect technologies is the shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory mapped I/O device in a multiprocessor system with a high degree of efficiency.

It provides scalable servers with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single-system-image machines that may contain more than a thousand processor cores and many terabytes of memory. This architecture is classified as ccNUMA (cache-coherent Non-Uniform Memory Access), or simply NUMA.

Shared memory machines have a number of advantages over clusters that lead experts to regard the architecture as the holy grail of computing:

  • Any processor can access any data location through direct load and store operations – easier programming, less code to write and debug

  • Compilers can automatically exploit loop level parallelism – higher efficiency with less human effort (see the sketch after this list)

  • System administration relates to a unified system as opposed to a large number of separate images in a cluster – less effort to maintain

  • Resources can be mapped and used by any processor in the system – optimal use of resources in a single image operating system environment

  • No need to decompose or duplicate data sets for scaling – significantly shorter development time for applications such as scale-up and scale-out graph processing
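
As an illustration of the first two points, the short C/OpenMP sketch below sums two shared arrays; because every core can load and store the same memory directly, a single pragma is enough to parallelize the loop, with no data decomposition or message passing. The example is generic shared-memory code, not NumaConnect-specific.

```c
#include <stdio.h>

/* Sketch of why shared memory simplifies parallel programming: every
 * core can read and write the same arrays directly, so one OpenMP
 * pragma (or an auto-parallelizing compiler) can spread the loop over
 * all cores without decomposing the data. Purely illustrative. */

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (long i = 0; i < N; i++) { a[i] = (double)i; b[i] = 2.0 * i; }

    /* On a ccNUMA system this loop can run on all cores in the single
     * system image; each thread simply loads and stores the shared arrays. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```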

These features are all available in high-cost mainframe systems. The main catch is that these systems come with a price tag that is much higher per CPU core compared with servers based on the x86 architecture.

The NumaConnect™ architecture is based on a 64-bit global address space where 16 bits are used to identify nodes and 48 bits are used as the displacement address within a node. For current implementations, where the CPUs support less than a 64-bit physical address space, some of the most significant address bits are shifted up into the node address field. Cache lines are 64-byte entities, which fits all modern processor architectures. The link-layer protocol supports a format with little overhead and with the flexibility to use packets of 64 bytes for cache-coherent transactions or 256 bytes for block transfers. This ensures high efficiency with respect to overhead and fairness of forward progress for all transactions.

Scalability is ensured through a distributed, directory-based cache coherence protocol that efficiently reduces the amount of coherency (snoop) traffic that must traverse the interconnect fabric. The directory maps the entire system memory and is implemented with pointers to the nodes that share data for any particular cache line.

The interconnect fabric can be configured in different topologies by the use of routing tables, and the number of fabric channels is implemented according to customer requirements. The first implementation of NumaChip™ supported 6 x 4-lane channels to accommodate 3-D torus topologies. The current NumaChip-3 implements 8 x 8-lane channels to support 8 inter-node links; one of the links is implemented as an expansion link to support up to 16 nodes with 32 CPU sockets. The links run at 25 Gbps and support the full Intel UPI 1.0 bandwidth between nodes.
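
To make the directory idea concrete, the hedged C sketch below models a per-cache-line directory entry and the lookup that determines which sharing nodes need an invalidation before a write. The entry layout, state names and the use of a sharer bit vector (the text describes pointers to nodes) are illustrative assumptions, not the NumaChip format.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical directory-based coherence lookup: each 64-byte cache
 * line of local memory has a directory entry recording its state and
 * which remote nodes hold a copy, so invalidations (snoops) go only to
 * those nodes instead of being broadcast across the fabric. */

#define MAX_NODES 16            /* e.g. a 16-node NumaChip-3 system */

enum dir_state { DIR_INVALID, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    uint16_t sharers;           /* one bit per node caching the line */
};

/* Collect the nodes that must be invalidated before 'writer' may
 * modify the line; returns the number of targets. */
static int invalidation_targets(const struct dir_entry *e, int writer,
                                int targets[MAX_NODES])
{
    int n = 0;
    for (int node = 0; node < MAX_NODES; node++)
        if ((e->sharers & (1u << node)) && node != writer)
            targets[n++] = node;
    return n;
}

int main(void)
{
    struct dir_entry e = { DIR_SHARED, 0 };
    e.sharers |= (1u << 2) | (1u << 5);     /* nodes 2 and 5 cache the line */

    int targets[MAX_NODES];
    int n = invalidation_targets(&e, 2, targets);   /* node 2 wants to write */
    printf("%d node(s) to invalidate, first is node %d\n", n, targets[0]);
    return 0;                                        /* only node 5 is snooped */
}
```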

Since the NumaChip™ is designed to be implemented in ASIC technology, the microarchitecture is designed with emphasis on programmability. The main programmable unit is the MPE (Microprogrammed Protocol Engine). The design contains several parallel MPEs, the number being set by customer requirements before the ASIC is manufactured, or chosen more dynamically if the target implementation vehicle is an FPGA. Each MPE contains a number of Multi Context Micro Sequencers (MCMs) that operate on Fully Associative Context Blocks (FACBs). The microarchitecture also contains a Block Transfer Engine (BTE), an RDMA-type functional unit that can be used directly from user level to enhance the performance of block data transfers between different memory locations. The BTE is typically used by Byte Transfer Libraries (BTL) for message passing (MPI), by BLACS (Basic Linear Algebra Communication Subprograms) for ScaLAPACK, or by other customer-defined functions. Solutions based on the FPGA platform can also be configured to accommodate customer-defined functions for acceleration of specific algorithms.

Processor agnostic microarchitecture

The microprogrammed microarchitecture allows for a short time to solution when supporting a new coherence protocol. The main changes needed to support a new coherence protocol are to change the physical layer above the Processor Interface Unit (PIU) that communicates with the CPU module and to modify the firmware microcode that controls the MPEs. The remaining parts of the design, which account for approximately 75% of the logic gates and on-chip memory, can remain unchanged.
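
As a rough illustration of how software might use an RDMA-style offload unit such as the BTE described above, the C sketch below routes large copies through a hypothetical bte_transfer() helper and leaves small copies on the CPU. The descriptor layout, the threshold and the helper itself are invented for illustration; the engine is emulated with memcpy so the example runs anywhere and does not reflect the actual NumaChip programming interface.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical sketch: a communication library hands large block
 * copies to an offload engine and keeps small copies on the CPU.
 * BTE_THRESHOLD and struct bte_descriptor are made-up values. */

#define BTE_THRESHOLD 4096      /* assumed cutover point for offload */

struct bte_descriptor {
    void       *dst;
    const void *src;
    size_t      length;
};

static void bte_transfer(const struct bte_descriptor *d)
{
    /* On real hardware this would post the descriptor to the engine
     * and wait for completion; a CPU copy stands in for it here. */
    memcpy(d->dst, d->src, d->length);
}

static void copy_block(void *dst, const void *src, size_t len)
{
    if (len >= BTE_THRESHOLD) {
        struct bte_descriptor d = { dst, src, len };
        bte_transfer(&d);       /* large block: offload */
    } else {
        memcpy(dst, src, len);  /* small block: plain CPU copy */
    }
}

int main(void)
{
    static char src[8192], dst[8192];
    memset(src, 0xAB, sizeof src);
    copy_block(dst, src, sizeof dst);
    printf("first byte: 0x%02x\n", (unsigned char)dst[0]);
    return 0;
}
```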

The NumaChip architecture contains an on-chip switch that connects to the other nodes in a NumaConnect system and eliminates the need for a centralized switch. NumaChip-1 had a 7-port on-chip switch configured to connect systems in one, two or three dimensions: small systems could use one dimension, medium-sized systems two, and large systems all three to provide efficient and scalable connectivity between processors. NumaChip-3 implements a full crossbar with 8 external fabric ports. The fabric routing is controlled through routing tables that are initialized by system software at the BIOS level; this software is called the NumaConnect Bootloader.
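
The following C sketch shows table-driven routing in miniature: a routing table, which a bootloader would fill in from the chosen topology, maps a destination node ID to an output fabric port. The table size and port assignment are made-up values for a hypothetical 16-node, 8-port configuration.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal sketch of table-driven fabric routing: the on-chip switch
 * looks up a packet's destination node ID in a table filled in by
 * system software at boot and forwards it on the chosen port. */

#define NUM_NODES 16
#define NUM_PORTS 8

static uint8_t route_table[NUM_NODES];   /* dest node -> output port */

static void init_routes(void)
{
    /* A bootloader would derive these entries from the chosen
     * topology; here each destination simply maps to port (node % 8). */
    for (int node = 0; node < NUM_NODES; node++)
        route_table[node] = (uint8_t)(node % NUM_PORTS);
}

static uint8_t route_packet(uint16_t dest_node)
{
    return route_table[dest_node % NUM_NODES];
}

int main(void)
{
    init_routes();
    printf("packet for node 11 leaves on port %u\n", route_packet(11));
    return 0;
}
```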

Distributed switching topologies also have the advantage of built-in redundancy as opposed to systems based on centralized switches, where the switch represents a single point of failure.

The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.