Numascale Node Controller Technology

Lowering the barrier for building x86-based scale-up systems

Scale-up Node Controller Technology

Numascale has developed a unique node controller hardware technology called NumaConnect™ to lower the barrier for system vendors to design scale-up systems. The architecture is based on a generic, global cache/memory coherency model.


The Numascale technology includes a modular chip microarchitecture that allows system vendors to choose the right configuration for their performance requirements. The microarchitecture includes several parallel micro-coded memory transaction engines with a large number of outstanding memory transactions, a memory controller for the cache and memory tags, a crossbar switch for the interconnect fabric, and a number of interconnect fabric link controllers. This means that there is no need for an external interconnect fabric switch: all switching is performed on-chip, and the interconnect fabric is implemented with wiring through a PCB backplane or cables. The entire NumaConnect™ design is implemented in a single FPGA or ASIC, called NumaChip™, depending on the system vendor's requirements.

HyperTransport Node Controllers


Innovative and groundbreaking coherent shared memory technology

The architecture heritage of NumaConnect dates back to the development of IEEE Standard 1596, the Scalable Coherent Interface (SCI). SCI was architected upon three major pillars: scalability, a global shared address space, and cache/memory coherence.

These principles led to the definition of a packet format supporting a global address space of 64 bits, with 16 bits addressing 65,536 physical nodes, where each node can hold multiple processors. Each node can then have 256 terabytes of memory, adding up to a maximum system addressing capacity of 16 exabytes (2^64 bytes). In that respect, the architects had the foresight to envision systems in the exascale range.
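The 16/48-bit address split above can be sketched in a few lines of Python. The constants and function names here are illustrative, not from any NumaConnect API:

```python
# Sketch of the SCI-style 64-bit global address split described above:
# 16 bits select one of 65,536 nodes; the remaining 48 bits address
# memory within a node.

NODE_BITS = 16
OFFSET_BITS = 48

def split_address(addr: int) -> tuple[int, int]:
    """Decompose a 64-bit global address into (node_id, local_offset)."""
    node_id = addr >> OFFSET_BITS
    offset = addr & ((1 << OFFSET_BITS) - 1)
    return node_id, offset

max_nodes = 1 << NODE_BITS                     # 65,536 physical nodes
per_node_bytes = 1 << OFFSET_BITS              # 256 TiB per node
total_bytes = 1 << (NODE_BITS + OFFSET_BITS)   # 16 EiB in total

print(max_nodes, per_node_bytes, total_bytes)
print(split_address(0x0003_0000_0000_1000))    # node 3, offset 0x1000
```

The same decomposition is what lets a node controller decide, from a plain physical address, whether a memory access is local or must be forwarded to a remote node.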

The big differentiator for NumaConnect compared to other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They give scalable systems a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single system image machines that may contain thousands of processors. The architecture is commonly classified as ccNUMA or NUMA, but the interconnect can alternatively be used as a low-latency clustering interconnect.

Architecture

Shared memory machines have a number of advantages over clusters that lead experts to regard the architecture as the holy grail of computing:

  • Any processor can access any data location through direct load and store operations – easier programming, less code to write and debug
  • Compilers can automatically exploit loop level parallelism – higher efficiency with less human effort
  • System administration relates to a unified system as opposed to a large number of separate images in a cluster – less effort to maintain
  • Resources can be mapped and used by any processor in the system – optimal use of resources in a single image operating system environment
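The first advantage above, direct load and store access, can be illustrated with ordinary threads, which share one address space just as processors do in a ccNUMA system. This is a simplified stand-in for the programming model, not NumaConnect-specific code:

```python
# Illustrative only: in a shared-memory (ccNUMA) system every processor can
# read and write any location with ordinary loads and stores -- no explicit
# message passing. Python threads share one address space, which makes a
# convenient (if simplified) stand-in for that model.

import threading

data = [0] * 8  # one array, directly visible to every thread

def fill(start: int, end: int, value: int) -> None:
    # Direct stores into the shared array; a cluster would instead need
    # explicit sends and receives to move this data between nodes.
    for i in range(start, end):
        data[i] = value

threads = [
    threading.Thread(target=fill, args=(0, 4, 1)),
    threading.Thread(target=fill, args=(4, 8, 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(data)  # [1, 1, 1, 1, 2, 2, 2, 2]
```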

These features are all available in high cost mainframe systems. The main catch is that these systems come with a price tag that is more than 10 times higher per CPU core compared with commodity servers based on the x86-64 architecture.

The NumaConnect™ architecture is based on a 64-bit global address space where 16 bits identify the node and 48 bits give the displacement address within that node. For current implementations, where the CPUs support less than a 64-bit physical address space, some of the address bits are shifted up into the node address field. Cache lines are 64-byte entities, which fits all modern processor architectures. The link-layer protocol has little overhead and the flexibility to use packets of 64 bytes for cache coherent transactions or 256 bytes for block transfers. This ensures high efficiency with respect to overhead and fairness of forward progress for all transactions.

Scalability is ensured through a directory-based cache coherence protocol that efficiently reduces the amount of coherency traffic that needs to traverse the interconnect fabric. The directory is implemented as a distributed doubly-linked list with pointers to the nodes that share data for any particular cache line. The architecture also supports a configurable cache, NumaCache, implemented as a level-4 write-back cache to hold data that belongs to other nodes' memories. This reduces overall latency and improves bandwidth for remote memory transactions.

The interconnect fabric can be configured in different topologies through routing tables, and the number of fabric channels is implemented according to customer requirements. The first implementation of NumaChip™ supported six 4-lane serial channels to accommodate 3-D torus topologies. New designs now target up to eight 4-lane channels or four 8-lane channels to support a reasonable number of direct links between system node main boards.
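Why a directory keeps coherence traffic low can be shown with a toy model: the directory tracks, per cache line, exactly which nodes hold a copy, so a write only invalidates those sharers rather than broadcasting system-wide. A Python list stands in for the distributed doubly-linked sharing list; all names are illustrative:

```python
# Toy model of directory-based coherence: for each cache line the home node
# keeps a record of the nodes currently sharing that line. On a write, only
# the nodes on that list receive invalidations -- coherence traffic scales
# with the number of sharers, not with system size.

class Directory:
    def __init__(self):
        # cache-line address -> sharer node ids (modelling the
        # distributed doubly-linked sharing list)
        self.sharers: dict[int, list[int]] = {}

    def record_read(self, line: int, node: int) -> None:
        nodes = self.sharers.setdefault(line, [])
        if node not in nodes:
            nodes.append(node)

    def record_write(self, line: int, node: int) -> list[int]:
        """Invalidate all other sharers; return the nodes invalidated."""
        invalidated = [n for n in self.sharers.get(line, []) if n != node]
        self.sharers[line] = [node]  # the writer becomes the sole holder
        return invalidated

d = Directory()
for n in (1, 5, 9):
    d.record_read(0x4000, n)
print(d.record_write(0x4000, 5))  # invalidates nodes 1 and 9
```

In a broadcast (snooping) scheme the write would instead have to be seen by every node, which is what limits snooping to small systems.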

Since NumaChip™ was originally designed in ASIC technology, the microarchitecture was designed with emphasis on programmability. The main programmable unit is the MPE (Microprogrammed Protocol Engine). The design contains several parallel MPEs, the number being fixed to customer requirements before the ASIC is manufactured, or more dynamically if the target implementation vehicle is an FPGA. Each MPE contains a number of Multi Context Micro Sequencers (MCMs) that operate on Fully Associative Context Blocks (FACBs). The microarchitecture also contains a Block Transfer Engine (BTE), an RDMA-type functional unit that can be used directly from user level to enhance the performance of block data transfers between different memory locations. The BTE is typically used by byte transfer libraries (BTL) for message passing (MPI), by BLACS (Basic Linear Algebra Communication Subprograms) for ScaLAPACK, or by other customer-defined functions. Solutions based on the FPGA platform can also be configured to accommodate customer-defined functions for acceleration of specific algorithms.

Processor-agnostic microarchitecture

The microprogrammed microarchitecture allows a short time to solution when supporting a new coherence protocol. The main changes needed are to replace the physical layer above the Processor Interface Unit (PIU) that communicates with the CPU module and to modify the firmware microcode that controls the MPEs. The remaining parts of the design, which account for approximately 75% of the logic gates and on-chip memory, can remain unchanged.
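The user-level block transfer engine (BTE) mentioned above follows the usual RDMA pattern: software posts a descriptor (source, destination, length) and later checks for completion, rather than copying data word-by-word on the CPU. The descriptor layout and function names below are invented for illustration and are not the NumaChip programming interface:

```python
# Hypothetical sketch of driving a user-level block-transfer engine (BTE):
# post a descriptor, let the engine move the block, then check completion.
# All names and the descriptor format are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class BteDescriptor:
    src: bytearray
    dst: bytearray
    length: int
    done: bool = False

def bte_post(desc: BteDescriptor) -> None:
    """Model the engine performing the block copy and flagging completion."""
    desc.dst[:desc.length] = desc.src[:desc.length]
    desc.done = True

src = bytearray(b"remote-node-payload")
dst = bytearray(len(src))
desc = BteDescriptor(src=src, dst=dst, length=len(src))
bte_post(desc)
print(desc.done, bytes(dst))
```

Offloading the copy this way is what lets an MPI byte-transfer layer overlap communication with computation.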

NumaChip™ contains an on-chip switch to connect to other nodes in a NumaConnect system, eliminating the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions. Small systems can use one dimension, medium-sized systems two, and large systems all three to provide efficient and scalable connectivity between processors. The fabric routing is controlled through routing tables that are initialized by system software at the BIOS level, called the NumaConnect Bootloader.
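A common way to route on such a torus is dimension-order routing: correct one coordinate at a time, taking the shorter way around each ring. The sketch below computes hops directly for clarity, whereas a real system bakes the equivalent decisions into the routing tables the bootloader programs; sizes and names are illustrative:

```python
# Dimension-order routing on a torus: fix each axis in turn, stepping in
# whichever direction around the ring is shorter. This is why a torus needs
# no central switch -- every node controller can compute (or look up) the
# next hop locally.

def next_hop(cur, dst, dims):
    """One dimension-order step from cur toward dst on a torus of size dims."""
    cur = list(cur)
    for axis, size in enumerate(dims):
        if cur[axis] != dst[axis]:
            fwd = (dst[axis] - cur[axis]) % size   # hops going "up" the ring
            step = 1 if fwd <= size - fwd else -1  # take the shorter direction
            cur[axis] = (cur[axis] + step) % size
            return tuple(cur)
    return tuple(cur)  # already at the destination

# Route from (0, 0, 0) to (3, 1, 0) on a 4x4x4 torus; the wraparound link
# makes node 3 only one hop away in the first dimension.
pos, path = (0, 0, 0), []
while pos != (3, 1, 0):
    pos = next_hop(pos, (3, 1, 0), (4, 4, 4))
    path.append(pos)
print(path)  # [(3, 0, 0), (3, 1, 0)]
```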

The two- and three-dimensional torus topologies also have the advantage of built-in redundancy, as opposed to systems based on centralized switches, where the switch represents a single point of failure.

The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.