Numascale Node Controller Technology
Lowering the barrier for building x86-based scale-up systems
Scale-up Node Controller Technology
Numascale has developed a unique node controller hardware technology called NumaConnect™ to lower the barrier for system vendors to design scale-up systems. The architecture is based on a generic, global cache/memory coherency model.
The Numascale technology includes a modular chip microarchitecture that allows system vendors to choose the right configuration to support their performance requirements. The microarchitecture includes several parallel micro-coded memory transaction engines that support a large number of outstanding memory transactions, a memory controller for the cache and memory tags, a crossbar switch for the interconnect fabric, and a number of interconnect fabric link controllers. No external interconnect fabric switch is therefore needed: all switching is performed on-chip, and the interconnect fabric is implemented with wiring through a PCB backplane or cables. The entire NumaConnect™ design is implemented in a single chip called NumaChip™, which is either an FPGA or an ASIC depending on the system vendor’s requirements.
Architecture
Innovative and groundbreaking coherent shared memory technology
The architecture heritage of NumaConnect dates back to the development of the IEEE standard 1596, Scalable Coherent Interface (SCI). SCI was architected upon three major pillars: scalability, a global shared address space, and cache/memory coherence.
These principles led to the definition of a packet format that supports a global address space of 64 bits, with 16 bits to address 65,536 physical nodes, where each node can hold multiple processors. The remaining 48 bits allow each node to have 256 terabytes of memory, adding up to a maximum system addressing capacity of 16 exabytes (2^64 bytes). In that respect, the architects had the foresight to envision systems in the exascale range.
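The address arithmetic above can be checked with a short Python sketch. It assumes, for illustration only, that the 16-bit node number occupies the upper bits of the 64-bit address and the 48-bit offset the lower bits; the function names are hypothetical and are not part of any Numascale interface.

```python
from typing import Tuple

NODE_BITS = 16    # bits that select the physical node (from the text)
OFFSET_BITS = 48  # remaining bits of the 64-bit global address

def make_global_address(node_id: int, offset: int) -> int:
    """Combine a node number and a node-local offset into one 64-bit address.

    Assumption: the node number sits in the upper 16 bits.
    """
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in 16 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 48 bits")
    return (node_id << OFFSET_BITS) | offset

def split_global_address(addr: int) -> Tuple[int, int]:
    """Recover (node_id, offset) from a 64-bit global address."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

MAX_NODES = 1 << NODE_BITS               # 65,536 nodes
BYTES_PER_NODE = 1 << OFFSET_BITS        # 256 * 2**40 bytes per node
TOTAL_BYTES = MAX_NODES * BYTES_PER_NODE # 2**64 bytes in total

if __name__ == "__main__":
    addr = make_global_address(node_id=3, offset=0x1000)
    print(hex(addr), split_global_address(addr))
    print(MAX_NODES, BYTES_PER_NODE // 2**40, TOTAL_BYTES // 2**60)
```

The capacities quoted in the text correspond to binary prefixes: 2^48 bytes per node and 2^64 bytes for the whole system.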
The big differentiator for NumaConnect compared to other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single system image machines, which may contain thousands of processors. The architecture is commonly classified as ccNUMA or NUMA, but the interconnect system can alternatively be used as a low-latency clustering interconnect.
Shared memory machines offer a number of advantages that lead experts to regard the architecture as the holy grail of computing compared to clusters:
- Any processor can access any data location through direct load and store operations – easier programming, less code to write and debug
- Compilers can automatically exploit loop level parallelism – higher efficiency with less human effort
- System administration relates to a unified system as opposed to a large number of separate images in a cluster – less effort to maintain
- Resources can be mapped and used by any processor in the system – optimal use of resources in a single image operating system environment
These features are all available in high cost mainframe systems. The main catch is that these systems come with a price tag that is more than 10 times higher per CPU core compared with commodity servers based on the x86-64 architecture.
Since the design of the NumaChip™ was originally done in ASIC technology, the microarchitecture was designed with emphasis on programmability. The main programmable unit is the MPE (Microprogrammed Protocol Engine). The design contains several parallel MPEs; their number is chosen according to customer requirements before the ASIC is manufactured, or more dynamically if the target implementation vehicle is an FPGA. Each MPE contains a number of Multi Context Micro Sequencers (MCMs) that operate on Fully Associative Context Blocks (FACBs).
The microarchitecture also contains a Block Transfer Engine (BTE), an RDMA-type functional unit that can be used directly from user level to enhance the performance of block data transfers between different memory locations. The BTE is typically used by Byte Transfer Libraries (BTL) for message passing (MPI), by BLACS (Basic Linear Algebra Communication Subprograms) for ScaLAPACK, or by other customer-defined functions. Solutions based on the FPGA platform can also be configured to accommodate customer-defined functions for acceleration of specific algorithms.
Processor agnostic microarchitecture
The microprogrammed microarchitecture allows a short time to solution when supporting a new coherence protocol. The main changes needed to support a new coherence protocol are to change the physical layer above the Processor Interface Unit (PIU) that communicates with the CPU module and to modify the firmware microcode that controls the MPEs. The remaining parts of the design, which account for approximately 75% of the logic gates and on-chip memory, can remain unchanged.
NumaChip contains an on-chip switch that connects to other nodes in a NumaConnect system and eliminates the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions. Small systems can use one dimension, medium-sized systems two, and large systems all three to provide efficient and scalable connectivity between processors. The fabric routing is controlled through routing tables that are initialized by system software at the BIOS level. This software is called the NumaConnect Bootloader.
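To illustrate why larger systems benefit from more dimensions, the following Python sketch computes hop counts in a torus of one, two or three dimensions. It is a topological model only: the functions are hypothetical, and it does not describe the actual NumaConnect routing tables or the Bootloader, which the text does not specify.

```python
def ring_distance(a: int, b: int, size: int) -> int:
    """Shortest distance between two positions on a ring of `size` nodes."""
    d = abs(a - b) % size
    return min(d, size - d)

def torus_hops(src, dst, dims) -> int:
    """Minimum hop count between two nodes, summing the ring distance per dimension."""
    return sum(ring_distance(s, d, n) for s, d, n in zip(src, dst, dims))

def torus_diameter(dims) -> int:
    """Largest minimum hop count between any two nodes of the torus."""
    return sum(n // 2 for n in dims)

def neighbors(node, dims):
    """Directly linked nodes: one step forward and backward in each dimension."""
    result = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            result.add(tuple(coord))
    result.discard(tuple(node))  # a dimension of size 1 would map back to the node itself
    return sorted(result)

if __name__ == "__main__":
    # The same 64 nodes arranged in one, two and three dimensions.
    for dims in [(64,), (8, 8), (4, 4, 4)]:
        print(dims, "diameter:", torus_diameter(dims))
    print(neighbors((0, 0), (4, 4)))
```

For the same number of nodes, adding dimensions shortens the worst-case path, which is consistent with using one dimension for small systems and three for large ones.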
The two- and three-dimensional topologies (called Torus) also have the advantage of built-in redundancy as opposed to systems based on centralized switches, where the switch represents a single point of failure.
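The redundancy argument can be sketched in Python with a breadth-first search over a small 2D torus in which one link is marked as failed. This only shows that an alternative path exists in the topology; how or whether the routing tables are updated after a failure is not covered by the text, and all names here are hypothetical.

```python
from collections import deque
from itertools import product

def build_torus(dims):
    """Adjacency map of a torus: each node links to its successor in every dimension."""
    adj = {node: set() for node in product(*(range(n) for n in dims))}
    for node in list(adj):
        for axis, size in enumerate(dims):
            nxt = list(node)
            nxt[axis] = (nxt[axis] + 1) % size
            nxt = tuple(nxt)
            if nxt != node:
                adj[node].add(nxt)
                adj[nxt].add(node)
    return adj

def hops(adj, src, dst, failed_links=frozenset()):
    """Breadth-first hop count that skips failed links; None if dst is unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt in seen or frozenset((node, nxt)) in failed_links:
                continue
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

if __name__ == "__main__":
    adj = build_torus((4, 4))
    broken = {frozenset(((0, 0), (1, 0)))}  # the direct link between two neighbors fails
    print(hops(adj, (0, 0), (1, 0)))          # direct path
    print(hops(adj, (0, 0), (1, 0), broken))  # detour around the failed link
```

In this model the two nodes stay connected after the link failure, at the cost of a longer path.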
The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.