Numascale GPU Systems - Scalability Running a Single Image OS
Scalable GPU Solutions
Adding standard servers scales your GPU system within a single-image operating system and environment with scalable, shared memory, opening the way to better utilization of the GPUs' computing power. The benefits of this solution are easy expansion with standard components, and simple operation and programming with shared memory and resources under a single OS.
The standard solution for large GPU systems is to cluster hosts, each holding a limited number of GPUs, with Ethernet or InfiniBand technology. As in all clusters, a complex message-passing software paradigm has to be used, and the cluster suffers from the limited size of its individual nodes. Separate copies of the OS on each server complicate operation. Another approach is to mount the GPUs in large PCIe expansion systems in high-end servers or in separate cabinets; this solution quickly hits scalability limits imposed by the bus system or server communication.
Based on its unique NumaConnect technology, Numascale offers a scalable and expandable solution: adding standard servers scales your GPU system within a single-image operating system environment with scalable shared memory.
The big differentiator for NumaConnect compared to other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device, such as a GPU, in a multiprocessor system with a high degree of efficiency. This provides scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single-system-image machines containing thousands of processors. NumaConnect also supports low-latency MPI.
System administration deals with one unified system instead of a large number of separate images in a cluster - less effort to maintain.
Resources can be mapped and used by any processor in the system - optimal use of resources in a standard OS environment.
Process scheduling is synchronized through a single, real-time clock - avoids serialization of scheduling associated with asynchronous operating systems in a cluster and the corresponding loss of efficiency.
Recommended Server Options
IBM x3755 Numaserver
Numascale 1U or 2U GPU System AIC Server
One Single Operating System
A NumaConnect shared memory system is operated by a single image operating system. This reduces the effort for maintaining operating system software and applications and leaves more of the combined memory space available for applications.
ccNUMA and NUMA low-latency shared memory interconnect
Virtualizes Everything, Including Memory and IO
>10x price/performance benefit over proprietary solutions
Seamless Scaling of Application Size and Performance - NO Porting Efforts
Scalable, Cache Coherent, Shared Memory System Interconnect
AMD processor nodes with Coherent HyperTransport
Based on field proven design
Enables commodity cost level for high-end servers
Converts between snoop-based (broadcast) and directory based coherency protocols
Write-back to Remote Cache
Non-coherent transactions (for optimized MPI)
Pipelined memory access (16 outstanding transactions + 16 non-coherent)
Remote Cache size up to 4GBytes (remote data)
NumaConnect RAS Features
ECC for single bit correction and double bit detection
Automatic scrubbing after single bit error detection
Automatic background scrubbing to minimize probability of soft error accumulation
Flexible micro-coded coherence processing engine
Watch-bus for internal activity observation in real-time
Built-in Performance Counters
Bandwidth to the node-local CPUs
- 1 cHT link (16+16 bits) @ 800MHz DDR = 6.4GB/s over proprietary HT cable
Latency for remote accesses
- Short time in node by-pass FIFOs
- Few "hops" for average access patterns
- Only one or two dimension-switch delays worst-case for 2-D or 3-D torus topologies
Link Speed and capacity
- 4-lane SerDes, 4Gb/s per lane => 16Gb/s per link; 6 links => 96Gb/s raw, ≈9.6GBytes/s effective per direction, x2 directions => 19.2GBytes/s
- Net average throughput on a ring is ≈1.4 times the unidirectional link speed with random access patterns, less link and fabric overhead; total for 6 links => 26.9GBytes/s (multiple senders can be active simultaneously)
Remote Cache (RMC)
- 2 or 4 GBytes per node, configurable
- System performance is expected to depend more on large cache size than on fast access time => use of DRAM
- RMC access time will be close to neighbor CPU node-local memory access time
- 12 bits Node ID = 4k nodes max. (Multiple sockets per node possible)
- 48 bits address (256 Terabytes)
- Local Node address range:
• N323-22 56 GigaBytes
• N323-44 112 GigaBytes
• N323-48 240 GigaBytes