Scalable IBM GPU Systems


→Download the IBM Scalable GPU Solutions Flyer


IBM Numaserver


→Download the IBM Numaserver Flyer



x3755Two server


→Download the IBM x3755 Flyer

Numascale GPU Systems - Scalability Running a Single Image OS

Scalable GPU Solutions

Adding standard servers scales your GPU system within a single-image operating system environment with scalable shared memory, opening the way to better utilization of the GPUs' computing power. The benefits of this solution are easy expansion with standard components and simple operation and programming, with shared memory and resources running under a single OS.


The standard solution for large GPU systems is to cluster hosts, each with a limited number of GPUs, over Ethernet or InfiniBand. As in all clusters, a complex message-passing software paradigm has to be used, and the cluster suffers from the limited size of its nodes. An individual copy of the OS on each server complicates operation. Another approach is to mount the GPUs in large PCIe systems in high-end servers or in separate cabinets. This solution easily hits scalability limits imposed by the bus system or server communication.

Based on its unique NumaConnect technology, Numascale offers a scalable and expandable solution. Adding standard servers scales your GPU system within a single-image operating system environment with scalable shared memory.

The big differentiator for NumaConnect compared to other high-speed interconnect technologies is its shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory-mapped I/O device, such as a GPU, in a multiprocessor system with a high degree of efficiency. NumaConnect provides scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single system image machines, which may contain thousands of processors. NumaConnect also supports low-latency MPI.

System administration deals with one unified system rather than a large number of separate images in a cluster - less effort to maintain.

Resources can be mapped and used by any processor in the system - optimal use of resources in a standard OS environment.

Process scheduling is synchronized through a single, real-time clock - avoids serialization of scheduling associated with asynchronous operating systems in a cluster and the corresponding loss of efficiency.


Recommended Server Options

IBM x3755 Numaserver

Numascale 1U or 2U GPU System AIC Server

One Single Operating System


A NumaConnect shared memory system is operated by a single image operating system. This reduces the effort for maintaining operating system software and applications and leaves more of the combined memory space available for applications.



NumaConnect Adapter Card N323



NumaConnect Essentials

ccNUMA and NUMA low-latency shared memory interconnect

Virtualizes Everything, Including Memory and IO

>10x price/performance benefit over proprietary solutions

Seamless Scaling of Application Size and Performance - NO Porting Efforts

Scalable, Cache Coherent, Shared Memory System Interconnect
AMD processor nodes with Coherent HyperTransport
Based on field proven design
Enables commodity cost level for high-end servers


NumaConnect Features

Converts between snoop-based (broadcast) and directory based coherency protocols

Write-back to Remote Cache

Non-coherent transactions (for optimized MPI)

Pipelined memory access (16 outstanding transactions + 16 non-coherent)

Remote Cache size up to 4GBytes (remote data)


NumaConnect RAS Features

ECC for single bit correction and double bit detection

Automatic scrubbing after single bit error detection

Automatic background scrubbing to minimize probability of soft error accumulation

Flexible micro-coded coherence processing engine

Watch-bus for internal activity observation in real-time

Built-in Performance Counters


NumaConnect Specifications

Bandwidth to the node-local CPUs
- 1 cHT link (16+16) @800MHz DDR = 6.4GB/s over HT Proprietary Cable
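
The 6.4GB/s figure can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that "16+16" means 16 bits in each direction and that the quoted total counts both directions together; the function and constant names are illustrative only.

```python
# Worked arithmetic for the coherent HyperTransport (cHT) link figure.
# Assumption: "16+16" = 16 data bits per direction, total = both directions.
LINK_WIDTH_BITS = 16       # bits per direction
CLOCK_HZ = 800e6           # 800 MHz link clock
TRANSFERS_PER_CLOCK = 2    # DDR: one transfer on each clock edge


def cht_bandwidth_gbytes(width_bits=LINK_WIDTH_BITS,
                         clock_hz=CLOCK_HZ,
                         transfers_per_clock=TRANSFERS_PER_CLOCK):
    """Return (per_direction, both_directions) bandwidth in GB/s."""
    bits_per_second = width_bits * clock_hz * transfers_per_clock
    per_direction = bits_per_second / 8 / 1e9   # bits -> bytes -> GB
    return per_direction, 2 * per_direction


per_dir, total = cht_bandwidth_gbytes()
print(f"{per_dir:.1f} GB/s per direction, {total:.1f} GB/s total")
```

Under these assumptions the link carries 3.2GB/s in each direction, which adds up to the 6.4GB/s quoted above.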


Latency for remote accesses
- Short time in node by-pass FIFOs
- Few "hops" for average access patterns
- Only one or two dimension-switch delays in the worst case for 2-D or 3-D torus topologies
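
The hop and dimension-switch counts above can be illustrated with a small model. This is a sketch under stated assumptions, not a description of the actual NumaConnect routing: it assumes dimension-order routing (finish one ring before moving to the next) and counts a "dimension switch" as each change from one ring to another. The helper names `ring_distance` and `torus_route_cost` are hypothetical.

```python
def ring_distance(a, b, size):
    """Shortest hop count between positions a and b on a ring of `size` nodes."""
    d = abs(a - b) % size
    return min(d, size - d)


def torus_route_cost(src, dst, dims):
    """Return (hops, dimension_switches) for a route from src to dst.

    src, dst: coordinate tuples; dims: ring size in each dimension.
    Assumes dimension-order routing, so a route changes rings at most
    (number of dimensions - 1) times.
    """
    per_dim = [ring_distance(s, d, n) for s, d, n in zip(src, dst, dims)]
    hops = sum(per_dim)
    dims_used = sum(1 for h in per_dim if h > 0)
    switches = max(dims_used - 1, 0)
    return hops, switches


# Worst-case corner-to-opposite-corner routes in small example tori.
print(torus_route_cost((0, 0), (2, 2), (4, 4)))           # 2-D torus
print(torus_route_cost((0, 0, 0), (2, 2, 2), (4, 4, 4)))  # 3-D torus
```

In this model a 2-D torus route changes rings at most once and a 3-D torus route at most twice, which matches the "one or two dimension switch delays" stated above.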

Link Speed and capacity
- 4-lane SerDes at 4Gb/s per lane, 6 links => 96Gb/s = 9.6GBytes/s x2 => 19.2GBytes/s
- Net average throughput on a ring is ≈1.4 times the unidirectional link speed with random access patterns, less link and fabric overhead; total for 6 links => 26.9GBytes/s (multiple senders can be active simultaneously)
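
The link figures can be checked with the short calculation below. It is a sketch that rests on two assumptions not stated in the text: that the 96Gb/s to 9.6GBytes/s conversion reflects 10 line bits per payload byte (as with 8b/10b encoding), and that "x2" counts both directions. The constant names are illustrative.

```python
# Worked arithmetic for the NumaConnect fabric link figures.
LANES_PER_LINK = 4
GBITS_PER_LANE = 4
LINKS = 6
LINE_BITS_PER_BYTE = 10   # assumption: consistent with 96 Gb/s = 9.6 GB/s
DIRECTIONS = 2            # assumption: "x2" means both directions
RING_GAIN = 1.4           # average ring throughput factor quoted in the text

raw_gbits = LANES_PER_LINK * GBITS_PER_LANE * LINKS   # 96 Gb/s
one_way_gbytes = raw_gbits / LINE_BITS_PER_BYTE       # 9.6 GB/s
two_way_gbytes = one_way_gbytes * DIRECTIONS          # 19.2 GB/s
net_gbytes = two_way_gbytes * RING_GAIN               # about 26.9 GB/s

print(f"raw: {raw_gbits} Gb/s")
print(f"one way: {one_way_gbytes:.1f} GB/s, both ways: {two_way_gbytes:.1f} GB/s")
print(f"net average over 6 links: {net_gbytes:.1f} GB/s")
```

Applying the 1.4 ring factor to the 19.2GBytes/s total reproduces the 26.9GBytes/s figure; the link and fabric overhead mentioned above is not modeled here.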


Remote Cache (RMC)
- 2 or 4 GBytes per node, configurable
- System performance is expected to depend more on large size than on fast access time => use of DRAM
- RMC access time will be close to neighbor CPU node-local memory access time

Address Range
- 12 bits Node ID = 4k nodes max. (Multiple sockets per node possible)
- 48 bits address (256 Terabytes)

- Local Node address range:

• N323-22 56 GigaBytes

• N323-44 112 GigaBytes

• N323-48 240 GigaBytes
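
The node and address limits follow directly from the bit widths. A minimal sketch of the arithmetic, assuming "Terabytes" here means binary units (2^40 bytes); how the Node ID relates to the 48 address bits is not stated in the text and is not modeled.

```python
# Worked arithmetic for the address range figures.
NODE_ID_BITS = 12
ADDRESS_BITS = 48

max_nodes = 2 ** NODE_ID_BITS            # number of distinct node IDs
total_bytes = 2 ** ADDRESS_BITS          # size of the address space
total_tib = total_bytes // 2 ** 40       # expressed in binary terabytes

print(f"max nodes: {max_nodes}")
print(f"address space: {total_tib} TiB ({total_bytes} bytes)")
```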


The Change is on

From Cluster
To Scalable ccNUMA SMP

Solution Key Features

Nvidia, AMD or other

3TB DRAM in GPU Server Nodes

Unlimited Shared System Memory

2D or 3D NumaConnect Fabric

Performance and Cost Scaling


IBM x3755 Numaserver Features

IBM x3755 M3 server
2U Rack Mount
1 GPU - Single Slot
2-3 AMD 63xx CPU
8 to 48 cores
Up to 192GB DRAM
NumaConnect Adapter
1-3 x 1100W Power Supplies
Triple Hot-Swap Power
Up to 8 hot-swap SAS/SATA
Optional up to 6 simple-swap SATA
Optional RAID Controller

1U AIC Server Features

1 GPU - Single Slot
AMD 43xx CPU
8 cores
Up to 96GB DRAM
NumaConnect Adapter
1 Power supply 650W
2 SAS/SATA Disk Drive Bays

2U AIC Server Features

1-2 GPUs - Dual or Single Slot
AMD 43xx CPU
8 cores
Up to 96GB DRAM
NumaConnect Adapter
2 Hot-swap power supplies @650W
2 SAS/SATA Disk Drive Bays

Cache Coherence - ccNuma - Clusters - Coherent - Directory Based Cache Coherence - Hypertransport - InfiniBand - Numa - NumaChip - Numascale - Snooping