NumaChip Essentials

Scalable, Cache Coherent, Shared Memory System Interconnect
AMD processor nodes with Coherent HyperTransport
Based on a field-proven design
Enables commodity cost level for high-end servers


NumaChip Features

Converts between snoop-based (broadcast) and directory-based coherency protocols

Write-back to Remote Cache

Non-coherent transactions (for optimized MPI)

Pipelined memory access (16 outstanding transactions + 16 non-coherent)

Remote Cache size up to 16GBytes (remote data)


NumaChip RAS Features

ECC for single bit correction and double bit detection

Automatic scrubbing after single bit error detection

Automatic background scrubbing to minimize probability of soft error accumulation

Flexible micro-coded coherence processing engine

Watch-bus for observing internal activity in real time

Built-in Performance Counters
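
The single-bit-correction, double-bit-detection behaviour listed under the ECC feature can be illustrated with a textbook extended Hamming (8,4) code. This is a minimal sketch for illustration only: the source does not say which ECC code or word width NumaChip uses, and the function names `encode` and `decode` are hypothetical.

```python
# Illustrative SECDED sketch: extended Hamming (8,4) on a 4-bit value.
# Assumption: this is a textbook code, not NumaChip's actual ECC scheme.

def encode(nibble):
    """Return an 8-bit codeword as a list; index 0 is the overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                 # makes total parity even
    return code

def decode(code):
    """Return (nibble, status); nibble is None on an uncorrectable error."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of positions holding a 1
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                     # single-bit error: flip it back
        status = "corrected"
    else:
        return None, "double-error"             # detected but not correctable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                                # inject one bit flip
    print(decode(word))
    word[2] ^= 1                                # inject a second bit flip
    print(decode(word))
```

A scrubber in this model would rewrite the corrected word after a "corrected" result, so that a later second flip in the same word does not turn into an uncorrectable double error.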


NumaChip Specifications

Bandwidth to the node-local CPUs
- 1 cHT link (16+16) @800MHz DDR = 6.4GB/s over HTX
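
The 6.4GB/s figure follows from the link parameters. A minimal sketch of the arithmetic, assuming "(16+16)" means 16 bits in each direction and that the quoted figure is the sum of both directions; the function and parameter names are illustrative only:

```python
def cht_bandwidth_gbytes(width_bits=16, clock_mhz=800, directions=2):
    """Aggregate bandwidth of one cHT link in GB/s (10^9 bytes per second)."""
    transfers_per_second = clock_mhz * 1_000_000 * 2   # DDR: two transfers per clock
    bytes_per_transfer = width_bits // 8               # 16 bits = 2 bytes
    bytes_per_second = transfers_per_second * bytes_per_transfer * directions
    return bytes_per_second / 1e9

if __name__ == "__main__":
    print(cht_bandwidth_gbytes())               # both directions
    print(cht_bandwidth_gbytes(directions=1))   # one direction only
```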


Latency for remote accesses
- Short transit time in node bypass FIFOs
- Few "hops" for average access patterns
- At most one or two dimension-switch delays in the worst case for 2-D or 3-D torus topologies

Link Speed and capacity
- 4-lane SerDes at 4Gb/s per lane, 6 links => 96Gb/s = 9.6GBytes/s x2 => 19.2GBytes/s
- Average throughput on a ring is ≈1.6 times unidirectional link speed with random access patterns, total for 6 links => 30.7GBytes/s (multiple senders can be active simultaneously)
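
The link figures above can be reproduced step by step. A minimal sketch, with two assumptions that the source does not state explicitly: 96Gb/s is obtained as 4 lanes x 4Gb/s x 6 links, and the conversion 96Gb/s = 9.6GBytes/s uses 10 wire bits per byte (as with an 8b/10b line code). The x2 and 1.6 factors are taken from the text as given; all names are illustrative.

```python
LANES_PER_LINK = 4
GBITS_PER_LANE = 4
LINKS = 6
WIRE_BITS_PER_BYTE = 10   # assumption: implied by 96Gb/s = 9.6GBytes/s
DOUBLING_FACTOR = 2       # the "x2" step in the source
RING_FACTOR = 1.6         # average ring throughput factor quoted in the source

def link_budget():
    """Return (raw Gb/s, GBytes/s, doubled GBytes/s, ring-adjusted GBytes/s)."""
    raw_gbits = LANES_PER_LINK * GBITS_PER_LANE * LINKS
    gbytes = raw_gbits / WIRE_BITS_PER_BYTE
    doubled = gbytes * DOUBLING_FACTOR
    ring_total = doubled * RING_FACTOR
    return raw_gbits, gbytes, doubled, ring_total

if __name__ == "__main__":
    raw, gbytes, doubled, ring_total = link_budget()
    print(f"{raw} Gb/s raw, {gbytes:.1f} GBytes/s, "
          f"{doubled:.1f} GBytes/s doubled, {ring_total:.1f} GBytes/s on rings")
```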


Remote Cache (RMC)
- Up to 8GBytes per Node
- System performance is expected to depend more on large cache size than on fast access time, hence the use of DRAM
- RMC access time will be close to the access time of a neighbor CPU's node-local memory

Address Range
- 12-bit Node ID = 4096 nodes max. (multiple sockets per node possible)
- 48-bit address (256 Terabytes)
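
Both limits are plain powers of two. A short sketch of the arithmetic, assuming "Terabytes" is meant in the binary sense (2^40 bytes); the helper names are illustrative only:

```python
NODE_ID_BITS = 12
ADDRESS_BITS = 48

def max_nodes(bits=NODE_ID_BITS):
    """Number of distinct node IDs representable in the given bit width."""
    return 1 << bits

def address_space_tib(bits=ADDRESS_BITS):
    """Addressable bytes expressed in units of 2^40 bytes."""
    return (1 << bits) // (1 << 40)

if __name__ == "__main__":
    print(max_nodes())           # distinct node IDs
    print(address_space_tib())   # addressable space in 2^40-byte units
```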

The Change is on

From Cluster to Scalable ccNUMA SMP
[Figure: shared memory with ccNUMA by NumaChip]







