NumaConnect SMP Adapter Card
Scalable Cache Coherent Shared Memory on your Cluster Budget
Numascale's SMP Adapter is an HTX card for commodity servers whose AMD processors expose their HyperTransport interconnect through an HTX connector.

Highlights
Scalable, Directory Based Cache Coherence Protocol
Write-back cache for Remote Data: 2-4-8-(16)GB options, standard SDIMMs
ECC protected with background scrubbing of soft errors
16 coherent + 16 non-coherent outstanding memory transactions
Support for single-image or multi-image OS partitions
3-way on-chip distributed switching for 1D, 2D or 3D Torus topologies
30GB/s switching capacity per node
HTX connected - 6.4GB/s
<20W power dissipation
NumaConnect OS Support
Linux, Windows Server, Solaris, Unix
NumaConnect Essentials
ccNuma and Numa low latency shared memory interconnect
Virtualizes Everything, Including Memory and IO
>10x price/performance benefit over proprietary solutions
Seamless Scaling of Application Size and Performance - NO Porting Efforts
Scalable, Cache Coherent, Shared Memory System Interconnect
AMD processor nodes with Coherent HyperTransport
Based on field proven design
Enables commodity cost level for high-end servers
NumaConnect Features
Converts between snoop-based (broadcast) and directory based coherency protocols
Write-back to Remote Cache
Non-coherent transactions (for optimized MPI)
Pipelined memory access (16 outstanding transactions + 16 non-coherent)
Remote Cache size up to 16GBytes (remote data)
NumaConnect RAS Features
ECC for single bit correction and double bit detection
Automatic scrubbing after single bit error detection
Automatic background scrubbing to minimize probability of soft error accumulation
Flexible micro-coded coherence processing engine
Watch-bus for internal activity observation in real-time
Built-in Performance Counters
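The "single bit correction and double bit detection" behaviour listed above can be illustrated with a toy code. The sketch below uses an extended Hamming (8,4) code, chosen only because it is small enough to read. The datasheet does not say which ECC code or word width the card uses, so the function names `encode` and `decode` and the bit layout are assumptions for illustration.

```python
# Toy SECDED example: extended Hamming (8,4).
# Assumption: this is NOT the card's actual ECC code, only an illustration
# of single-error correction plus double-error detection.

def encode(data):
    """Encode a 4-bit value into 8 bits (index 0 is the overall parity bit)."""
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(data >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2             # makes total parity even
    return bits

def decode(bits):
    """Return (data, status); data is None when a double error is detected."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos                 # XOR of set positions is 0 if valid
    overall_odd = sum(bits) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        bits[syndrome] ^= 1                 # syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "double-bit error detected"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status

word = encode(0b1011)
word[6] ^= 1                                # inject a single soft error
print(decode(word))
```

Scrubbing, as described in the list, amounts to running such a decode over memory in the background and writing corrected words back before a second error can accumulate in the same word.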
NumaConnect Specifications
Bandwidth to the node-local CPUs
- 1 cHT link (16+16) @800MHz DDR = 6.4GB/s over HTX
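The 6.4GB/s figure follows from the link parameters quoted above. A minimal sketch of the arithmetic, assuming "(16+16)" means a 16-bit path in each direction and that the total counts both directions; the constant and function names are illustrative only.

```python
# Arithmetic behind the HTX bandwidth figure.
# Assumption: "(16+16)" is 16 bits in each direction, summed over both directions.
LINK_WIDTH_BITS = 16          # bits per direction
CLOCK_MHZ = 800               # link clock
TRANSFERS_PER_CLOCK = 2       # DDR: data on both clock edges
DIRECTIONS = 2                # one path in, one path out

def htx_bandwidth_gbytes():
    """Aggregate HTX bandwidth in GB/s (10^9 bytes per second)."""
    mbits_per_direction = LINK_WIDTH_BITS * CLOCK_MHZ * TRANSFERS_PER_CLOCK
    total_mbits = mbits_per_direction * DIRECTIONS
    return total_mbits / 8 / 1000

print(htx_bandwidth_gbytes())  # 6.4
```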
Latency for remote accesses
- Short time in node by-pass FIFOs
- Few "hops" for average access patterns
- Worst case of only one or two dimension-switch delays for 2-D or 3-D Torus topologies
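The hop and dimension-switch claims can be checked with a small model of a torus. The sketch below assumes dimension-order routing (a packet finishes its travel in one dimension before turning into the next); the datasheet does not name the routing scheme, so that choice and all function names are assumptions.

```python
# Hop counts on a torus under an assumed dimension-order routing scheme.
from itertools import product

def ring_distance(a, b, size):
    """Shortest distance between two positions on a ring with wrap-around."""
    d = abs(a - b)
    return min(d, size - d)

def torus_hops(src, dst, dims):
    """Total link hops between two nodes: sum of per-dimension ring distances."""
    return sum(ring_distance(s, d, k) for s, d, k in zip(src, dst, dims))

def dimension_switches(src, dst):
    """Turns between dimensions: one fewer than the dimensions travelled in."""
    travelled = sum(1 for s, d in zip(src, dst) if s != d)
    return max(travelled - 1, 0)

def mean_hops(dims):
    """Average hop count from one node to every node (itself included)."""
    origin = (0,) * len(dims)
    nodes = list(product(*(range(k) for k in dims)))
    return sum(torus_hops(origin, n, dims) for n in nodes) / len(nodes)

print(mean_hops((4, 4, 4)))                        # 3.0 for a 64-node 3-D torus
print(dimension_switches((0, 0, 0), (3, 2, 1)))    # 2, the 3-D worst case
print(dimension_switches((0, 0), (1, 1)))          # 1, the 2-D worst case
```

Under this model a route in an n-dimensional torus turns at most n-1 times, which matches the "one or two" worst case quoted for 2-D and 3-D.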
Link Speed and capacity
- 4-lane SerDes at 4Gb/s per lane, 6 links => 96Gb/s = 9.6GBytes/s x2 => 19.2GB/s
- Average throughput on a ring is ≈1.6 times unidirectional link speed with random access patterns, total for 6 links => 30.7GBytes/s (multiple senders can be active simultaneously)
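The link figures chain together as simple arithmetic. The sketch below reproduces them under two assumptions that the datasheet does not state explicitly: that 96Gb/s maps to 9.6GBytes/s because each byte occupies 10 bits on the wire (as with 8b/10b line coding), and that the "x2" counts both directions of each link. All names are illustrative.

```python
# Arithmetic behind the link capacity figures.
LANES_PER_LINK = 4
GBITS_PER_LANE = 4
LINKS = 6
LINE_BITS_PER_BYTE = 10   # assumption: 10 wire bits per byte (e.g. 8b/10b coding)
DIRECTION_FACTOR = 2      # the "x2" in the spec, assumed to mean both directions
RING_FACTOR = 1.6         # average ring throughput factor quoted in the spec

def raw_gbits():
    """Raw signalling rate over all six links, one direction."""
    return LANES_PER_LINK * GBITS_PER_LANE * LINKS

def one_way_gbytes():
    return raw_gbits() / LINE_BITS_PER_BYTE

def two_way_gbytes():
    return one_way_gbytes() * DIRECTION_FACTOR

def switching_capacity_gbytes():
    return two_way_gbytes() * RING_FACTOR

print(raw_gbits())                            # 96
print(round(one_way_gbytes(), 2))             # 9.6
print(round(two_way_gbytes(), 2))             # 19.2
print(round(switching_capacity_gbytes(), 2))  # 30.72
```

The last value is the source of the rounded 30GB/s switching capacity quoted per node.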
Remote Cache (RMC)
- 2 or 4 GBytes per node, configurable
- System performance is expected to depend more on large size than on fast access time => use of DRAM
- RMC access time will be close to neighbor CPU node-local memory access time
Address Range
- 12 bits Node ID = 4k nodes max. (Multiple sockets per node possible)
- 48 bits address (256 Terabytes)
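Both address-range figures are powers of two, where "Terabytes" is read as 2^40 bytes. The sketch below works them out; the even per-node share is a hypothetical calculation added for scale, not a layout described in the datasheet.

```python
# Arithmetic behind the address range figures.
NODE_ID_BITS = 12
ADDRESS_BITS = 48

def max_nodes():
    return 1 << NODE_ID_BITS

def address_space_tib():
    """Addressable space in binary terabytes (2**40 bytes)."""
    return (1 << ADDRESS_BITS) / (1 << 40)

def even_share_gib():
    """Hypothetical: space per node if divided evenly over the maximum node count."""
    return (1 << ADDRESS_BITS) // max_nodes() / (1 << 30)

print(max_nodes())           # 4096
print(address_space_tib())   # 256.0
print(even_share_gib())      # 64.0
```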


