Multiprocessor systems with caches and a shared memory space must keep shared data coherent: the most recently written data to a memory location from one processor must be visible to the other processors in the system immediately after a synchronization point. The synchronization point is normally implemented as a barrier through a lock (semaphore), where the process that stores data releases the lock after finishing the store sequence. Once the lock is released, the most recent version of the data stored by the releasing process can be read by another process.

If there were no individual caches or buffers in the datapaths between the processors and the shared memory system, this would be a relatively trivial task. The presence of caches changes the picture dramatically. Even with so-called write-through caches, where all store operations proceed all the way to main memory, a processor wanting to read shared data could still hold copies of previous versions of that data in its cache, and a read operation would return a stale copy as seen from the producer process unless steps are taken to avoid the problem.

Early cache systems simply required a cache clear operation to be issued before reading the shared memory buffer; the reading process would then encounter cache misses and automatically fetch the most recent copy of the data from main memory. This could be done without too much loss of performance while the speed difference between memory, cache and processors was relatively small. As these differences grew and write-back caches were introduced, such coarse-grained coherency mechanisms were no longer adequate. A write-back cache only stores data back to main memory upon cache line replacement, i.e. when a cache line is evicted due to a cache miss of some sort.
To accommodate this functionality, a new cache line state, “dirty”, was introduced in addition to the traditional states that determine a hit or a miss.
Write-back caches were introduced to reduce the memory bandwidth required for store operations. Many programs perform repeated stores to the same locations; keeping the data in the cache and updating memory only when dirty lines are replaced reduces the memory bandwidth requirements substantially. For obvious reasons, namely the necessity of updating memory with the most recent version of the data when a cache line is replaced, the “dirty” state was necessary even for single-processor systems. This might more precisely be called memory coherence rather than cache coherence, but it is really just another side of the same story.
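The bandwidth saving from the “dirty” state can be illustrated with a minimal sketch (an illustrative model, not any specific hardware): stores mark the line dirty in the cache, and memory is only written when a dirty line is evicted.

```python
# Minimal write-back cache model with a "dirty" bit: repeated stores to
# the same address hit in the cache; memory traffic occurs only when a
# dirty line is evicted by a conflicting access.

class WriteBackCache:
    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = {}          # index -> (tag, dirty)
        self.memory_writes = 0   # write-backs actually sent to memory

    def store(self, addr):
        index, tag = addr % self.num_lines, addr // self.num_lines
        old = self.lines.get(index)
        if old is not None and old[0] != tag and old[1]:
            self.memory_writes += 1      # evicted line was dirty: write back
        self.lines[index] = (tag, True)  # line is now present and dirty

cache = WriteBackCache()
for _ in range(100):          # 100 stores to the same address...
    cache.store(0x10)
print(cache.memory_writes)    # -> 0: a write-through cache would do 100
```

A write-through cache would have sent all 100 stores to memory; here the traffic collapses to a single eventual write-back per dirty line.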
With write-back caches and multiple processors sharing the same memory, the cache coherence scheme grows in complexity. The main reason for the added complexity is the need to avoid efficiency loss from the huge discrepancy between main memory performance and processor speed. Multiprocessing systems with standard Intel or AMD processors used to be connected through a so-called front-side bus. Since all processors were connected to the bus, they could all listen in on all the transactions on the bus. These processors used a “snooping” scheme: depending on the state of their cache directory, they would snoop the information from the bus and update or invalidate particular cache lines to ensure that the most recent data would always be available to the processor that wanted to use it. This scheme still requires the same synchronization semantics as mentioned above, but there is no need to invalidate or update any part of the cache other than the line affected at any given time. Several schemes were invented to define the cache line states that would give the best overall performance. One such scheme is “MESI”, an abbreviation for “Modified – Exclusive – Shared – Invalid”; another is “MOESI”, which adds an “Owned” state. (For reference, see Wikipedia: http://en.wikipedia.org/wiki/MESI.)
The snooping protocols’ requirement that all caches listen to all transactions is a severe limitation on scalability. Much as a bus has scalability limits due to physical constraints on length and loads, a snooping scheme will very quickly swamp the communication channels between processors with snoop traffic as the number of processors grows. Since every processor needs to receive information from all the others, the snooping traffic grows proportionally to n².
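The n² claim is simple arithmetic: if each of n processors broadcasts its coherence transactions to the other n − 1, the total message count grows quadratically. A back-of-the-envelope sketch (the traffic figures are illustrative, not measurements):

```python
# With broadcast snooping, each of n processors sends every coherence
# transaction to the other n-1, so total snoop messages scale as ~n^2.

def snoop_messages(n, transactions_per_cpu=1):
    return n * transactions_per_cpu * (n - 1)

for n in (4, 16, 64):
    print(n, snoop_messages(n))
# 4 -> 12, 16 -> 240, 64 -> 4032:
# a 16x increase in processors gives a 336x increase in snoop traffic
```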
1.3 Directory-based cache coherence
Since sending information to all others (or having all listen to everything) is not a scalable solution, multiprocessing architects invented a different scheme. This dates back to the late 1980s and early 1990s, when several projects were established to solve this rather crucial problem for shared memory multiprocessing. One such project was “SCI”, for “Scalable Coherent Interface”, established as an IEEE standards committee with people from HP, Apple, Data General, Dolphin, the US Navy and several others. Another such project was the “DASH” project at Stanford. Eventually, the SCI approach was implemented by Dolphin and first used in the Convex Exemplar supercomputer. Later implementations of SCI were used in Sequent’s NUMA-Q (acquired by IBM) and Data General’s AViiON. The DASH project was carried over to form the basic architecture for SGI’s NUMAlink used in the Origin and Altix multiprocessor systems.
Common to both SCI and DASH was the directory-based cache coherence architecture. This is a very scalable approach because of the way the coherency information is distributed among the compute nodes. In a directory-based cache coherency scheme, each physical node has a directory that contains information about the state of the memory on that particular node. It also has an additional cache whose directory contains pointers to the nodes that share the data in a particular cache line. The pointers are organized as linked lists, limiting the amount of extra information in the directory to the size of one such pointer per cache line. The size of the pointer defines the maximum number of physical nodes that can be connected. For data that is shared, the pointer points to the next node in the linked list of nodes sharing that particular line. This means that information about a necessary cache state change only needs to be sent to the nodes that actually share the data, and not (as in a snooping scheme) to all other nodes. This drastically reduces the amount of cache coherency information that needs to be exchanged in the system, bringing the bandwidth requirements down to manageable levels. The efficiency of the scheme is closely connected with the behaviour of the application (as with all cache-based solutions), and it enables efficient implementation of large shared memory systems. The expense is some overhead in the cache directories that store the linked list information. This overhead is relatively small (2-5%), and with the strong growth in memory density and the falling cost per stored bit, it is a fair price to pay for scalability.
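The sharing-list idea can be sketched in a few lines. This is a strongly simplified model of an SCI-style distributed directory (real SCI uses doubly linked lists and per-node state machines; node numbers and line addresses here are made up for illustration): the home directory keeps a head pointer per cache line, each sharer points to the next, and an invalidation walks only this list.

```python
# Simplified directory with linked-list sharing: invalidation messages go
# only to the nodes on the line's sharing list, not to every node in the
# system as broadcast snooping would require.

class Directory:
    def __init__(self):
        self.head = {}   # line -> first sharing node
        self.next = {}   # (line, node) -> next sharing node (or None)

    def add_sharer(self, line, node):
        self.next[(line, node)] = self.head.get(line)  # link in at the head
        self.head[line] = node

    def invalidate(self, line):
        """Walk the sharing list; return the nodes that got an invalidate."""
        invalidated, node = [], self.head.pop(line, None)
        while node is not None:
            invalidated.append(node)
            node = self.next.pop((line, node))
        return invalidated

d = Directory()
for n in (3, 7, 12):          # only 3 nodes of, say, a 256-node system share
    d.add_sharer(0x40, n)
print(d.invalidate(0x40))     # [12, 7, 3] -- 3 messages, not 255
```

The per-line storage cost is one pointer (plus state bits) regardless of system size, which is where the quoted 2-5% directory overhead comes from.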
1.4 NUMA or ccNUMA
The term Non-Uniform Memory Access means that memory accesses from a given processor in a multiprocessor system have different access times depending on the physical location of the memory. The "cc" stands for Cache Coherent, i.e. all memory references from all processors automatically return the latest updated data from any cache in the system.
Modern multiprocessor systems, where processors are connected through a hierarchy of interconnects (or buses), perform differently depending on the relative physical location of the processors and the memory being accessed. Most small multiprocessor systems used to be symmetric (SMP – Symmetric Multi-Processor), with a couple of processors hooked up to a main memory system through a central bus where all processors had equal priority and the same distance to memory. This changed with AMD’s introduction of HyperTransport™ and on-chip DRAM controllers. When more than one such processor chip (normally also with multiple CPU cores per chip) is connected, the criteria for being categorized as NUMA are fulfilled: reading or writing data from or to the local memory is quite a bit faster than performing the same operations on the memory controlled by the other processor chip.
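The performance consequence is a weighted average: the effective access time depends on how large a fraction of accesses hit local memory. The latency numbers below are assumptions for illustration, not measurements of any particular system.

```python
# Effective NUMA access time as a weighted average of local and remote
# latency; locality of memory allocation directly sets performance.

def effective_latency(local_ns, remote_ns, local_fraction):
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# Assume (for illustration) 80 ns to local DRAM, 130 ns to the other socket:
print(effective_latency(80, 130, 0.9))   # about 85 ns with 90% local accesses
print(effective_latency(80, 130, 0.5))   # 105.0 ns with naive 50/50 placement
```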
Seen from a programming point of view, the uniform, global shared address space is a great advantage since any operand or variable can be directly referenced from he program through a single load or store operation. This is both simple and incurs no overhead. In addition, the large physical memory capacity eliminates the need for data decomposition and the need for explicit message passing between processes. It is also of great importance that the programming model can be maintained throughout the whole range of system sizes from small desktop systems with a couple of processor cores to huge supercomputers with thousands of processors.
The only non-uniformity seen from the program is the difference in access time between different parts of the physical memory. Allocation of physical memory is handled by the operating system (OS), and modern OS kernels include features for optimizing the allocation of memory to match the scheduling of processors to execute the program.
If a programmer wishes to influence memory allocation and process scheduling, OS calls to pin down memory and reserve processors can be inserted in the code. This is of course less desirable than letting the OS handle it automatically, because it increases the effort required to port the application between different operating systems.
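As a hedged, Linux-only sketch of such a call: Python exposes the processor-reservation side through `os.sched_setaffinity` (memory pinning itself would need something like libnuma's `mbind`, not shown here). This illustrates exactly the portability cost mentioned above, since the call does not exist on other operating systems.

```python
# Linux-only: restrict the current process to a single CPU and restore
# the original mask afterwards. os.sched_setaffinity/getaffinity wrap
# the corresponding Linux syscalls and are unavailable elsewhere.
import os

allowed = os.sched_getaffinity(0)        # CPUs we may currently run on
os.sched_setaffinity(0, {min(allowed)})  # pin to one CPU
print(os.sched_getaffinity(0))           # now a single-element set
os.sched_setaffinity(0, allowed)         # restore the original mask
```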
In normal cases, with efficient caching of remote accesses, the automatic handling by the OS is sufficient to ensure high utilization of the processors even when the memory is physically scattered across the processing nodes.
Figure 1, Schematic Overview of Non-Uniform Memory Access versus Symmetrical Memory Access