DATE AND TIME: Tuesday November 18th, 5:15PM – 7:00PM
Scaling OpenMP Programs to Thousand Cores on the Numascale Architecture
Dirk Schmidl, Atle Vesterkjær
The downside of shared memory programming compared to message passing is that programs are limited to a single system. Numascale’s interconnect couples several standard servers in a cache-coherent way, which allows shared memory programming across the complete machine. However, this does not necessarily mean that shared memory programs deliver satisfactory performance on such a system. Here we investigate a Numascale machine with 1728 cores hosted at the University of Oslo. The system consists of 72 IBM x3755 M3 nodes coupled in a 3D torus network topology with Numascale’s interconnect. We measure the memory bandwidth with kernel benchmarks and then examine an application developed at the Institute of Combustion Technology at RWTH Aachen University. We describe all tuning steps done so far to optimize the application for large SMP machines like the Numascale machine and present good performance results for OpenMP runs with 1024 threads on the Oslo system.
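As an illustration of the kind of memory bandwidth kernel mentioned above, the following is a minimal sketch of a STREAM-style triad in C with OpenMP. It is not the benchmark used by the authors. The function name `triad_bandwidth_gbs`, the helper `wall_seconds`, and the array sizes are hypothetical, and the parallel initialization assumes that the operating system places memory pages on the node of the thread that first touches them, as Linux does by default.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds; uses the OpenMP timer when compiled with OpenMP. */
static double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/*
 * Hypothetical helper: runs a[i] = b[i] + s * c[i] over n doubles, reps times,
 * and returns the best observed bandwidth in GB/s (0.0 if the timer could not
 * resolve the run, -1.0 if allocation fails). *checksum receives
 * a[0] + a[n-1] so the caller can verify the result.
 */
double triad_bandwidth_gbs(size_t n, int reps, double *checksum) {
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Assumed first-touch placement: initialize with the same static
       schedule as the timed loop so each thread touches its own pages. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    const double s = 3.0;
    double best = 1e300;
    for (int r = 0; r < reps; r++) {
        double t0 = wall_seconds();
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++) {
            a[i] = b[i] + s * c[i];
        }
        double t = wall_seconds() - t0;
        if (t < best) best = t;
    }

    *checksum = a[0] + a[n - 1];
    free(a); free(b); free(c);

    if (best <= 0.0) return 0.0;
    /* Two arrays read and one written per iteration. */
    double bytes = 3.0 * (double)n * (double)sizeof(double);
    return bytes / best / 1e9;
}
```

Compiled with an OpenMP-capable compiler (for example `gcc -O2 -fopenmp`), the function can be called from a `main` that prints the returned value for increasing thread counts. Without the static-schedule initialization, all pages would typically end up on the node of the initializing thread, which is one common reason shared memory programs scale poorly on large NUMA machines.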