MCP: SeRC Exascale Simulation Software Initiative - SESSI
The SeRC Steering Group established a new flagship program on software for Exascale simulations. The program aims to improve the performance and scalability of selected software packages by establishing new collaborative projects for the exchange of expertise among research groups in the Molecular Simulation, FLOW and DPT SeRC communities.
SESSI meeting at the SeRC room in PDC.
Many e-Science applications are increasingly dependent on large-scale compute capabilities. A subset of these applications have developed into massively parallel codes in which a single simulation can use thousands to hundreds of thousands of cores. Those capabilities have in turn enabled researchers to address a wide range of new application areas that were simply not feasible a few years ago: computational fluid dynamics simulations of highly complex flows such as those around airplane wings, critical problems in high-energy physics, and molecular dynamics simulations of large biomolecular systems. The advances in these areas were made possible by major directed efforts in algorithm development and the application of modern software engineering techniques.
- Molecular Dynamics. The code in use is GROMACS (MOL community).
- Computational Fluid Dynamics. The code in use is Nek5000 (FLOW community).
Assembly-level vectorization of compute-intensive kernels in Nek5000
Nek5000 is a computational fluid dynamics solver based on the spectral element method. The core of the program consists of matrix-matrix multiplication routines, in which the program spends most of its time (more than 60% in a 2D version).
Currently, these routines are basic Fortran routines with nested loops that compute the matrix multiplications. The aim of this project is to enhance the routines using vectorization techniques such as SIMD instructions. The principle of a SIMD instruction is to apply one operation (for instance a multiplication or addition) to multiple values (e.g. 4 or 8) at once instead of a single value, which can considerably improve code performance.
The project is a collaboration with the developers of GROMACS from the Molecular Simulation community, a code in which SIMD operations are integrated and heavily used.
Project page: http://www.e-science.se/project/assembly-level-vectorization-compute-intensive-kernels-nek5000
Automation for profiling and code analysis in GROMACS
GROMACS is a high-performance and scalable code for molecular dynamics simulations, mainly used for studies of biomolecular systems. The codebase is highly tuned and takes advantage of several hybrid parallelization scenarios, including an internal thread-MPI implementation, MPI and OpenMP, as well as CUDA, OpenCL and SIMD acceleration.
Properly analyzing performance and scalability issues in such a codebase is a complex task. The current project aims to automate the use of performance tools such as Extrae within GROMACS, and thus shorten the time needed to identify critical performance issues.
People involved:
GPU acceleration & heterogeneous parallelism
(collaboration between Molecular Simulation community, EU project BioExcel and joint NVIDIA/KTH CUDA Research Center)
Project page: http://www.e-science.se/project/algorithms-molecular-dynamics-heterogeneous-architectures
Efficient MPI communication in NEK5000
The algorithms implemented in the CFD solver Nek5000 scale well to very large processor counts, as has been demonstrated in strong-scaling runs on more than one million cores. Optimized communication routines using the Message Passing Interface (MPI) contribute to this achievement.
Recent developments in computer architecture provide increasing core counts per processor as well as new capabilities in interconnect networks. This allows communication operations to be implemented with a high degree of parallelism. The aim of this activity is to exploit these hardware features in the communication operations of Nek5000 in order to further increase program efficiency.
The project is a collaboration between the FLOW and DPT communities.
Fine-grained task-level parallelism for Exascale systems
Experience with hybrid MPI+OpenMP+CUDA with GROMACS has shown many complex issues of balancing the load of communication and computation across processing units. Implementations for biomolecular MD simulation benefit greatly from spatial decomposition of non-bonded workload, but the total workload is not homogeneous in space, and this cripples performance at high parallelism. Fundamentally, this is because the MD algorithm has been made to work in parallel in a way that guarantees a series of synchronization and serialization points in every time step. This situation will get worse in the future, as compute nodes have more processing units, more kinds of processing units to address, and more leaks from the von Neumann machine abstraction. Making best use of data locality will be an ongoing challenge.
To tackle these problems, we are refactoring GROMACS to use much more fine-grained task parallelism. Tasks are mapped to suitable compute units in a way that is sensitive to the priority of the task, the hotness of the data in local memory, and the alternative tasks that are available. Fortunately, the overall compute task is well defined for a large series of time steps once neighbour searching is complete, so the large-scale DAG of data flow is static. Thus, the problem reduces to pre-organizing data flow so that all compute units can work efficiently upon the available task that is of highest priority.
An optimized FFT library for 3D real-valued small-size data
The use of the Fast Fourier Transform (FFT) is ubiquitous in computational science. Its uses range from solving partial differential equations to calculating convolutions and performing spectral analysis. While several FFT libraries are already available, we focus on developing and implementing an FFT library specifically designed to transform 3D real-valued small-size (<= 128 elements per dimension) data at maximum performance. For such cases, our goal is to outperform existing FFT libraries.
The FFT library most used by the scientific community is FFTW, an auto-tuned library developed at MIT in the 1990s. While FFTW has evolved over the decades, it still does not fully support all the relevant Single-Instruction Multiple-Data (SIMD) instruction sets, e.g. AVX, AVX-512, IBM VSX and ARM Neon, and it uses SIMD only within the calculation of one single FFT. Instead, our approach uses SIMD to solve several small (<= 128 elements) FFTs in parallel, across several SIMD instruction sets. We focus on 3D real-valued data because this case is highly relevant to the Particle Mesh Ewald (PME) solver in GROMACS.
The library is designed for rapid deployment in GROMACS for the PME calculation and will provide a Fortran interface for use in Nek5000.
People involved:
Stefano Markidis (PDC)
Vishnu Suresh Raju (PDC)
Xingjiang Yu (CST)
Mark Abraham (MolSim)
OpenACC for Nek5000
Overcoming I/O limitations on exascale architectures
Investigation of Communication Kernel in Nek5000
Runtime Profiling and Automation of Projections
Refactoring of Nek5000
Code optimization for on-node performance: SIMD and LIBXSMM for small dense matrix-matrix multiplications, and streaming stores for optimizing cache-to-memory operations