Scalable Performance Monitoring

Understanding how parallel applications behave is crucial for using HPC resources efficiently. In particular, exascale systems will be composed of heterogeneous architectures with multiple levels of concurrency and energy constraints. In such complex scenarios, performance monitoring and runtime systems will play a major role in achieving good application performance and scalability. SeRC researchers have developed techniques for online access to performance data, as well as efficient formats for storing that data.

Performance analysis is becoming increasingly difficult due to the growing complexity of scientific codes and the scale of the machines they run on. Although many tools have been developed over the past years to help with this task, current approaches either offer only an overview of the application, discarding temporal information, or generate huge trace files that are often difficult to handle.

To cope with these issues, we developed two sets of techniques within the SeRC OpCoReS project (http://www.e-science.se/project/opcores-optimized-component-runtime-system): techniques for online access to performance data, which can for instance be exploited by intelligent dynamic runtime systems, and novel trace compression techniques.

Online access to system and application performance data will be a necessity for deciding how to schedule resources and orchestrate computational elements such as processes, threads, and tasks. To gain access to this data, we developed the Performance Introspection API, an extension of the IPM tool that gives an application access to its own performance data while it runs. The technique is highly scalable and incurs only minimal overhead. We have tested the Performance Introspection API, for instance, to drive processor frequency scaling in order to reduce power consumption.
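To make the frequency-scaling use case more concrete, the following is a minimal sketch of how a runtime might act on performance data obtained online. It does not use the IPM or Performance Introspection API interfaces: the RegionStats record, the choose_frequency function, and the 0.5 threshold are all hypothetical names and values chosen for illustration, and the policy shown (lower the clock when a region is dominated by communication) is only one plausible strategy.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Hypothetical snapshot that a monitoring layer might expose for one code region."""
    wall_time: float  # seconds spent in the region so far
    mpi_time: float   # seconds of that time spent inside MPI calls


def choose_frequency(stats, frequencies_khz, comm_threshold=0.5):
    """Return the lowest available frequency if the region is communication-bound,
    otherwise the highest. The threshold is an illustrative assumption."""
    if not frequencies_khz:
        raise ValueError("need at least one available frequency")
    if stats.wall_time <= 0:
        # No measurements yet: stay at full speed.
        return max(frequencies_khz)
    comm_fraction = stats.mpi_time / stats.wall_time
    if comm_fraction >= comm_threshold:
        return min(frequencies_khz)
    return max(frequencies_khz)


if __name__ == "__main__":
    available = [1_200_000, 1_800_000, 2_400_000]
    # A region that spends 80% of its time in MPI is treated as communication-bound.
    print(choose_frequency(RegionStats(wall_time=10.0, mpi_time=8.0), available))
```

In a real deployment the statistics would come from the monitoring layer at runtime and the chosen frequency would be applied through a platform-specific mechanism; both parts are deliberately left out of this sketch.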

Event Flow Graphs are a new approach to monitoring MPI applications that combines the low overhead of profiling tools with the wealth of information provided by tracers. The graphs are captured with very low overhead, require orders of magnitude less storage than full traces, and can still be used to recover the complete sequence of events in the application.
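The idea of storing a graph instead of a flat trace, while still being able to reconstruct the event order, can be illustrated with a small sketch. This is a simplified illustration and not the encoding used in the cited work: here each node is an event name, and each node keeps a run-length-encoded list of the successors that followed it, in order. The function names build_event_flow_graph and replay are hypothetical.

```python
from collections import defaultdict


def build_event_flow_graph(events):
    """Return (start, graph). graph[node] is an ordered list of [successor, count]
    runs describing which event followed each visit to node."""
    graph = defaultdict(list)
    for cur, nxt in zip(events, events[1:]):
        runs = graph[cur]
        if runs and runs[-1][0] == nxt:
            runs[-1][1] += 1          # same successor as last time: extend the run
        else:
            runs.append([nxt, 1])     # different successor: start a new run
    start = events[0] if events else None
    return start, dict(graph)


def replay(start, graph):
    """Reconstruct the original event sequence from the graph."""
    if start is None:
        return []
    # Copy the runs so that replaying does not modify the graph.
    pending = {node: [list(run) for run in runs] for node, runs in graph.items()}
    position = {node: 0 for node in graph}
    sequence = [start]
    cur = start
    while cur in pending and position[cur] < len(pending[cur]):
        run = pending[cur][position[cur]]
        nxt = run[0]
        run[1] -= 1
        if run[1] == 0:
            position[cur] += 1        # this run is used up, move to the next one
        sequence.append(nxt)
        cur = nxt
    return sequence


if __name__ == "__main__":
    loop_body = ["MPI_Send", "MPI_Recv", "MPI_Allreduce"]
    trace = ["MPI_Init"] + loop_body * 1000 + ["MPI_Finalize"]
    start, graph = build_event_flow_graph(trace)
    stored_runs = sum(len(runs) for runs in graph.values())
    print(len(trace), "events stored as", stored_runs, "runs")
    print("lossless:", replay(start, graph) == trace)
```

For this regular loop the 3002-event trace is stored as only five runs, and replaying the graph yields the original sequence. Irregular control flow would produce more runs, so the savings depend on how repetitive the application's communication pattern is.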

References: 

• X. Aguilar, K. Fuerlinger and E. Laure. Online Performance Data Introspection with IPM. In 15th IEEE International Conference on High Performance Computing and Communications (HPCC 2013), Zhangjiajie, China, November 2013.

• X. Aguilar, K. Fuerlinger and E. Laure. Trace Compression and Replay using MPI Event Flow Graphs. In Euro-Par 2014, Porto, Portugal, August 2014.