Research Goals

In the second funding period we will focus on four main components. Our progress in numerical methods for higher-dimensional problems in HPC will enable novel approaches and algorithms to tackle the exascale challenges of scalability, resilience and load balancing. These newly developed algorithms will extend our software framework for computing higher-dimensional problems on massively parallel systems. The results of the first funding period have shown that the combination technique enables new approaches to algorithm-based fault tolerance, even for silent faults, without checkpoint-restart. We will therefore put a special focus on resilience on all levels of parallelization. The plasma turbulence code GENE will serve as a representative application for higher-dimensional problems with an inherent need for exascale resources.

Exa-challenges: Scalability

Demonstrate the scalability of our algorithms on a full supercomputer.

Exa-challenges: Load balancing

Develop new strategies to refine our load models at runtime.
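
To illustrate what refining a load model at runtime could look like, here is a minimal Python sketch. It is not the project's load model. The a priori cost estimate (grid points counted as 2^|l|_1 for a level vector l), the exponential-smoothing update and the greedy longest-processing-time-first assignment to process groups are all assumptions made for this example, as are the function names.

```python
import heapq


def initial_estimate(level):
    """Hypothetical a priori model: cost grows with the number of grid points, 2^|l|_1."""
    return float(2 ** sum(level))


def refine(estimates, level, measured, weight=0.5):
    """Blend a measured runtime into the current estimate (exponential smoothing)."""
    estimates[level] = (1.0 - weight) * estimates[level] + weight * measured


def assign(estimates, n_groups):
    """Greedy longest-processing-time-first assignment.

    Tasks are visited from most to least expensive; each one goes to the
    process group with the smallest accumulated load so far.
    """
    heap = [(0.0, group) for group in range(n_groups)]
    heapq.heapify(heap)
    assignment = {}
    for level in sorted(estimates, key=estimates.get, reverse=True):
        load, group = heapq.heappop(heap)
        assignment[level] = group
        heapq.heappush(heap, (load + estimates[level], group))
    return assignment


if __name__ == "__main__":
    levels = [(2, 2), (1, 2), (1, 1)]
    estimates = {level: initial_estimate(level) for level in levels}
    # Assumed measurement: grid (1, 1) took longer than the model predicted.
    refine(estimates, (1, 1), measured=10.0)
    print(assign(estimates, n_groups=2))
```

The point of the sketch is only the feedback loop: measured runtimes correct the model, and the corrected model drives the next assignment of component grids to process groups.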

Exa-challenges: Resilience

Run massively parallel simulations with the fault-tolerant combination technique (FTCT). To test our algorithms, we will simulate (hard) faults on HPC systems using our home-grown fault simulation layer.
Detect and correct errors caused by silent data corruption (SDC) with the combination technique.
Application-level resilience and fault-tolerant alternatives to standard MPI. We investigate libraries (e.g. ULFM) and techniques allowing the development of a fault-tolerant domain decomposition of parallel applications, targeting the GENE code.
Numerics-based approaches to resilience, e.g. the iterated combination technique or randomized subspace correction.
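
The idea behind algorithm-based fault tolerance with the combination technique can be sketched in a few lines of Python. The sketch assumes the common inclusion-exclusion rule for combination coefficients on a downward-closed set of level vectors, c_l = sum over z in {0,1}^d of (-1)^|z| * [l + z in the set], and it assumes that a lost component grid is handled by removing it (and every grid that dominates it) from the set and recomputing the coefficients. The function names are hypothetical and do not mirror the project's software.

```python
from itertools import product


def simplex_index_set(dim, n):
    """Level vectors l with l_i >= 1 and |l|_1 <= n + dim - 1."""
    return {
        level
        for level in product(range(1, n + 1), repeat=dim)
        if sum(level) <= n + dim - 1
    }


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed index set."""
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coefficients = {}
    for level in index_set:
        c = 0
        for shift in product((0, 1), repeat=dim):
            neighbour = tuple(l + s for l, s in zip(level, shift))
            if neighbour in index_set:
                c += (-1) ** sum(shift)
        if c != 0:  # grids with coefficient 0 do not enter the combination
            coefficients[level] = c
    return coefficients


def drop_failed(index_set, failed):
    """Remove a failed grid and all grids dominating it (keeps the set downward closed)."""
    return {
        level
        for level in index_set
        if not all(l >= f for l, f in zip(level, failed))
    }


if __name__ == "__main__":
    grids = simplex_index_set(dim=2, n=3)
    print(combination_coefficients(grids))
    # Assume the solution on grid (2, 2) was lost to a process failure.
    print(combination_coefficients(drop_failed(grids, (2, 2))))
```

In this two-dimensional example the regular combination uses grids (1,3), (2,2), (3,1) with coefficient +1 and (1,2), (2,1) with coefficient -1. After losing (2,2), the recomputed combination is (1,3) + (3,1) - (1,1). Note that the coarser grid (1,1) gains a nonzero coefficient, so its solution has to be available for the recovery to work without recomputation.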

Numerics of the combination technique

Apply the optimized combination technique (OptiCom) to larger and more complex simulation scenarios in GENE, e.g. global non-linear simulations.
Extend the theory for finite differences on sparse grids.
Extend the theory of the iterated combination technique with convergence theory and numerical studies of Vlasov-type problems.
Extend the theory of residual (subspace) correction schemes towards strongly formulated problems (adapt the finite-element-based iterated combination technique to finite-difference-type discretizations).
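
For orientation, the classical combination technique that these extensions build on can be written as follows. The indexing convention is an assumption made here (level vectors with entries at least 1, dimension d, target level n, and u_ℓ denoting the solution on the anisotropic full grid with level vector ℓ); other conventions shift the index n.

```latex
u_n^{(c)} \;=\; \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q}
  \sum_{|\boldsymbol{\ell}|_1 \,=\, n + (d-1) - q} u_{\boldsymbol{\ell}},
\qquad \boldsymbol{\ell} \in \mathbb{N}^d,\ \ell_i \ge 1 .
```

For d = 2 this reduces to the sum of all grid solutions with |ℓ|_1 = n + 1 minus the sum of those with |ℓ|_1 = n.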

Application and application software

Further extend our combination technique module of SG++, e.g. with algorithm-based fault tolerance and new algorithms for the iterated and optimized combination technique.
Develop a fault simulation software layer that emulates the behaviour of a fault-tolerant MPI implementation (e.g. ULFM) and allows us to simulate crashed processes. This allows us to carry out large-scale experiments with the fault-tolerant combination technique on a supercomputer using the system's native MPI implementation.
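
The behaviour such a layer has to emulate can be sketched without any MPI at all. The following Python toy model is only an illustration under stated assumptions: the class and method names are hypothetical, ranks are plain integers, and "crashing" a rank just marks it as failed while the real process would keep running. The two behaviours it mimics are that a collective operation reports an error once a participant has failed, and that survivors can build a smaller communicator, loosely analogous to what ULFM offers with MPIX_Comm_shrink.

```python
class ProcessFailedError(RuntimeError):
    """Raised when an operation involves a rank that was marked as crashed."""


class SimulatedComm:
    """Toy communicator that tracks which ranks are considered crashed."""

    def __init__(self, ranks):
        self.ranks = list(ranks)
        self.failed = set()

    def kill(self, rank):
        """Mark a rank as crashed; nothing is actually terminated."""
        if rank not in self.ranks:
            raise ValueError(f"rank {rank} is not part of this communicator")
        self.failed.add(rank)

    def allreduce_sum(self, contributions):
        """Sum one value per rank; fails if any member is marked as crashed."""
        if self.failed:
            raise ProcessFailedError(f"failed ranks: {sorted(self.failed)}")
        return sum(contributions[rank] for rank in self.ranks)

    def shrink(self):
        """Return a new communicator that contains only the surviving ranks."""
        return SimulatedComm(r for r in self.ranks if r not in self.failed)


if __name__ == "__main__":
    comm = SimulatedComm(range(4))
    comm.kill(2)  # simulated hard fault
    try:
        comm.allreduce_sum({r: 1 for r in comm.ranks})
    except ProcessFailedError as err:
        print("detected:", err)
        comm = comm.shrink()
    print(comm.ranks, comm.allreduce_sum({r: 1 for r in comm.ranks}))
```

Because faults are only simulated in this way, the application-level recovery logic can be exercised on top of a standard, non-fault-tolerant MPI installation.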

EXAHD

project in the German Priority Programme 1648 - Software for Exascale Computing

Contact

Dirk Pflüger dirk.pflueger@ipvs.uni-stuttgart.de