Research Goals
In the second funding period we will focus on four main components. Our progress in numerical methods for higher-dimensional problems in HPC will enable novel approaches and algorithms to tackle the exascale challenges of scalability, resilience and load balancing. These newly developed algorithms will extend our software framework for the computation of higher-dimensional problems on massively parallel systems. The results of the first funding period have shown that the combination technique enables new approaches to algorithm-based fault tolerance, even for silent faults, without checkpoint-restart. We will therefore put a special focus on resilience at all levels of parallelization. The plasma turbulence code GENE will serve as a representative application for higher-dimensional problems with an inherent need for exascale resources.
Exa-challenges: Scalability
- Demonstrate the scalability of our algorithms on a full supercomputer.
Exa-challenges: Load balancing
- New strategies to refine our load models at runtime.
Exa-challenges: Resilience
- Massively parallel simulations with the fault-tolerant combination technique (FTCT). To test our algorithms, we will simulate (hard) faults on HPC systems using our home-grown fault simulation layer.
- Detect and correct errors caused by silent data corruption (SDC) with the combination technique.
- Application-level resilience and fault-tolerant alternatives to standard MPI. We investigate libraries (e.g. ULFM) and techniques that allow the development of a fault-tolerant domain decomposition for parallel applications, targeting the GENE code.
- Numerics-based approaches to resilience, e.g. by the iterated combination technique or randomized subspace correction.
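The idea behind recombining after a hard fault can be illustrated with a minimal sketch. It assumes the standard inclusion-exclusion formula for combination coefficients on a downward-closed set of level vectors; the function names (`combination_coefficients`, `classical_index_set`, `drop_failed`) are hypothetical and are not taken from our software framework. When a component grid is lost, it is removed from the index set and the coefficients are recomputed, so the remaining grids still form a valid combination.

```python
from itertools import product


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of level vectors."""
    levels = set(index_set)
    dim = len(next(iter(levels)))
    coeffs = {}
    for l in levels:
        c = 0
        # Sum (-1)^|z| over all z in {0,1}^dim for which l + z is still in the set.
        for z in product((0, 1), repeat=dim):
            neighbour = tuple(li + zi for li, zi in zip(l, z))
            if neighbour in levels:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[l] = c
    return coeffs


def classical_index_set(dim, n):
    """All level vectors l with l_i >= 1 and |l|_1 <= n + dim - 1."""
    return {l for l in product(range(1, n + 1), repeat=dim)
            if sum(l) <= n + dim - 1}


def drop_failed(index_set, failed):
    """Remove failed grids and every grid above them, keeping the set downward closed."""
    return {l for l in index_set
            if not any(all(li >= fi for li, fi in zip(l, f)) for f in failed)}


# Illustrative 2D example: the grid with level vector (2, 2) is assumed lost.
full = classical_index_set(dim=2, n=3)
before = combination_coefficients(full)
after = combination_coefficients(drop_failed(full, failed=[(2, 2)]))
print(before)
print(after)
```

In this example the recomputed combination uses the grids (1, 3) and (3, 1) with coefficient +1 and the grid (1, 1) with coefficient -1. The grid (1, 1) has coefficient zero in the fault-free combination, which suggests why a fault-tolerant scheme may need a few additional coarse grids computed in advance.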
Numerics of the combination technique
- Apply the optimized combination technique (OptiCom) to larger and more complex simulation scenarios in GENE, e.g. global non-linear simulations.
- Extend the theory for finite differences on sparse grids.
- Extend the theory of the iterated combination technique with convergence theory and numerical studies of Vlasov-type problems.
- Extend the theory of residual (subspace) correction schemes towards strongly formulated problems (adapt the finite-element-based iterated combination technique to finite-difference-type discretizations).
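For reference, the classical combination technique that the items above build on is commonly written as follows. This is the standard formulation from the literature, not a result of this project. The convention that level indices start at 1 is an assumption made here; $u_{\mathbf{l}}$ denotes the solution on the anisotropic full grid with level vector $\mathbf{l}$, $d$ the dimension and $n$ the target level.

```latex
u_n^{(c)} \;=\; \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q}
           \sum_{|\mathbf{l}|_1 = n + (d-1) - q} u_{\mathbf{l}},
\qquad \mathbf{l} \in \mathbb{N}^d,\; l_i \ge 1 .
```

For $d = 2$ this reduces to $u_n^{(c)} = \sum_{|\mathbf{l}|_1 = n+1} u_{\mathbf{l}} - \sum_{|\mathbf{l}|_1 = n} u_{\mathbf{l}}$.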
Application and application software
- Further extend our combination technique module of SG++, e.g. by algorithm-based fault tolerance and new algorithms for the iterated and optimized combination techniques.
- Develop a fault simulation software layer that emulates the behaviour of a fault-tolerant MPI implementation (e.g. ULFM) and allows us to simulate crashed processes. This allows us to carry out large-scale experiments with the fault-tolerant combination technique on a supercomputer using the system's native MPI implementation.
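The behaviour such a fault simulation layer has to provide can be sketched with a toy example. This is a minimal, hypothetical model in pure Python, not the interface of our layer or of ULFM: the class `SimulatedComm`, the exception `ProcessFailure` and all method names are assumptions made for illustration. A rank is only marked as failed, collective operations report the failure, and the application recovers by continuing on a communicator that contains the surviving ranks (loosely modelled on the shrink operation offered by ULFM).

```python
class ProcessFailure(Exception):
    """Raised when an operation involves a rank that has been marked as failed."""


class SimulatedComm:
    """Toy communicator that tracks which ranks are alive (hypothetical, no real MPI)."""

    def __init__(self, ranks):
        self.ranks = list(ranks)
        self.failed = set()

    def kill(self, rank):
        """Mark a rank as crashed; no process is actually terminated."""
        if rank not in self.ranks:
            raise ValueError(f"unknown rank {rank}")
        self.failed.add(rank)

    def allreduce_sum(self, contributions):
        """Sum one value per rank, reporting a failure if any rank is marked dead."""
        dead = sorted(self.failed)
        if dead:
            raise ProcessFailure(f"failed ranks: {dead}")
        return sum(contributions[r] for r in self.ranks)

    def shrink(self):
        """Return a new communicator that contains only the surviving ranks."""
        return SimulatedComm(r for r in self.ranks if r not in self.failed)


# Illustrative run: rank 2 "crashes", the application shrinks and retries.
comm = SimulatedComm(range(4))
values = {r: r + 1 for r in comm.ranks}
comm.kill(2)
try:
    total = comm.allreduce_sum(values)
except ProcessFailure:
    comm = comm.shrink()
    total = comm.allreduce_sum(values)
print(comm.ranks, total)
```

Because failures are only flagged rather than caused, an approach of this kind can run on top of an unmodified MPI installation, which matches the motivation given above for experiments with the system's native MPI.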