Research Goals

In the second funding period we will focus on four main components. Our progress in numerical methods for higher-dimensional problems in HPC will enable novel approaches and algorithms to tackle the exascale challenges of scalability, resilience and load balancing. These newly developed algorithms will extend our software framework for the computation of higher-dimensional problems on massively parallel systems. The results of the first funding period have shown that the combination technique provides the means for new approaches to algorithm-based fault tolerance, even for silent faults, without checkpoint-restart. Thus, we will put a special focus on resilience on all levels of parallelization. The plasma turbulence code GENE will serve as a representative application for higher-dimensional problems with an inherent need for exascale resources.

Exa-challenges: Scalability

  • Demonstrate scalability of our algorithms on a full supercomputer.

Exa-challenges: Load balancing

  • New strategies to refine our load models at runtime.
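
To illustrate what refining a load model at runtime could look like, the following is a minimal sketch and not the project's actual load model. It assumes that the runtime of a component grid is proportional to its number of grid points, updates that single rate from measured timings, and distributes grids over process groups with a greedy longest-first heuristic. All names (LoadModel, assign, grid_points) are hypothetical.

```python
from collections import defaultdict


def grid_points(level):
    """Number of points of a full grid with 2**l + 1 points per dimension."""
    n = 1
    for l in level:
        n *= 2 ** l + 1
    return n


class LoadModel:
    """Assumed model: runtime = rate * points, with the rate learned at runtime."""

    def __init__(self, initial_rate=1.0e-6, smoothing=0.5):
        self.rate = initial_rate    # seconds per grid point (starting guess)
        self.smoothing = smoothing  # weight given to the newest measurement

    def estimate(self, level):
        return self.rate * grid_points(level)

    def update(self, level, measured_seconds):
        # Exponential smoothing of the observed cost per grid point.
        observed = measured_seconds / grid_points(level)
        self.rate = (1 - self.smoothing) * self.rate + self.smoothing * observed


def assign(levels, num_groups, model):
    """Greedy assignment: most expensive grid first, to the least loaded group."""
    loads = [0.0] * num_groups
    plan = defaultdict(list)
    for level in sorted(levels, key=model.estimate, reverse=True):
        group = loads.index(min(loads))
        plan[group].append(level)
        loads[group] += model.estimate(level)
    return dict(plan), loads
```

After each time step, update() can be called with the measured runtime of a grid, and assign() can be rerun to obtain a new distribution. A realistic model would probably also account for the anisotropy of the grids, which this sketch ignores.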

Exa-challenges: Resilience

  • Massively parallel simulations with the fault-tolerant combination technique (FTCT). To test our algorithms, we will simulate (hard) faults on HPC systems using our home-grown fault simulation layer.
  • Detect and correct errors caused by silent data corruption (SDC) with the combination technique.
  • Application-level resilience and fault-tolerant alternatives to standard MPI. We investigate libraries (e.g. ULFM) and techniques that enable a fault-tolerant domain decomposition for parallel applications, targeting the GENE code.
  • Numerics-based approaches to resilience, e.g. the iterated combination technique or randomized subspace correction.
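
The idea behind algorithm-based fault tolerance with the combination technique can be sketched as follows: when a component grid is lost, the combination coefficients are recomputed on a reduced index set instead of restarting from a checkpoint. The sketch below uses the standard inclusion-exclusion formula for a downward-closed index set and simply drops the failed grid together with all grids above it. This is one simple recovery strategy chosen for illustration, not necessarily the scheme used in the project, and the function names are hypothetical.

```python
from itertools import product


def classical_index_set(dim, n):
    """All levels l with l_i >= 1 and |l|_1 <= n + dim - 1."""
    return {l for l in product(range(1, n + 1), repeat=dim)
            if sum(l) <= n + dim - 1}


def combination_coefficients(index_set):
    """c_l = sum over z in {0,1}^d of (-1)^|z| if l + z is in the index set."""
    dim = len(next(iter(index_set)))
    coeffs = {}
    for l in index_set:
        c = 0
        for z in product((0, 1), repeat=dim):
            shifted = tuple(a + b for a, b in zip(l, z))
            if shifted in index_set:
                c += (-1) ** sum(z)
        if c != 0:  # grids with zero coefficient need not be computed
            coeffs[l] = c
    return coeffs


def drop_failed(index_set, failed):
    """Remove the failed grid and every grid above it (keeps the set downward closed)."""
    return {l for l in index_set
            if not all(a >= b for a, b in zip(l, failed))}
```

For example, in two dimensions with n = 3 the classical scheme combines the grids (1,3), (2,2), (3,1) with coefficient +1 and (1,2), (2,1) with coefficient -1. If grid (2,2) is lost, the recomputed scheme uses (1,3) and (3,1) with +1 and (1,1) with -1, so only one coarse and therefore cheap grid has to be computed additionally.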

Numerics of the combination technique

  • Apply the optimized combination technique (OptiCom) to larger and more complex simulation scenarios in GENE, e.g. global non-linear simulations.
  • Extend the theory for finite differences on sparse grids.
  • Extend the theory of the iterated combination technique with convergence theory and numerical studies of Vlasov-type problems.
  • Extend the theory of residual (subspace) correction schemes towards strongly formulated problems (adapt the finite-element-based iterated combination technique to finite-difference-type discretizations).
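
As background for the items above, the classical combination technique approximates a fine full-grid solution by a weighted sum of solutions on coarser anisotropic grids. The following toy example, which is an illustration and not project code, applies the two-dimensional version to numerical integration with the trapezoidal rule on the unit square. The level convention (2**l cells per direction, l >= 1) and the function names are assumptions made for this sketch.

```python
import math


def trapezoid_2d(f, l1, l2):
    """Tensor-product trapezoidal rule on [0,1]^2 with 2**l1 x 2**l2 cells."""
    n1, n2 = 2 ** l1, 2 ** l2
    total = 0.0
    for i in range(n1 + 1):
        wi = 0.5 if i in (0, n1) else 1.0  # boundary points get half weight
        for j in range(n2 + 1):
            wj = 0.5 if j in (0, n2) else 1.0
            total += wi * wj * f(i / n1, j / n2)
    return total / (n1 * n2)


def combined_integral(f, n):
    """Classical 2D combination: grids with |l|_1 = n + 1 minus grids with |l|_1 = n."""
    result = 0.0
    for l1 in range(1, n + 1):
        result += trapezoid_2d(f, l1, n + 1 - l1)
    for l1 in range(1, n):
        result -= trapezoid_2d(f, l1, n - l1)
    return result


if __name__ == "__main__":
    exact = (math.e - 1.0) ** 2  # integral of exp(x + y) over the unit square
    for n in (3, 4, 5, 6):
        approx = combined_integral(lambda x, y: math.exp(x + y), n)
        print(n, abs(approx - exact))
```

Each component grid is fine in at most one direction, so the individual problems stay small and can be computed independently of each other, which is what makes the approach attractive for higher-dimensional problems.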

Application and application software

  • Further extend our combination technique module of SG++, e.g. with algorithm-based fault tolerance and new algorithms for the iterated and optimized combination techniques.
  • Develop a fault simulation software layer that emulates the behaviour of a fault-tolerant MPI implementation (e.g. ULFM) and allows us to simulate crashed processes. This allows us to carry out large-scale experiments with the fault-tolerant combination technique on a supercomputer using the system's native MPI implementation.
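
The principle of such a fault simulation layer can be sketched in a few lines. The toy class below is purely hypothetical and runs inside a single Python process. It is neither the project's layer nor an MPI binding. It marks ranks as failed without terminating anything, reports an error when a collective operation involves a failed rank, and offers a shrink operation that returns a communicator of survivors, loosely mirroring the error reporting and MPIX_Comm_shrink of ULFM.

```python
class ProcessFailedError(Exception):
    """Raised when an operation involves a rank that is marked as failed."""


class SimulatedComm:
    """Toy communicator that emulates process failures (illustration only)."""

    def __init__(self, ranks):
        self.ranks = list(ranks)
        self.failed = set()

    def kill(self, rank):
        """Mark a rank as crashed; no real process is terminated."""
        if rank not in self.ranks:
            raise ValueError(f"unknown rank {rank}")
        self.failed.add(rank)

    def allreduce_sum(self, contributions):
        """Sum one value per rank; fails if any participant is marked as failed."""
        dead = sorted(self.failed.intersection(self.ranks))
        if dead:
            raise ProcessFailedError(f"failed ranks: {dead}")
        return sum(contributions[r] for r in self.ranks)

    def shrink(self):
        """Return a new communicator that contains only the surviving ranks."""
        return SimulatedComm(r for r in self.ranks if r not in self.failed)
```

An application would catch ProcessFailedError, call shrink(), and then continue on the surviving ranks, for example by recomputing the combination coefficients without the lost component grids.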