Research Results

Software framework for massively parallel computations using the Combination Technique

In order to apply and test the combination technique on GENE, we have developed two software tools. A Python interface, which provides an easy access to GENE used for prototyping and testing, and a generic C++ framework intended for massively parallel simulations with GENE, but also with other parallel codes.
In the Python interface, we implemented the standard combination technique, fault-tolerant algorithms, and eigenvalue OptiCom techniques. That interface can be easily used for developing advanced combination technique schemes, and serves as a framework for testing them on small and medium sized GENE computations.
Our C++ framework efficiently runs time-dependent initial value computa- tions with GENE on massively parallel HPC systems, but also supports other application codes. It is part of the part sparse grid software library SG++. The framework implements the manager- worker pattern: The available compute resources, in the form of MPI processes, are arranged in process groups. A central manager process distributes the par- tial problems to the process groups. Each process group executes an instance of the application, e.g. GENE, in order to compute the partial problems assigned to it. The framework already successfully employs the communica- tion schemes and load-balancing techniques presented below and will be extended to include fault tolerance techniques, as well as support the optimized combination technique and the iterated defect correction.

Exa-challenges and solutions: Scalability

The coarse (second) level of parallelization decouples the original full-grid problem into many smaller, fully independent anisotropic subproblems, which are finally combined into one solution. In time-dependent computations, the corresponding partial solutions are recombined every few timesteps to avoid divergence. Even when divergence is not an issue, it might be necessary to calculate the combination solution every few timesteps to trace certain aspects of the solution field over time. For initial value simulation with GENE, for example, a typical quantity of interest is the evolution of the particles' distribution functions. The recombination requires global communication steps, which might affect the scalability on large parallel systems.
To solve this issue, we have developed global communication schemes for the combination technique that minimize the communication time by exploiting the hierarchical structure of the combination solution. Experiments on 456 nodes of the supercomputer Hermit (HLRS) were performed for different implementations of these schemes. A five-dimensional test problem was discretized up to level 13. Using the non-blocking collective operations of the MPI 3.0 standard, we could confirm a reduction in communication time of up to a factor of 72, compared to the naive algorithm Sparse Grid Reduce.


For the large-scale plasma turbulence simulations with GENE that we aim for, a single component grid will be so large that it has to be computed by a group of processes distributed over multiple nodes. In our latest works we in- vestigated the combination of such distributed component grids. We present experimental results that demonstrate the scalability of our combination al- gorithm on up to 180,224 cores on the supercomputer Hazel Hen. In our experiments we used 182 component grids that are computed on process groups of up to 8,192 processes (each process group computes several compo- nent grids one after each other). For the largest experiments we used 11 process groups which had 8,192 processes each. Our recombination algorithm is fast enough that it would even be possible to recombine a large GENE simulation after each time step.


Exa-challenges and solutions: Load balancing

In order to perform effective load balancing with the combination technique, a load model to estimate the runtime of the partial solutions is required. However, to achieve a good estimate of the runtime, this means considering not only different problem sizes associated with the different partial grids, but also the anisotropy of the discretization. GENE, for example, treats the dimensions differently when choosing the optimal domain decomposition for a particular number of processes. This is due to the fact that not every dimension can be decomposed with the same granularity while ensuring a high degree of efficiency. Furthermore, depending on the mode of operation, the discretization has large effects on the underlying numerics. Thus we have to consider the anisotropy of each grid, which strongly influences the convergence rate and stability of the underlying numerical solvers. Variations in the degree of anisotropy result in variable number of linear solver iterations, and smaller timestep sizes are necessary for very anisotropic discretizations, even for comparable number of unknowns. In GENE, execution times may vary by more than a factor of three for partial solutions with about the same number of unknowns but different degrees of anisotropy.
To address this issue, we developed a load model that attempts to estimate the execution time of a partial solution based on the number of unknowns and the anisotropy of the discretization. Using this model we were able to considerably increase the parallel efficiency of the second level of parallelization compared to a basic model that considers only the number of unknowns.


Exa-challenges and solutions: Fault tolerance

Through the joint collaboration with research associate Hegland's group in Australia, important steps have been taken in the development of the theory of a fault-tolerant combination technique (FTCT). The most recent results in the theory of the FTCT have allowed us to successfully apply these resilience techniques to GENE, thus showing its robustness beyond the usual toy problems (e.g., 2D convection or Poisson equations). The underlying idea is to exclude partial combination solutions that go lost due to hardware or software failures (unless recomputation of the grid is affordable), and to find a new combination of the successfully calculated grids. We have carried out cost and error analyses for various simulation scenarios of GENE using this fault tolerance approach, including the time-dependent study of instabilities driven by a trapped electron mode (TEM) and the ion temperature gradient (ITG). This algorithmic approach saves us the cost of performing checkpointing, and seems to be a viable approach to fault tolerance on future HPC systems. The figures show how the combination solution is affected by (simulated) process faults, and how the FTCT can recover the original solution by simply using different weights of the combination grids and a negligible recomputation overhead.

     
GENE PDF: combination technique solution with 20 grids    GENE PDF: after 5 grids go missing    GENE PDF: recovery by FTCT, excluding failed grids

We incorporated the FTCT into our software framework and tested it for a 5-dimensional advection-diffusion equation implemented in DUNE. When faults occur within a process group, it is detected by the process manager at the next global communication step. The manager can then exclude the dead group from the communication and redistribute the failed tasks among the living groups. Only in some cases a small fraction of the component solutions has to be recomputed, but the total overhead to ensure fault tolerance is very small. We have shown that this approach works well using the Hazel Hen supercomputer with up to 65 000 processes.

  
Scaling experiments for a 5D combination technique with faults    Convergence of the combination solutoin after fault recovery


The combination technique can also be used to detect silent data corruption (SDC) by exploiting the data redundancy among the component solutions. Through further collaboration with the ANU Canberra we have developed algorithms to detect if component solutions have been affected by SDC. If detected, we can apply the FTCT to recover a good combination technique, excluding wrong component solutions.

     
One component solution affected by SDC    The resulting combination technique    Affected solutions can be detected


Eigenvalue problems in GENE

In the classical combination technique, the combination coefficients are fixed, which does not allow a direct combination of eigenmodes of GENE, since they are subject to a phase shift. We developed a new technique that does not rely on an operator bound norm. Instead, the eigenvalue problem is formulated as an optimization problem, in which the norm of the residual is minimized. The minimization adapts the combination coefficients. This so-called OptiCom method for eigenvalues not only combines the partial solutions to obtain a good approximation. It can also identify bad partial approximations, which can occur due to the anisotropic grids used, where some physical effects are not resolved well enough. With the OptiCom, these partial approximations obtain very small combination coefficients and they are thus effectively disregarded for the combined solution.


Numerics

A crucial ingredient towards convergent and reliable numerical methods for high-dimensional problems in exascale computing is an in-depth theoretical and practical analysis of the proposed numerical techniques. While the ultimate objective is a deep theoretical understanding of the full application problem, we started the theoretical analysis with simplified model problems.
Such a problem is the elliptic boundary value problem in weak formulation. While there is convergence theory for the solution of elliptic PDEs by sparse grid methods and by the sparse grid combination technique, the proposed iterated (optimized) combination technique has not been studied in theoretical and numerical terms, so far. Such an analysis has been achieved by PI Griebel and collaborators during the last year. Here, subspace correction approximations of higher-dimensional elliptic problems were carried out comparing the use of different scaling parameter approximations for the respective subspaces. It turned out that the iterated optimized combination technique is indeed an optimal candidate for approximations of higher-dimensional elliptic problems. The iterated optimized combination technique achieves the best convergence results compared to other iterative combination technique schemes for a four-dimensional elliptic problem.