Organizational Structures in Massively Distributed Systems
Team Members: Christoph Ertl, Ernst Rank
Duration: 01.2016-12.2021
Background
One of the main challenges in numerical computing on modern high-performance clusters for the simulation of real-world phenomena is the efficient handling and management of the simulation domain, which is usually distributed among the computational resources. This includes the subdivision of the domain, its distribution, and the continued balancing of workload per computational resource during runtime, all while minimising the communication between the participating units. Furthermore, the solution process of the emerging system of equations has to be synchronised among all parts of the domain, especially when using schemes like multigrid methods.
Classical approaches may store the complete topology with each unit. This entails a significant memory requirement and expensive global communication overhead to ensure consistency whenever the domain configuration is altered, for example through adaptive mesh refinement (AMR) or user interaction during runtime. Such an approach is therefore mostly limited to simple uniform domain configurations with similar physical models. More sophisticated techniques designate a subset of the participating processes as bookkeeping instances for the domain. While this reduces both the memory requirement and the cost of global communication, the overhead of communicating with the central bookkeeping instance cannot be circumvented and grows with the size of the systems used.
The current trend towards systems with a growing number of processors, while the memory per process stays constant or even decreases, is expected to worsen the aforementioned bottlenecks further.
Decentral Organisational Structures
This work addresses the aforementioned shortcomings by employing a decentral approach to domain organisation. The essential idea is to limit the domain view of each participating unit to its direct neighbours. Transfer of data and updates of the topology are realised only between direct neighbours, so global updates are not necessary. Since there is an upper bound on the number of neighbours each subdomain can have, regardless of the total domain size, this approach promises to scale even when computing on the largest clusters.
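The neighbour-limited view can be illustrated with a minimal sketch. The class and function names below are illustrative assumptions, not taken from the project's code base: each subdomain stores only the IDs of its direct neighbours, so a topology change such as a change of ownership is propagated by a bounded number of point-to-point notifications instead of a global update.

```python
# Minimal sketch of a decentral topology (illustrative names): each
# subdomain knows only its direct neighbours, so a topology change
# triggers point-to-point notifications rather than a global update.

class Subdomain:
    def __init__(self, sid):
        self.sid = sid
        self.neighbours = set()  # IDs of directly adjacent subdomains only

def connect(a, b):
    """Register two subdomains as direct neighbours of each other."""
    a.neighbours.add(b.sid)
    b.neighbours.add(a.sid)

def migrate(moving, new_sid, registry):
    """Relabel `moving` (e.g. after a change of owner) and inform only
    its direct neighbours; returns the list of notified subdomain IDs."""
    notified = []
    for nid in moving.neighbours:        # bounded, independent of total size
        registry[nid].neighbours.discard(moving.sid)
        registry[nid].neighbours.add(new_sid)
        notified.append(nid)
    return notified
```

Regardless of how many subdomains exist in total, the number of messages per alteration is bounded by the (fixed) maximum number of neighbours.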
The decentral management facilities are being implemented on top of a code framework for fluid flow simulations specifically designed for massively parallel computers and featuring a dedicated hierarchic data structure. Here, coarser domain representations stemming from an octree-based domain generation are not discarded, but kept and used in a custom-tailored multigrid-like solver. Neighbourhood relations thus exist in a hierarchical sense in addition to a geometrical sense.
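The retained hierarchy can be sketched as follows. This is not the project's actual data structure, only an assumed minimal model: every octree node keeps its parent and children, so each node has geometric neighbours on its own level and hierarchic neighbours (parent/children) across levels, and every coarse level remains available to a multigrid-like solver.

```python
# Illustrative sketch (not the project's actual data structure): an octree
# whose coarse levels are retained, giving each node hierarchic neighbours
# (parent/children) in addition to geometric neighbours on its own level.

class OctreeNode:
    def __init__(self, level, index):
        self.level = level
        self.index = index       # position within its level
        self.parent = None       # hierarchic neighbour (coarser level)
        self.children = []       # hierarchic neighbours (finer level)

def refine(node, max_level):
    """Recursively create 8 children per node; coarse nodes are kept."""
    if node.level == max_level:
        return
    for i in range(8):
        child = OctreeNode(node.level + 1, node.index * 8 + i)
        child.parent = node
        node.children.append(child)
        refine(child, max_level)

def nodes_per_level(root):
    """Count nodes on each retained level of the hierarchy."""
    counts = {}
    stack = [root]
    while stack:
        n = stack.pop()
        counts[n.level] = counts.get(n.level, 0) + 1
        stack.extend(n.children)
    return counts
```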
The new approach significantly affects various parts of the numerical code, primarily the communication between neighbouring domains, i.e. between processes holding parts of the domain that are hierarchic or geometric neighbours. One challenge is to find the pairs of processes that must be informed about domain alterations, preferably in varying orders so as to relieve slower, more heavily loaded processes.
The octree-based domain generation had to be redesigned from a centralised scheme to likewise support the decentral structure. To this end, an algorithm has been devised which refines an input geometry on all processes up to a predetermined depth before distributing the leaf domains, which are then further refined on their respective processes.
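The two phases of this generation scheme can be sketched as follows, with illustrative function names and a round-robin assignment assumed purely for demonstration: every process first refines the shared coarse tree to a common depth, then keeps and further refines only the leaves assigned to its own rank.

```python
# Sketch of the two-phase decentral grid generation (assumed names and a
# round-robin assignment for illustration; the actual assignment strategy
# is not specified here).

def coarse_leaves(depth, fanout=8):
    """Phase 1: identical on every process - enumerate the IDs of all
    coarse leaf domains at the predetermined refinement depth."""
    return list(range(fanout ** depth))

def my_leaves(rank, nprocs, leaves):
    """Phase 2: each process keeps only its share of the coarse leaves
    and continues refining those locally."""
    return [leaf for i, leaf in enumerate(leaves) if i % nprocs == rank]
```

Since phase 1 is deterministic and identical everywhere, no central instance is needed to agree on the coarse tree; the shares in phase 2 are disjoint and cover all leaves.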
The fluid solver has been extended with asynchronous iterative matrix solvers in combination with additive multigrid methods. Due to its distributed nature, the structure of the code lends itself well to, and greatly benefits from, this asynchronicity.
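The core idea of an asynchronous iteration can be shown with a small sketch, which is not the project's solver: components of the solution are updated in arbitrary order using whatever values are currently available, mimicking processes that do not wait for each other. For strictly diagonally dominant systems such an iteration still converges.

```python
import random

# Hedged sketch of an asynchronous Jacobi-type iteration (not the
# project's solver): a randomly chosen component is updated using the
# currently available, possibly stale, values of the other components.

def async_jacobi(A, b, sweeps=2000, seed=0):
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        i = rng.randrange(n)      # an arbitrary component updates next
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x
```

In the distributed setting each "component" corresponds to a subdomain, and the stale values correspond to the last data received from a neighbour.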
Load balancing becomes more involved, as the predominantly used strategy, space-filling curves, relies on complete topological knowledge. Therefore, a heuristic concept is proposed that determines the best target computing units for the transfer of subdomains, weighing an optimised balance of computational work against the communication cost incurred when applying stencil-based computational kernels.
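One possible shape of such a heuristic is sketched below; the score function, its linear weighting, and all names are assumptions for illustration, not the project's actual formula. A candidate target is scored by combining its current load with a penalty for the stencil communication the transfer would create: a lightly loaded target that already shares many subdomain faces keeps stencil exchanges local.

```python
# Illustrative transfer heuristic (assumed weighting, not the project's
# actual formula): lower scores are better.

def score(target_load, shared_faces, total_faces, alpha=0.5):
    """Combine the target's relative load with a communication penalty
    based on how few faces the subdomain shares with that target."""
    comm_penalty = 1.0 - shared_faces / total_faces
    return alpha * target_load + (1.0 - alpha) * comm_penalty

def best_target(candidates):
    """candidates: {rank: (load, shared_faces, total_faces)};
    return the rank with the lowest combined score."""
    return min(candidates, key=lambda r: score(*candidates[r]))
```

Because only neighbouring processes are considered as candidates, the heuristic needs no global topological knowledge, in contrast to a space-filling-curve partitioning.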
Further topics include I/O based on HDF5, in which all domains collectively write to one file, as well as an online visualisation technique that allows data to be streamed at constant bandwidth while the visualisation window and the resolution of the displayed data are altered during runtime.