Fault-tolerant computing options based on restart information stored both on and off node, together with reserve processes, have been developed, implemented and tested in a large-scale production field solver from the domain of computational fluid dynamics. The tests conducted to date show recovery rates approaching 100% under realistic node failure scenarios. The computational cost of the field solver itself is very low (explicit time-marching and finite differences), which makes any added overhead proportionally more visible; even so, the fault-tolerant implementation adds a run-time penalty of only 6–12%, depending on the spatial and temporal approximation used. The procedures developed are generally applicable and could easily be ported to other codes.
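
To make the idea of on-node and off-node restart information combined with reserve processes concrete, the following is a minimal, single-process sketch in Python. It is not the implementation described in the paper: the `Node` class, the ring-neighbour placement of the off-node copy, and the `checkpoint`, `step` and `recover` functions are all hypothetical names and assumptions chosen for illustration, and a real solver would use a message-passing library rather than in-memory lists.

```python
class Node:
    """Simulated compute node (hypothetical; stands in for one process)."""

    def __init__(self, rank, state):
        self.rank = rank
        self.state = list(state)   # current field data owned by this node
        self.local_ckpt = None     # restart copy kept on this node
        self.buddy_ckpt = None     # restart copy held on behalf of the previous node
        self.alive = True


def checkpoint(nodes):
    """Store each node's state on the node itself and on its ring neighbour."""
    n = len(nodes)
    for i, node in enumerate(nodes):
        node.local_ckpt = list(node.state)               # on-node copy
        nodes[(i + 1) % n].buddy_ckpt = list(node.state)  # off-node copy (assumed ring layout)


def step(nodes):
    """Stand-in for one explicit time step: advance every value by 1."""
    for node in nodes:
        node.state = [x + 1.0 for x in node.state]


def recover(nodes, failed, spares):
    """Replace a failed node with a reserve node and roll everyone back."""
    n = len(nodes)
    buddy = nodes[(failed + 1) % n]   # holds the failed node's off-node copy
    spare = spares.pop()              # take one reserve node out of the pool
    spare.rank = failed
    spare.state = list(buddy.buddy_ckpt)
    spare.alive = True
    nodes[failed] = spare
    # Surviving nodes restart from their own on-node copies so that
    # all nodes are back at the same checkpointed time level.
    for node in nodes:
        if node is not spare:
            node.state = list(node.local_ckpt)
    # Re-create both copies everywhere, including those lost with the failed node.
    checkpoint(nodes)


if __name__ == "__main__":
    nodes = [Node(r, [float(r)]) for r in range(4)]
    spares = [Node(-1, [])]
    checkpoint(nodes)
    step(nodes)
    step(nodes)
    nodes[2].alive = False            # simulated node failure
    recover(nodes, 2, spares)
    print([node.state for node in nodes])
```

In this sketch the on-node copy lets surviving nodes roll back without communication, while the off-node copy is what allows a reserve node to take over the failed node's share of the domain. Keeping the copies in memory rather than on a shared file system is one plausible way to keep the run-time penalty small, though the sketch makes no claim about how the reported 6–12% figure was obtained.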

Document information

Published on 01/01/2020

DOI: 10.1080/10618562.2020.1773448
Licence: CC BY-NC-SA license
