iSGTW Feature - Adaptive fault tolerance for improved reliability

Image generated using cosmology modeling software Enzo. FT-Pro is designed to help programs like Enzo run for longer by adaptively selecting a best‐fit action in response to failure prediction.
Image courtesy of Mike Norman at University of California, San Diego and Greg Bryan at Columbia University, New York, U.S.

So there's the good news. And then there's the bad news.

The good news is that high performance computing systems are getting bigger.

And the bad news? As system size increases, mean time between failures (MTBF) drops dramatically: the number of hours you can run your application before everything grinds to a halt keeps getting smaller.

Yawei Li and Zhiling Lan of the Illinois Institute of Technology, U.S., want to change all that.

They have developed an adaptive fault management scheme, called FT-Pro, which has already improved the robustness of several real-world applications run on the TeraGrid, including Enzo, a software package for simulating cosmological structures, and GROMACS, a molecular dynamics package for studying molecular interactions.

"Applications like these are getting larger, running for longer, and using more processors," Lan says. "But, since just one process failure can crash your entire application, these applications are extremely vulnerable to failure."

The usual solutions, says Lan, are either reactive (checkpointing at regular intervals) or proactive (predicting potential failures before they occur). Both options are fraught with complications.

"Regular checkpointing results in substantial performance overhead, while predicting failure can be very hit-and-miss. We wanted something that could combine the best of both these approaches."
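To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It is not taken from FT-Pro. It assumes node failures are independent, so that the mean time between failures of the whole machine is roughly the per-node figure divided by the node count, and it uses Young's first-order approximation for the checkpoint interval. The function names and the sample figures (a 20,000-hour node MTBF and a six-minute checkpoint) are illustrative assumptions only.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Approximate MTBF of the whole machine, assuming independent node failures."""
    return node_mtbf_hours / nodes


def young_interval_hours(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the best interval between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_overhead(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints at that interval."""
    interval = young_interval_hours(checkpoint_cost_hours, mtbf_hours)
    return checkpoint_cost_hours / (interval + checkpoint_cost_hours)


if __name__ == "__main__":
    node_mtbf = 20000.0      # hours per node (assumed figure)
    checkpoint_cost = 0.1    # hours to write one checkpoint (assumed figure)
    for nodes in (100, 1000, 10000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        share = checkpoint_overhead(checkpoint_cost, mtbf)
        print(f"{nodes:>6} nodes: system MTBF {mtbf:8.1f} h, "
              f"checkpoint overhead {share:.1%}")
```

Under these assumptions, the share of time lost to checkpointing grows as the machine gets bigger, which is the overhead Lan describes. The approximation is only reasonable while the checkpoint cost stays small relative to the system MTBF.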

FT-Pro is Lan's solution. The program works in conjunction with regular failure management tools, but introduces the flexibility of adaptive decision making: FT-Pro can make runtime decisions based on a user's fault tolerance requests.

Zhiling Lan is working to increase mean time between failures by introducing adaptive fault tolerance.
Image courtesy of Zhiling Lan

"We would like to see FT-Pro used to help avoid anticipated failures, and to help applications tolerate unforeseeable failures, so that the impact of any failure is kept to a minimum," explains Lan.

The system works by allocating a couple of spare nodes, used as an extra hand to juggle jobs on and off nodes where failure is predicted.

Usually kept idle, these spare nodes allow processes to migrate away from failing nodes, buying time for those nodes to recover or restart and so keeping application execution times down.
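The idea of picking a best-fit action at runtime can be sketched in a few lines of Python. This is a simplified illustration, not FT-Pro's actual decision logic: the three actions (skip, checkpoint, migrate), the `Costs` fields, and the expected-time formulas are all assumptions made for the example. It also assumes that migrating to a spare node fully avoids a predicted failure.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SKIP = "skip"              # carry on, no fault tolerance overhead
    CHECKPOINT = "checkpoint"  # save state before the next interval
    MIGRATE = "migrate"        # move the process to a spare node


@dataclass
class Costs:
    """Hypothetical cost model; all times in minutes."""
    checkpoint: float  # time to write a checkpoint
    migrate: float     # time to move a process to a spare node
    recovery: float    # time to restart after a failure
    interval: float    # length of the next compute interval
    unsaved: float     # work done since the last checkpoint


def choose_action(p_fail: float, costs: Costs, spare_available: bool) -> Action:
    """Return the action with the lowest expected time for the next interval."""
    expected = {
        # A failure loses the unsaved work plus the current interval.
        Action.SKIP: costs.interval
        + p_fail * (costs.recovery + costs.unsaved + costs.interval),
        # A fresh checkpoint limits the loss to the current interval.
        Action.CHECKPOINT: costs.checkpoint + costs.interval
        + p_fail * (costs.recovery + costs.interval),
    }
    if spare_available:
        # Assumption: migration sidesteps the predicted failure entirely.
        expected[Action.MIGRATE] = costs.migrate + costs.interval
    return min(expected, key=expected.get)


if __name__ == "__main__":
    example = Costs(checkpoint=2.0, migrate=5.0, recovery=10.0,
                    interval=60.0, unsaved=120.0)
    for p in (0.0, 0.01, 0.9):
        print(p, choose_action(p, example, spare_available=True).value)
```

In this toy model, a low predicted failure probability makes skipping the cheapest option, while a high one favors migration when a spare node is free and checkpointing when it is not.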

Trace-based experiments on the IA32 Linux cluster at Argonne National Laboratory (part of TeraGrid) have indicated that FT-Pro can effectively improve the performance of parallel applications in the presence of failures by avoiding anticipated failures and skipping unnecessary fault tolerance overhead.

For example, when running Enzo on the 96-node IA32 TeraGrid/ANL cluster, FT-Pro reduced application completion time by up to 43% compared with relying purely on periodic checkpointing.

FT-Pro is supported in part by the U.S. National Science Foundation, an IIT startup fund, and a TeraGrid Wide-Roaming Allocation.

- Cristy Burne, iSGTW
