iSGTW Feature - Embrace failure - TeraGrid fault tolerance workshop

Feature - Embrace failure!

Don Lamb addressing the Fault Tolerance for Extreme Scalability Workshop, co-sponsored by The National Science Foundation Office of Cyberinfrastructure's Blue Waters and TeraGrid projects.

Image courtesy of TeraGrid External Relations.

Can smart checkpoints and fault-resilient applications avert a Malthusian Catastrophe?

As more powerful systems encompass ever-increasing numbers of components, even a small fault rate on individual processors adds up to frequent faults across the system as a whole, stopping long-running applications in their tracks.
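A back-of-the-envelope sketch can make this scaling concrete. It assumes that component failures are independent and exponentially distributed, and that the job stops when any single component fails. These are modeling assumptions made here for illustration, not claims from the workshop, and the five-year component MTBF and the component counts are hypothetical numbers.

```python
import math

HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Mean time between failures of a system that stops when any one of
    n independent components fails (exponential failure model assumed)."""
    return component_mtbf_hours / n_components


def prob_run_survives(run_hours: float, component_mtbf_hours: float,
                      n_components: int) -> float:
    """Probability that a run of the given length sees no failure at all."""
    return math.exp(-run_hours * n_components / component_mtbf_hours)


if __name__ == "__main__":
    component_mtbf = 5 * HOURS_PER_YEAR  # hypothetical: one failure per 5 years
    for n in (1_000, 10_000, 100_000):   # hypothetical component counts
        mtbf = system_mtbf_hours(component_mtbf, n)
        p = prob_run_survives(24.0, component_mtbf, n)
        print(f"{n:>7} components: system MTBF {mtbf:8.2f} h, "
              f"P(24 h run survives) = {p:.3f}")
```

Under these assumptions the system-level failure rate grows linearly with the number of components, so a per-component rate that looks negligible on one node can still interrupt a day-long run on a very large machine.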

At a workshop in March, U.S. experts met to discuss issues relating to the fault-tolerance of today's and tomorrow's petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults.

"It is invaluable for the systems specialists, middleware designers, and applications scientists to share their experiences and to talk about their expectations for other parts of the HPC ecosystem. This is the only way we will know what works, what doesn't work, and what we still need to do," said Daniel S. Katz, TeraGrid Grid Infrastructure Group Director of Science and lead organizer of the workshop.

While sharing her experiences with Kraken, TeraGrid's largest supercomputer, Patricia Kovatch of the National Institute for Computational Sciences drew an analogy with Thomas Malthus' famous prediction about geometric population growth (as the population gets bigger, it grows faster) versus a constant rate of growth in agricultural output. She argued that a similar gap exists between the growth in application size and system complexity on the one hand, and the rate of improvement in failure mitigation techniques on the other.

"To stave off this Malthusian Catastrophe," she said, "we are leveraging some of the same techniques that agriculture has: concentrating resources and making large infrastructure investments, developing wider markets and better distribution networks, and implementing more efficient technologies."

Don Lamb, a University of Chicago professor and Director of the ASC/Alliance Flash Center, presented experiences from three production runs of FLASH, simulation software used by scientists in fields such as cosmology and plasma physics.

"FLASH handles astronomically large ranges of values of physical quantities, and operates at the upper level of available memory," said Lamb. "Consequently, it has walked into almost every hardware or software limitation in the high end systems."

"A checkpoint/rollback capability is in place," he said, referring to a feature that saves a snapshot of a job's progress from which it can be restarted at a later time. "But it is controlled by the application, which has no way of detecting imminent component failures. If a failure happens just before checkpointing, rollback can be expensive." He suggested a solution, called the Fault Tolerance Backplane, that could keep the application informed about the state of the machine, so that the application can write a checkpoint before an imminent failure and avoid the expensive recovery scenario.
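The cost Lamb describes, and the benefit of an early warning, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions and not the Fault Tolerance Backplane's actual interface: the function `simulate`, the `warning_lead` parameter, and all step counts are hypothetical, and the model assumes a single failure and free, instantaneous checkpoints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    steps_lost: int   # completed steps that had to be redone after rollback
    checkpoints: int  # number of checkpoints written


def simulate(total_steps: int, checkpoint_every: int, failure_step: int,
             warning_lead: Optional[int] = None) -> RunResult:
    """Simulate a job hit by one failure once `failure_step` steps are done.

    `warning_lead` is how many steps before the failure a (hypothetical)
    monitoring layer warns the application; None means no warning at all.
    """
    last_checkpoint = 0      # completed-step count saved in latest checkpoint
    checkpoints = 0
    steps_lost = 0
    failure_pending = True
    step = 0                 # number of completed steps
    while step < total_steps:
        warned = (failure_pending and warning_lead is not None
                  and step == failure_step - warning_lead)
        # Checkpoint on the regular schedule, or immediately when warned.
        if step > last_checkpoint and (step % checkpoint_every == 0 or warned):
            last_checkpoint = step
            checkpoints += 1
        if failure_pending and step == failure_step:
            steps_lost += step - last_checkpoint
            step = last_checkpoint   # roll back to the latest checkpoint
            failure_pending = False
            continue
        step += 1
    return RunResult(steps_lost, checkpoints)


if __name__ == "__main__":
    print("no warning:  ", simulate(100, 25, 49))
    print("with warning:", simulate(100, 25, 49, warning_lead=1))
```

In this toy model the failure lands just before a scheduled checkpoint, so the unwarned run repeats almost a full interval of work, while the warned run writes one extra checkpoint and loses only the work done after the warning.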

John Daly, addressing the workshop.

Image courtesy of TeraGrid External Relations.

Several tool and application developers and other systems specialists shared their experiences regarding faults and resiliency, methodologies for acceptance testing, and performance metrics that recognize inevitable events such as chassis failure, boot failure, and silent corruption.

John Daly of the Research Directorate at the National Security Agency currently leads an effort on resilience for the Advanced Computing Systems research program. He advocates shifting the focus from fault tolerance in systems to resilience in applications.

Daly outlined three problems he sees in fault-tolerance approaches. First, as the number and density of components increase, so do system faults, and recovery-based fault tolerance is approaching a theoretical limit. Second, redundancy-based schemes increase the share of resources dedicated to fault recovery. Third, silent failure modes, which are intolerable for many application users, reduce monitoring effectiveness, and hence both application progress and certainty of correctness.
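One way to see why checkpoint-based recovery runs into a limit is Young's classic first-order approximation for the checkpoint interval, which is used here as an illustration and is not taken from Daly's talk. The sketch assumes the checkpoint cost is small compared with the system MTBF and ignores restart time; the 15-minute checkpoint cost and the MTBF values are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints:
    sqrt(2 * checkpoint_cost * mtbf). Both arguments share one time unit."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float,
                      mtbf: float) -> float:
    """First-order estimate of the overhead relative to useful compute time:
    time spent writing checkpoints plus expected rework after failures.
    Ignores restart time; a rough guide only when checkpoint_cost << mtbf."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    checkpoint_cost = 0.25             # hypothetical: 15 minutes per checkpoint
    for mtbf in (24.0, 6.0, 1.0):      # hypothetical system MTBFs in hours
        tau = young_interval(checkpoint_cost, mtbf)
        waste = overhead_fraction(tau, checkpoint_cost, mtbf)
        print(f"MTBF {mtbf:5.1f} h: checkpoint every {tau:.2f} h, "
              f"estimated overhead {waste:.0%}")
```

In this approximation the overhead at the chosen interval grows as the MTBF shrinks, so a machine that fails more often spends a larger share of its time on checkpointing and rework even when the interval is tuned.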

Resilience, by contrast, is an application-centric paradigm that aims to protect applications from data corruption and Byzantine faults, Daly said. It aims to do so in a timely and efficient manner (considering tradeoffs in power, productivity and performance) and in the presence of hardware or software degradations and failures.

"Fault tolerance uses redundancy and replication to recover from failure," he said. "Resilience offers a more integrated approach in which the system works with applications to keep them running in spite of component failure."

-Elizabeth Leake, TeraGrid, and Anne Heavey, iSGTW

A report of the proceedings will be available on the TeraGrid Web site.

 Please fill out a short questionnaire about your fault treatment mechanisms for Alexandre Duarte, a Ph.D. student in Computer Science at the Universidade Federal de Campina Grande in Brazil who is investigating this topic in the context of the EELA-2 and OurGrid projects.