Opinion - Challenges to exascale computing
Supercomputing has been a major part of my education and career, from the late 1960s when I was doing atomic and molecular calculations as a physics doctoral student at the University of Chicago, to the early 1990s when I was general manager of IBM's SP family of parallel supercomputers.
The performance advances of supercomputers in these past decades have been remarkable. The machines I used as a student in the 1960s probably had a peak performance of a few million calculations per second or megaflops. Gigaflops (billions) peak speeds were achieved in 1985, teraflops (trillions) in 1997, and petaflops (a 1 followed by fifteen zeros) in 2008.
The supercomputing community is now aiming for exascale computing: 1,000,000,000,000,000,000 calculations per second. The pursuit of exascale-class systems was a hot topic at the recent SC09 supercomputing conference.
In the quest for the fastest machines, supercomputers have always been at the leading edge of advances in IT, identifying the key barriers to overcome and experimenting with technologies and architectures that generally then appear in more commercial products a few years later.
Through the 1970s and 1980s, the fastest supercomputers were based on vector architectures and used highly sophisticated technologies and liquid cooling methods to remove the large amounts of heat they generated.
By the late 1980s, these complex and expensive technologies ran out of gas. As the microprocessors used in personal computers and technical workstations were becoming increasingly powerful, you could now build supercomputers using these CMOS micros and parallel architectures at a much lower price than the previous generations of vector machines. A similar transition to microprocessor components and parallel architectures took place in the mainframes used in commercial applications.
Massively parallel architectures, using tens to hundreds of thousands of processors from the PC and Unix markets, have dominated supercomputing over the past twenty years. They got us into the terascale and petascale ranges. But they will not get us to exascale. Another massive technology and architectural transition now looms for supercomputing and the IT industry in general.
Anticipating the major challenges involved in the transition to exascale, the Department of Energy (DOE) and DARPA launched a series of activities around three years ago to start planning for such systems.
This DARPA ExaScale Computing Study provides a very good overview of the key technology challenges. The study identified four major challenges where current trends are insufficient, and disruptive technology breakthroughs are needed to make exascale computing a reality.
The Energy and Power Challenge is pervasive, affecting every part of the system. Today's leading edge petascale systems consume 2 to 3 megawatts (MW) per petaflop. It is generally agreed that an exaflop system should consume only around 20 MW; otherwise its operating costs would be prohibitive. The 1000-fold increase in performance from petascale to exascale must thus be achieved at no more than a 10-fold increase in power consumption.
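The arithmetic behind this target can be sketched in a few lines. The figures come from the paragraph above; the function name and the assumption that power scales linearly with performance at today's efficiency are illustrative only.

```python
# Figures from the text: 2 to 3 MW per petaflop today, about 20 MW for an exaflop.
PETAFLOPS_PER_EXAFLOP = 1000


def exascale_power_gap(mw_per_petaflop, target_mw=20.0):
    """Return (power at today's efficiency in MW, efficiency gain needed).

    Assumes, for illustration, that power would scale linearly with
    performance if efficiency stayed where it is today.
    """
    naive_mw = mw_per_petaflop * PETAFLOPS_PER_EXAFLOP
    return naive_mw, naive_mw / target_mw


for rate in (2.0, 3.0):
    naive_mw, gain = exascale_power_gap(rate)
    print(f"{rate} MW/petaflop: {naive_mw:,.0f} MW at today's efficiency, "
          f"{gain:.0f}x better efficiency needed for 20 MW")
```

Under that linear-scaling assumption, an exaflop machine built with today's efficiency would draw 2,000 to 3,000 MW, so energy efficiency has to improve by roughly 100 to 150 times to fit a 20 MW budget.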
Such stretch targets were actually achieved in the transition from terascale in the late 1990s to petascale now. But no one believes it can be done again with today's technologies, hence the assumption that a technology and architectural transition as profound as the one two decades ago is now required.
The Memory and Storage Challenge is a major consequence of the power challenge. The currently available main memories (DRAM) and disk drives (HDD) that have dominated computing in the last decade consume way too much power. New technologies are needed.
The Concurrency and Locality Challenge is another consequence of the power challenge. Over the past twenty years we have been able to achieve performance increases through a combination of faster processors and higher levels of parallelism. But we are no longer able to increase the performance of a single processing element by turning up the clock rate, due to power and cooling issues. We now have to rely solely on increased concurrency.
The top terascale systems of ten years ago had roughly 10,000 processing elements. Today's petascale systems are up in the low 100,000s. But because the only way to now increase performance toward an exascale system is massive parallelism, an exaflop supercomputer might have 100s of millions of processing elements or cores. Such massive parallelism will require major innovations in the architecture, software and applications for exascale systems. This DARPA Exascale Software Study provides a good overview of the software breakthroughs required.
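A rough way to see why concurrency has to carry the whole increase is to divide each system's peak rate by its number of processing elements. The element counts below are assumptions picked to match the approximate figures in the text (10,000, the low 100,000s, 100s of millions); they do not describe any particular machine.

```python
# Peak-rate targets in calculations per second.
TERA, PETA, EXA = 10**12, 10**15, 10**18

# Representative element counts: assumptions chosen to match the rough
# figures in the text, not measurements of any real system.
systems = [
    ("terascale", TERA, 10_000),
    ("petascale", PETA, 200_000),
    ("exascale", EXA, 200_000_000),
]


def per_element_rate(total_rate, elements):
    """Average calculations per second each processing element must deliver."""
    return total_rate // elements


for name, rate, elements in systems:
    print(f"{name:>9}: {elements:>11,} elements -> "
          f"{per_element_rate(rate, elements):,} calc/s each")
```

With these assumed counts, the per-element rate grows about 50-fold from the terascale row to the petascale row, but stays flat from petascale to exascale, so the entire 1000-fold gain has to come from having about 1000 times more elements.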
Finally, we have the Resiliency Challenge, that is, the ability of a system with such a huge number of components to continue operations in the presence of faults. An exascale system must be truly autonomic in nature, constantly aware of its status, and optimizing and adapting itself to rapidly changing conditions, including failures of its individual components. The exascale resiliency challenges are discussed in this DARPA report on System Resiliency at Extreme Scales.
There are vast business implications to such a massive technology and architectural transition. For one, the ecosystem of the past twenty years, where PCs have provided the components for parallel supercomputers, is now giving way to a new business ecosystem. Consumer electronics, mobile devices and embedded sensors are now the new partners of the extreme scale supercomputing community, because they share the same requirements for plentiful, powerful and inexpensive components that consume little power.
This transition to a new ecosystem already started about five years ago. IBM's Blue Gene family uses relatively low power, embedded cores as its processing elements, and Roadrunner's hybrid design includes the Cell processors originally developed for Sony's PlayStation 3.
The most powerful supercomputers in the US have generally been developed for, and first installed at, DOE's national labs, either as part of its Advanced Scientific Computing Research (ASCR) program in support of energy and environmental research, or the Advanced Simulation and Computing program in support of nuclear weapons research. These DOE labs typically work closely with the vendors on the requirements and design of such leading edge supercomputers.
To begin to understand the requirements for exascale machines, ASCR sponsored a series of town hall meetings, which held open discussions on the most critical and challenging problems in energy, the environment and basic science. These meetings were then followed by a series of technical workshops, each focusing on a specific scientific domain.
The DOE town halls and workshops have identified the opportunity for exascale computing to revolutionize the way we approach challenges in energy research, environmental sustainability and national security. They also identified the impact of exascale computing on key science areas like biology, astrophysics, climate science and nuclear physics.
One of their most compelling conclusions is that with exascale computing, we are reaching a tipping point in predictive science, an area with potentially major implications across a whole range of new, massively complex problems. Let me explain.
High end supercomputers are generally designed for either capability or capacity computing. Capability supercomputers dedicate the whole (or most of the) machine to solve a very large problem in the shortest amount of time. Capacity supercomputers, on the other hand, support large numbers of users solving different kinds of problems simultaneously.
While both kinds of supercomputing are very important, initiatives designed to push the envelope, like DOE's exascale project, tend to focus on the development of capability machines to address Grand Challenge problems like those mentioned above, that could not be solved in any other way.
Capability computing has been primarily applied to what is sometimes referred to as heroic computations, where just about the whole machine is applied to a single task. And, without a doubt, there are quite a number of problems that we will be able to address with machines 1000 times more powerful.
But, at least as exciting, is the potential for exascale computing to address a class of highly complex problems that have been beyond our reach, not just due to their sheer size, but because of their inherent uncertainties and unpredictability. The way to deal with such uncertainty is to simultaneously run multiple ensembles or copies of the same applications, using many different combinations of parameters, and thus be able to explore the solution space of these otherwise unpredictable problems. This will let us search for optimal solutions to many problems in science and engineering, as well as enable us to calculate the probabilities of extreme events.
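The ensemble idea can be illustrated with a toy sketch: run the same model many times over a grid of parameter values and random perturbations, then use the spread of outcomes to estimate how likely an extreme result is. Everything here is hypothetical, including the model (a noisy growth process standing in for a real simulation), the parameter names and the threshold.

```python
import itertools
import random


def toy_model(growth, noise, seed, steps=50):
    """Hypothetical stand-in for a real simulation: a noisy growth process."""
    rng = random.Random(seed)  # seeded so each ensemble member is reproducible
    x = 1.0
    for _ in range(steps):
        x *= 1.0 + growth + rng.gauss(0.0, noise)
    return x


def run_ensemble(growths, noises, members_per_combo=20, threshold=5.0):
    """Run every parameter combination several times and summarize the outcomes."""
    outcomes = []
    for growth, noise in itertools.product(growths, noises):
        for seed in range(members_per_combo):
            # Each member is independent, so on a real machine these runs
            # could be dispatched to different nodes at the same time.
            outcomes.append(toy_model(growth, noise, seed))
    extreme = sum(1 for x in outcomes if x > threshold)
    return {
        "members": len(outcomes),
        "p_extreme": extreme / len(outcomes),
        "largest": max(outcomes),
    }


summary = run_ensemble(growths=[0.01, 0.02, 0.03], noises=[0.02, 0.05])
print(summary)
```

The loop here runs sequentially for simplicity. The point of the sketch is only that the members do not depend on one another, which is what lets an ensemble fill a very large machine with copies of one problem.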
This new style of predictive modeling will help us apply more scientific methodologies to many kinds of problems, from climate studies to the design of safe nuclear reactors. Beyond science and engineering, there are many disciplines that will benefit from such predictive capabilities, from economics and medicine to business and government.
Ensemble computing has attributes of both capability and capacity computing. It devotes the whole machine to one problem, but it does so by running many copies of the problem in parallel with different initial conditions. Innovative techniques are already emerging to help developers better program and manage such ensembles.
Finally, it is important not to underestimate the impact of exascale breakthroughs on more capacity oriented machines, as well as on smaller machines that share the same technologies, architecture, software and applications. Many of the innovations that will enable us to develop exascale class supercomputers will yield relatively inexpensive petascale class systems as well as smaller ones. The wider the access to such families of systems, the richer the overall ecosystem, including applications, users and technologies.
In addition, these same exascale innovations will find wide usage in the more commercially oriented cloud computing systems. The technology requirements are quite similar, especially the need for low power, low cost components. They also share similar requirements for highly efficient, autonomic system management. One can actually view cloud-based systems as a kind of exascale class supercomputer designed to support embarrassingly parallel workloads, such as massive information analysis or huge numbers of sensors and mobile devices.
In its Strategy for American Innovation, the Obama administration listed exascale computing among the Grand Challenges of the 21st century in science, technology and innovation, that "will allow the nation to set and meet ambitious goals that will improve our quality of life and establish the foundation for the industries and jobs of the future." It explicitly called for
"An exascale supercomputer capable of a million trillion calculations per second - dramatically increasing our ability to understand the world around us through simulation and slashing the time needed to design complex products such as therapeutics, advanced materials, and highly-efficient autos and aircraft."
This is a truly exciting and important challenge.
This article originally appeared in Wladawsky-Berger's blog, and is reprinted here with his permission. You can join in on discussions about the opinions expressed in this article at the original blog post, here.