
iSGTW Feature - Turbocharge your job submissions


December 2007 - October 2008 plot of Falkon across various systems (ANL/UC TeraGrid 316-processor cluster, SiCortex 5832-processor machine, IBM Blue Gene/P 4K- and 160K-processor machines). Over the past year, Falkon has seen wide deployment and usage across a variety of systems: the TeraGrid, the SiCortex at Argonne National Laboratory, the IBM Blue Gene/P supercomputer at the Argonne Leadership Computing Facility (ALCF), and the Sun Constellation supercomputer on the TeraGrid.

Each blue dot in the figure represents a 60 second average of allocated processors, and the black line denotes the number of completed tasks.

In summary, the peak number of allocated processors was 163K, with 1.4 million CPU hours consumed across 164 million tasks, for an average task execution time of 31 seconds.
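These aggregate figures are mutually consistent; a quick back-of-the-envelope check (our sketch, not part of the original analysis) recovers the quoted average:

```python
# Sanity check: do 1.4 million CPU hours spread over 164 million tasks
# really average out to roughly 31 seconds per task?
cpu_hours = 1.4e6   # total CPU hours consumed
tasks = 164e6       # total tasks executed
avg_seconds = cpu_hours * 3600 / tasks
print(f"average task execution time: {avg_seconds:.1f} s")  # ~30.7 s
```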

Image courtesy of Ioan Raicu.

(Editor's note: Ioan Raicu and Ian Foster, both of the University of Chicago and Argonne National Laboratory, contributed this article.)

Applications that run thousands of jobs can cause headaches. Huge numbers of job submissions to a site often cause bottlenecks, make system administrators grumpy, and worse, bring down remote gateway nodes, rendering the resources useless and losing jobs in the process. Traditional techniques commonly used in the scientific community do not scale to today's - let alone tomorrow's - largest grids and supercomputers. But the new class of applications called Many Task Computing, discussed in the recent article "Many Task Computing: Bridging the performance-throughput gap," has spawned development of a new framework, called Falkon, that enables applications to scale up quite painlessly and use these large systems efficiently.

Minutes to milliseconds

Falkon (Fast And Light-weight tasK executiON) is designed to help restructure applications to reduce job wait time, network bandwidth and job submission overheads from minutes to milliseconds. It leaves many of the higher-overhead features, such as accounting and persistence, to the local resource managers or the applications to handle. Falkon focuses on efficient handling of many independent tasks on large-scale distributed systems with many processors.

Falkon has demonstrated vast improvements in performance and scalability for a wide variety of tasks - tasks with execution times ranging from milliseconds to hours, compute- and data-intensive tasks, and tasks with varying arrival rates. The improvements extend across diverse applications from astronomy to medicine, economic modeling and beyond, and to scales of billions of tasks on hundreds of thousands of processors.

One researcher who adopted Falkon is Andrew Binkowski at the Midwest Center for Structural Genomics at Argonne National Laboratory. Binkowski and his team model three-dimensional protein structures in their basic research towards drug design. Since proteins with similar structures tend to behave in similar ways, the team compares the modeled structures to existing, known proteins in order to predict their functions -- a computationally intensive task.

"As the Protein Data Bank (a repository of known proteins) expands almost exponentially, it becomes more difficult to coax desktop machines to do the types of analysis required," says Binkowski. "We turned to Falkon as a way to utilize our existing software applications."

Falkon's distributed architecture allows the task dispatchers to be spread over many nodes, partitioning the compute resources into smaller pools to improve overall system throughput and scalability. For example, on the IBM Blue Gene/P supercomputer at the full 160K-processor scale, the typical configuration runs one client and 640 dispatchers, each managing 256 executors, for a total of 160K executors.

Image courtesy of Ioan Raicu.
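The hierarchy in the figure comes down to a simple fan-out product. The sketch below (the function name is ours, not part of Falkon's API) reproduces the executor count for the Blue Gene/P configuration described in the caption:

```python
# Hypothetical sketch of Falkon's hierarchical deployment: one client
# fans out to many dispatchers, each managing a fixed pool of executors.
def total_executors(dispatchers: int, executors_per_dispatcher: int) -> int:
    """Number of executors reachable from a single client."""
    return dispatchers * executors_per_dispatcher

# Blue Gene/P configuration: 640 dispatchers, each managing 256 executors
print(total_executors(640, 256))  # 163840, i.e. the full 160K-processor scale
```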

What makes Falkon fly faster

The Falkon framework uses three novel techniques to enable rapid and efficient job execution and to improve application performance and scalability. First, multi-level scheduling, in which resource allocation for a job is separated from job dispatch, enables on-the-fly resource allocation and minimizes wait-queue times. Second, Falkon's distributed, streamlined task dispatcher achieves dispatch rates ten to a thousand times those of conventional centralized schedulers. Third, Falkon's data-aware scheduler can coordinate tasks and data so that data transfer from shared or parallel file systems and across the network is minimized.
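The data-aware scheduling idea can be illustrated with a toy policy. The sketch below is ours, not Falkon's actual implementation: prefer an executor that already holds a task's input in its local cache, and fall back to the least-loaded executor when no cached copy exists.

```python
# Illustrative data-aware scheduling sketch (hypothetical, not Falkon's code):
# send each task where its input data already lives, to avoid re-reading it
# from the shared file system; otherwise balance load.
def pick_executor(task_input, executors):
    """executors: list of dicts with 'cache' (set of files) and 'load' (int)."""
    cached = [e for e in executors if task_input in e["cache"]]
    pool = cached if cached else executors
    best = min(pool, key=lambda e: e["load"])
    best["load"] += 1  # account for the newly assigned task
    return best

executors = [
    {"name": "exec-a", "cache": {"protein1.pdb"}, "load": 3},
    {"name": "exec-b", "cache": set(), "load": 0},
]
# protein1.pdb is cached on exec-a, so the task goes there despite higher load
print(pick_executor("protein1.pdb", executors)["name"])  # exec-a
```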

We can ask bigger questions

"Falkon has allowed us to ask bigger questions and perform experiments on a scale never before attempted - or even thought possible," says Binkowski. "This is the difference between comparing a newly determined protein structure to a family of related proteins versus comparing it to the entire protein universe."

The team did all of this using existing software packages that were not designed for high-throughput or many-task computing, relying on Falkon to coordinate and drive the execution of many loosely coupled computations treated as "black boxes," without any application-specific code modifications.

"Whereas identifying similarities in protein binding pockets (for protein structure analysis) is characterized by millions of discrete jobs taking seconds to complete, docking and scoring a small-molecule compound (for drug discovery) can require several hours to converge on a solution. In both cases, we are able to tailor our workflows to achieve the best possible scientific results and still get the throughput and efficiency we need to take advantage of the large computing resources we have available."

-Ioan Raicu and Ian Foster

