The term "high throughput parallel computing" has been turning up at more and more conferences. What does it mean, and what does a grid have to go through to make it possible? We turned to Daniel Fraser, production coordinator for the Open Science Grid and senior fellow at the University of Chicago's Computation Institute, to learn more about HTPC.
iSGTW: High throughput jobs are usually done in parallel. What is different about high throughput parallel computing?
Fraser: Let's define some terminology first. We often talk about multi-core processors. On high performance computing systems, a common packaging practice is to put two (or four) processors, connected to a local pool of memory, onto a single board. We refer to this combination of cores and memory as a "machine." For example, a typical machine on the OSG consists of dual quad-core processors and 16 GB of memory. HTPC jobs reserve the whole machine. This allows one job to use all the cores and all the memory of a machine.
As for parallelism, it exists at two levels. First, users submit multiple "whole machine" jobs in parallel. This is referred to as high throughput parallel computing, or HTPC. Second, a job that uses an entire machine may be "parallelized" to use more than one core, although some jobs run on only a single core but use the machine's full memory.
iSGTW: Is it wasteful to let some of the cores sit idle?
Fraser: Not necessarily. There is always a balance with high performance computing. Some jobs require lots of cores, some jobs require lots of memory, some jobs require multiple arithmetic capabilities (such as a simultaneous mult + add), some jobs require access to high speed backplane networks, some require access to GPUs. Computers are meant to serve the researcher and if a researcher needs particular capabilities, then they need those capabilities.
To come back to my earlier example, since memory is more expensive than CPUs, one could argue that reserving a whole machine to access all the memory and run on only one core is a cost effective use of the high performance computing system. There is always something going unused. That is the nature of computing. In the end, getting the user's job run is what matters.
iSGTW: Without HTPC, what happens to the extra cores on a computing node?
Fraser: Under normal high throughput computing, jobs are scheduled to run as one per core.
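As a rough illustration of the difference, the sketch below compares the two ways of carving up the example machine mentioned earlier (dual quad-core, 16 GB). The class and function names are hypothetical, and the even split of memory across per-core slots is an assumption made for the arithmetic; real batch schedulers may divide memory differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    cores: int
    memory_gb: float


@dataclass(frozen=True)
class Slot:
    cores: int
    memory_gb: float


def per_core_slots(machine):
    """Regular high throughput computing: one single-core slot per core.

    Assumption: memory is shared out evenly between the slots.
    """
    share = machine.memory_gb / machine.cores
    return [Slot(cores=1, memory_gb=share) for _ in range(machine.cores)]


def whole_machine_slots(machine):
    """HTPC: a single slot that owns every core and all of the memory."""
    return [Slot(cores=machine.cores, memory_gb=machine.memory_gb)]


# Dual quad-core processors and 16 GB of memory, as in the example above.
example = Machine(cores=8, memory_gb=16.0)

htc = per_core_slots(example)
htpc = whole_machine_slots(example)

print(f"one job per core: {len(htc)} jobs, {htc[0].memory_gb} GB each")
print(f"whole machine:    {len(htpc)} job, {htpc[0].memory_gb} GB available")
```

Under these assumptions, a job that needs more than 2 GB does not fit in a per-core slot on this machine, but it does fit in a whole-machine slot, even if it only ever runs on one core.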
iSGTW: Why do researchers using grid computing need HTPC? Are there any particular fields of research that need this capability more than others?
Fraser: Not all researchers using grid computing require HTPC. It all boils down to the type of problem being solved. One common research pattern for high throughput computing is the parameter sweep, where multiple jobs run in parallel, each executing the same algorithm with different starting parameters. It may start out with each job running on a single core. However, as the research advances, it may become apparent to the researcher that she needs more memory than, say, 1 GB per job, or that her job could benefit from using more than one processor. HTPC is a pattern that helps researchers get their science done.
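The parameter sweep pattern can be sketched in a few lines of Python. This is a minimal illustration, not how jobs are actually submitted to the OSG: `simulate` is a hypothetical stand-in for a research code (here, iterating the logistic map), and the thread pool merely stands in for a scheduler that would run each parameter as a separate job.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(growth_rate, steps=100):
    """Hypothetical research job: iterate the logistic map from x = 0.5."""
    x = 0.5
    for _ in range(steps):
        x = growth_rate * x * (1.0 - x)
    return x


def run_sweep(growth_rates, max_workers=4):
    """Run the same algorithm once per starting parameter.

    Returns a dict mapping each parameter to its result. On a grid, each
    call to simulate would be an independent job; the thread pool here is
    only a local stand-in for that.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(simulate, growth_rates))
    return dict(zip(growth_rates, results))


if __name__ == "__main__":
    sweep = run_sweep([1.5, 2.0, 2.5, 3.2])
    for rate, value in sweep.items():
        print(f"growth_rate={rate}: final value {value:.4f}")
```

Because the runs do not depend on one another, the sweep scales by adding jobs. If a single run outgrows one core or its share of memory, that is the point at which a whole-machine job becomes attractive.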
iSGTW: You've been leading the effort to enable HTPC across OSG. When did that project start, and how far along is it? What are some of the challenges you had to surmount to enable HTPC for OSG?
Fraser: This project started just over two years ago in response to a grant from the NSF. To make this work, we needed to surmount a handful of challenges. That's why they call it research, right?
The first step was to work with some users who needed HTPC capabilities and structure their jobs so that they could take advantage of HTPC resources. The next step was to turn on whole-machine access in the batch scheduler systems (e.g. PBS, LSF, Condor, ...) at some test sites.
iSGTW: Was that like flipping a switch or changing a setting, or did it require development work?
Fraser: In some cases (PBS, LSF), this required finding and flipping a switch. In other cases (e.g. with Condor) there was some development work.
Next we needed to enable the OSG pilot based job submission mechanisms to recognize and properly handle HTPC jobs; this required development work.
Then we realized that the OSG accounting system (Gratia) needed some updating to be able to properly account for jobs that run on whole machines. This has been done, but it hasn't been fully tested yet.
iSGTW: So what is the status of all of this work at the moment?
Fraser: Today we are in a state of limited production on the OSG. We have HTPC enabled on multiple sites within the OSG. From the researcher's perspective, access to HTPC resources is transparent: it is no different from accessing regular high throughput computing resources within the OSG.
Of course we are still overcoming a few challenges related to effective scheduling of HTPC resources on the OSG using Condor. But we hope to be using this capability much more routinely towards the end of the year.
iSGTW: Did you encounter any surprises while you worked on this system?
Fraser: One of the interesting discoveries we made this year at RENCI was that by enabling whole-node scheduling for HTPC, we had also made the OSG usable for a class of computational chemistry researchers who require access to GPUs. The HTPC model allows researchers to reserve whole machines with GPU resources as well as CPU and memory, and thereby take advantage of this capability on the OSG. In effect, this opened the OSG up to a whole new class of researchers.