Feature - Getting GPUs on the grid

Russ Miller, principal investigator at CI Lab, stands in front of the server rack that holds Magic, a synchronous supercomputer that can achieve up to 50 teraflops. Image courtesy of CI Lab.

Using graphics processing units (GPUs) to boost the performance of computer clusters and supercomputers is all the rage. But what happens when you put these chips on a full-fledged grid?

Meet "Magic," a supercomputing cluster based at the University at Buffalo's CyberInfrastructure Laboratory (CI Lab). On the surface, Magic is like any other cluster of Dell nodes. "But then attached to each Dell node is an nVidia node, and each of these nVidia nodes have roughly 1000 graphical processing units," said Russ Miller, the principal investigator for CI Lab. "Those GPUs are the same as the graphical processing unit in many laptops and desktops."

That's the charm of these chips: because they are mass-manufactured for use in your average, run-of-the-mill computer, they are an extremely inexpensive way of boosting computational power. That boost comes at a price, however.

"These roughly 1000 processors on each nVidia node are programmed in a synchronous process, basically bringing us back to programming methods of the 1960s," said Miller.

The parallel programs modern supercomputers run are already quite difficult to write. Synchronous programming is a more limited form of parallel programming. "Parallel means doing multiple things at the same time," explained Miller. "Synchronous means doing the exact same thing at the same time."

Synchronous computations could be processing very different sets of data, as long as the algorithm used is identical. For example, an algorithm could instruct two people to kick the object in front of them. If one is playing soccer while the other is learning self-defense, the instruction may be identical, but the context, meaning and effects are quite different. "The job becomes very demanding for a programmer to be able to exploit these roughly 13 000 processors that we have in one rack. But if they can, the returns are huge," said Miller. "We can get roughly 50 teraflops of computing out of one rack of systems."
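The "same instruction, different data" idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Magic's GPUs are actually programmed; the names kick and run_synchronously are hypothetical and chosen to mirror the soccer and self-defense analogy above.

```python
# Illustrative sketch only: plain Python standing in for synchronous hardware.
# All names here are hypothetical and do not come from the article.

def kick(target):
    """The single shared instruction that every processor executes."""
    return f"kicked the {target}"


def run_synchronously(instruction, data_per_processor):
    """Apply the same instruction to each processor's own piece of data.

    On a real GPU these applications would happen at the same time;
    here a list comprehension simply models one lockstep pass.
    """
    return [instruction(item) for item in data_per_processor]


# Two "processors" receive the identical instruction but different data.
data = ["soccer ball", "training pad"]
results = run_synchronously(kick, data)
print(results)
```

The instruction never changes between processors; only the data does, which is the constraint that makes synchronous programs harder to write than general parallel ones.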

In a perfect world, scientists could submit their computational jobs to a scheduling application, and the scheduler would take care of finding computing resources. "I want to be able to be lying on a beach in Cancun with my iPhone, and hit a button, and not have to worry about where my data is, or what resources it's using," said Miller. "But there's no way you could submit to the grid and have it assign a GPU for you. That logic is not built into the software stack just yet."

To do so, resource providers would need the ability to specify that they can only handle synchronous computations, and users would need to be able to specify what sorts of resources their computations can exploit.
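A minimal sketch of what such matching might look like, assuming each resource advertises the kinds of computation it can handle and each job states the kind it needs. The Resource and Job classes, their field names, and the cluster names are all hypothetical; this does not describe any real grid scheduler.

```python
from dataclasses import dataclass

# Hypothetical data model: none of these names come from real grid software.


@dataclass(frozen=True)
class Resource:
    name: str
    supported_models: frozenset  # kinds of computation the provider accepts


@dataclass(frozen=True)
class Job:
    name: str
    required_model: str  # kind of computation the user's code performs


def eligible_resources(job, resources):
    """Return the names of resources whose advertised models cover the job."""
    return [r.name for r in resources if job.required_model in r.supported_models]


resources = [
    # A GPU cluster that can only handle synchronous computations.
    Resource("gpu_cluster", frozenset({"synchronous"})),
    # A conventional cluster, assumed here to accept both kinds of job.
    Resource("cpu_cluster", frozenset({"synchronous", "parallel"})),
]

print(eligible_resources(Job("lockstep_job", "synchronous"), resources))
print(eligible_resources(Job("general_job", "parallel"), resources))
```

With both sides declaring what they offer or need, a scheduler could route synchronous jobs to GPU clusters without the user naming a specific machine.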

In the meantime, Magic has been hooked up to Open Science Grid and the New York State Grid since February. And instead of relying on a high-tech scheduler to assign jobs to the cluster, CI Lab has been relying on much older 'technology' - word of mouth.

"One of the biggest sources of users we have seen so far is just word of mouth," said Kevin Cleary, the system administrator for Magic. "So getting the word out that on these nodes we do have these massive amounts of power available."

Once a researcher is aware that Magic is available, he or she can tell the scheduler to submit the job directly to the GPU cluster.

As other GPU clusters come online, word of mouth may become an impractical solution. In the meantime, however, it is working well for Magic. Said Cleary, "In the past week, nearly 2500 jobs have been run on this cluster with a 98 per cent success rate."

-Miriam Boon, iSGTW
