PRObE testbed: A first of its kind makes large-scale systems research possible

There is no doubt about it: to support future scientific discovery, computer systems will need to grow larger and many times faster. The next big jump, exascale, will be 1,000 times faster than the petascale computers that debuted in 2008. Many US computing centers, national labs, and centers of excellence provide researchers with high-performance computer resources for running software. However, a testbed at scale - where researchers can study how computer systems grow and what happens when they grow - has been hard to come by.

PRObE (Parallel Reconfigurable Observational Environment), an experimental cluster available to systems researchers across the US, fills that void. Sponsored by the US National Science Foundation, it serves as a testbed for large-scale experiments that would be impossible at smaller scales. The cluster is located at the New Mexico Consortium (NMC) in Los Alamos, US.

"PRObE is a unique resource where systems researchers have full access to the servers, including the hardware, while their experiment is running," explains Andree Jacobson, PRObE project manager, and computer and information systems manager at NMC. "In order to predict what will happen with software and hardware at scale, you need to have a testbed at scale to try things out on. Here, researchers can really tinker; they can power off computers, unplug cables, crash hardware or software, modify the network, all while their experiment is running - something impossible at the national centers."

Available:

Marmot - 128 nodes, 256 cores at CMU

  • Dual socket, single core AMD Opteron, 16GB per node
  • SDR Infiniband, 2TB disk per node

Denali - 64+ nodes, 128+ cores at NMC

  • Dual socket, single core AMD Opteron, 8GB per node
  • SDR Infiniband, 2x1TB disk per node

Kodiak - 1,024 nodes, 2,048 cores at NMC

  • Dual socket, single core AMD Opteron, 8GB per node
  • SDR Infiniband, 2x1TB disk per node, 1.8PB total

Slated for July 2013:

Susitna - 34 nodes, 4xAMD6272 = 64 cores per node, at CMU

  • 128 GB, 40GE, FDR10 IB
  • 1 TB OS + 2x3TB disk
  • 1.2 TFLOPS double-precision K20 GPUs
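
The core counts and the Kodiak storage figure in the list above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, Denali's "64+" nodes are treated as a lower bound of 64, and the assumption that disk sizes are quoted in decimal terabytes is ours, not the article's. Under that assumption, Kodiak's 1,024 nodes with two 1TB disks each come to 2,048 decimal terabytes, or roughly 1.8 when expressed in binary pebibytes, which is one plausible reading of the "1.8PB total" figure.

```python
# Figures copied from the cluster list above; the field names are illustrative only.
CLUSTERS = {
    "Marmot": {"nodes": 128, "cores_per_node": 2, "disk_tb_per_node": 2},
    "Denali": {"nodes": 64, "cores_per_node": 2, "disk_tb_per_node": 2},  # "64+" nodes: lower bound
    "Kodiak": {"nodes": 1024, "cores_per_node": 2, "disk_tb_per_node": 2},
}

BYTES_PER_TB = 10**12   # decimal terabyte (assumed unit for the per-node disk sizes)
BYTES_PER_PIB = 2**50   # binary pebibyte


def total_cores(spec):
    """Dual socket, single core means two cores per node."""
    return spec["nodes"] * spec["cores_per_node"]


def raw_disk_bytes(spec):
    """Aggregate raw disk capacity across all nodes, in bytes."""
    return spec["nodes"] * spec["disk_tb_per_node"] * BYTES_PER_TB


def to_pib(num_bytes):
    """Convert a byte count to binary pebibytes."""
    return num_bytes / BYTES_PER_PIB


if __name__ == "__main__":
    for name, spec in CLUSTERS.items():
        cores = total_cores(spec)
        pib = to_pib(raw_disk_bytes(spec))
        print(f"{name}: {cores} cores, {pib:.2f} PiB raw disk")
```

The totals for Marmot and Kodiak match the 256 and 2,048 cores quoted in the list.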

"Our hardware is not more powerful than you'll find at the centers, but we have many more machines to do these kinds of tests on than, say, at a university or in the high-performance computing community." The testbed computers are, in fact, decommissioned machines donated by Los Alamos National Lab (LANL), located just across the street from NMC.

"Los Alamos National Laboratory regularly sets aside machines to make way for equipment that offers more efficiency per watt and per dollar," says Garth Gibson, a professor of computer science and engineering at Carnegie Mellon University in Pennsylvania, US. Speaking at the 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing in New York City, US, Gibson acknowledged the critical necessity of testing at scale. "Not enough of our students have actually experienced large scale before. Something you can do on 12 processors is not the same problem as it is on a thousand."

The mastermind behind PRObE is Gary Grider, division leader for high-performance computing at LANL. "The idea actually came about in 2006, while I was on a plane working on paperwork to dispose of a computer. At that time, I was involved with the interagency high-end computing working group at LANL, which was trying to build a community around large-scale storage and file systems. One of the outcries of that community was that they had no place to try out ideas - a system that could be disrupted."

It took a while to bring the PRObE project to fruition, but Grider has always had his eyes set squarely on the future: "Not only do we need researchers building systems software for high-performance machines; we need people who are able to run and operate those machines." This conviction ultimately led to summer enrichment opportunities for third-year undergraduates. The Computer System, Cluster, and Networking Summer Institute (CSCNSI) at NMC is aimed at students already engaged in a computer science, computer engineering, or similar major.

Following a merit-based selection process, 12 highly sought positions are awarded to third-year undergraduates, who learn the ins and outs of HPC tools, booting massively parallel systems, loading software stacks, and understanding security concerns. After being supported through this two-week cluster-building 'bootcamp' by dedicated faculty, the three-student teams are partnered with a mentor from LANL, usually a systems researcher or administrator who has a project waiting for each group.

Upon completion of the program, the students' names are provided (with their consent) to all of the national labs and high-performance centers, as potential future employees with advanced training as systems administrators. "Sixty to seventy students have been through the program, and the track record of placement is phenomenal - even better than here at LANL, and we have 1,200 students a year," says Grider.

Supercomputers are never going to get smaller, concedes Jacobson. "There are going to be more nodes and more cores, and it will be hard to predict what will happen with software and hardware in the future, unless you have a testbed at scale like PRObE. If PRObE can help in understanding how computer systems grow and what will happen as they grow, this project will have made a major contribution to the future of computing."

The PRObE clusters, named Marmot, Denali, and Kodiak, are all currently available. Brand-new hardware, Susitna, is slated for release in July 2013. PRObE is currently accepting applications to use Kodiak, its largest cluster. The two smaller clusters, Marmot and Denali, are available for staging, and principal investigators can log on to the portal to request new projects. Success on the staging clusters is required before applications for time on Kodiak will be approved.

PRObE is a collaboration between the US National Science Foundation, New Mexico Consortium, Carnegie Mellon University, and the University of Utah.

For more about PRObE, visit the FAQ, portal, or website, or join the Google Group.
