- It’s difficult to ensure that all components of a supercomputer operate efficiently
- The XDMoD tool measures quality of service, system performance, and user job performance
- Future improvements will also track cloud-based metrics
As the name implies, supercomputers are pretty special machines. Researchers from every field seek out their high-performance capabilities, but time spent using such a device is expensive. As recently as 2015, it took the same amount of energy to run Tianhe-2, the world’s second-fastest supercomputer, for a year as it did to power a 13,501-person town in Mississippi.
And that’s not to mention the initial costs associated with purchase, as well as salaries for staff to help run and support the machine. Supercomputers are kept incredibly busy by their users, often oversubscribed, with thousands of jobs in the queue waiting for others to finish.
With computing time so valuable, managers of supercomputing centers are always looking for ways to improve performance and speed throughput for users. This is where Tom Furlani and his team at the University at Buffalo’s Center for Computational Research come in.
Thanks to a grant from the National Science Foundation (NSF) in 2010, Furlani and his colleagues have developed the XD Metrics on Demand (XDMoD) tool to help organizations improve production on their supercomputers and better understand how they are being used to enable science and engineering.
"XDMoD is an incredibly useful tool that allows us not only to monitor and report on the resources we allocate, but also provides new insight into the behaviors of our researcher community," says John Towns, PI and Project Director for the Extreme Science and Engineering Discovery Environment (XSEDE).
Canary in the coal mine
Modern supercomputers are complex combinations of compute servers, high-speed networks, and high-performance storage systems. Each of these areas is a potential point of underperformance or even outright failure. Add system software and the complexity only increases.
With so much that can go wrong, a tool that can identify problems or poor performance as well as monitor overall usage is vital. XDMoD aims to fulfill that role by performing three functions:
1. Job accounting – XDMoD provides metrics about utilization, including who is using the system and how much, what types of jobs are running, how long jobs wait in the queue, and more.
2. Quality of service – The complex mechanisms behind HPC often mean that managers and support personnel don’t always know if everything is working correctly—or they lack the means to ensure that it is. All too often this leaves users serving as “canaries in the coal mine,” alerting admins only after they’ve stumbled onto an issue.
To solve this, XDMoD launches application kernels daily that establish performance baselines for the cluster in question. If these kernels show that something that should take 30 seconds is now taking 120, support personnel know they need to investigate. XDMoD’s monitoring of the Meltdown and Spectre patches is a perfect example—the application kernels allowed system personnel to quantify the effects of the patches put in place to mitigate the chip vulnerabilities. (A simplified sketch of this baseline-check idea appears after this list.)
3. Job-level performance – Much like job accounting, job-level performance zeroes in on usage metrics. However, this task focuses more on how well users’ codes are performing. XDMoD can measure the performance of every single job, helping users improve the efficiency of their jobs or even figure out why they failed.
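To make the quality-of-service idea in item 2 more concrete, here is a minimal sketch of a daily baseline check. It is an illustration only, not XDMoD’s actual implementation: the kernel script name, the 30-second baseline, and the alert threshold are all assumptions chosen for the example.

```python
import subprocess
import time

# Hypothetical application kernel: a short, representative job run on the
# cluster every day. The script name is assumed for illustration only.
KERNEL_CMD = ["./run_benchmark_kernel.sh"]

BASELINE_SECONDS = 30.0   # expected runtime, established from past runs
ALERT_FACTOR = 2.0        # flag anything more than 2x slower than baseline

def run_kernel_and_check():
    """Run the kernel, time it, and flag a regression if it drifts far from baseline."""
    start = time.monotonic()
    subprocess.run(KERNEL_CMD, check=True)
    elapsed = time.monotonic() - start

    if elapsed > BASELINE_SECONDS * ALERT_FACTOR:
        # A real deployment would notify support staff (ticket, email, dashboard)
        # rather than just printing a message.
        print(f"QoS regression: kernel took {elapsed:.1f}s "
              f"(baseline {BASELINE_SECONDS:.1f}s) -- investigate.")
    else:
        print(f"Kernel OK: {elapsed:.1f}s (baseline {BASELINE_SECONDS:.1f}s).")

if __name__ == "__main__":
    run_kernel_and_check()
```

In practice such a check would be scheduled (for example via cron or the batch system itself) so that slowdowns surface before users notice them.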
Furlani also expects that XDMoD will soon include a module to help quantify the return on investment (ROI) for these expensive systems, by tying the supercomputer’s usage to its users’ external research funding.
Thanks to its open-source code, XDMoD’s reach extends to commercial, governmental, and academic supercomputing centers worldwide, including centers in England, Spain, Belgium, Germany, and many other countries.
Future features
In 2015, the NSF awarded the University at Buffalo a follow-on grant to continue work on XDMoD. Among other improvements, the project will include cloud computing metrics. Cloud use is growing all the time, and the jobs run there look very different from a metrics standpoint.
For a typical HPC job, Furlani explains, the process starts with a researcher requesting resources, such as how many processors and how much memory they need. But in the cloud, a virtual machine may stop running and then start again. What’s more, a cloud-based system can add and remove cores and memory on the fly. This makes tracking performance more challenging.
“Cloud computing has a beginning, but it doesn’t necessarily have a specific end,” Furlani says. “We have to restructure XDMoD’s entire backend data warehouse to accommodate that.”
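The accounting difference Furlani describes can be pictured with a small sketch. The record shapes and field names below are assumptions for illustration, not XDMoD’s schema: a batch HPC job has fixed resources and a single start and end, while a cloud instance’s core count changes over a series of events, so its usage has to be summed interval by interval up to whatever cutoff the report uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BatchJob:
    # Traditional HPC job: resources are requested up front and stay fixed.
    cores: int
    start_hour: float
    end_hour: float

    def core_hours(self) -> float:
        return self.cores * (self.end_hour - self.start_hour)

@dataclass
class CloudInstance:
    # Cloud VM: a list of (timestamp_hour, active_cores) events.
    # Cores can grow, shrink, or drop to zero when the VM is stopped,
    # and there may be no definite "end" at reporting time.
    events: List[Tuple[float, int]]
    report_hour: float  # accounting is cut off at reporting time

    def core_hours(self) -> float:
        total = 0.0
        boundaries = self.events[1:] + [(self.report_hour, 0)]
        for (t, cores), (t_next, _) in zip(self.events, boundaries):
            total += cores * (t_next - t)
        return total

# A fixed 64-core job that ran for 3 hours:
job = BatchJob(cores=64, start_hour=0.0, end_hour=3.0)

# A VM that started with 8 cores, resized to 16, stopped, then restarted with 4:
vm = CloudInstance(events=[(0.0, 8), (1.0, 16), (2.0, 0), (5.0, 4)], report_hour=6.0)

print(job.core_hours())  # 192.0
print(vm.core_hours())   # 8*1 + 16*1 + 0*3 + 4*1 = 28.0
```

The point of the sketch is simply that the cloud record has no single end time or fixed size, which is why the backend data warehouse has to change to accommodate it.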
Regardless of where XDMoD goes next, tools like this will continue to shape and redefine what supercomputers can accomplish.