- Large data centers underpin many aspects of modern life from science to entertainment
- These centers use 3% of the world’s electricity and that use is only growing
- Smarter choices and monitoring can lead to lower power consumption and greener computing
If we’re going to rapidly develop a vaccine for COVID-19, discover new energy sources, and investigate the origins of the universe, we’re going to need more compute power. But we can’t build bigger, faster computers without considering their increased demand for power, cooling, and physical real estate.
Large data centers support our public transportation systems, the global supply chain, and even virtual Netflix parties where people under stay-at-home orders can simulate a movie theater and watch 6 Underground together from multiple locations.
Globally, these data centers consume three percent of the world’s electricity, and that number will continue to rise unless decisions are made to minimize their energy usage even as we increase their use. So how can we solve our most pressing problems without destroying the planet?
By learning more about how current data centers work, says Melissa Romanus, a data management engineer at the National Energy Research Scientific Computing Center (NERSC) in Berkeley, CA.
NERSC is the US Department of Energy's (DOE) mission-science supercomputer facility. More than 7,000 scientists use NERSC resources to further their science, from developing climate models and researching new materials to experimenting with high-energy physics.
Romanus works with the Operations Technology Group at NERSC, which collects heterogeneous data from multiple sources into a central data warehouse called OMNI (Operations Monitoring and Notification Infrastructure). The group monitors the health of the data center's systems, storage, and facility environment, and analyzes the data to correlate information and make better business decisions.
Romanus believes this information can help operators make better procurement choices toward more energy-efficient equipment, create job-scheduling policies that account for the time of day when running high-usage tasks, or choose equipment that can be remotely managed.
“In a supercomputer facility we have networks, cooling, air moving through the facility, all the nodes and processors,” says Romanus. “All of these things have metrics associated with them that could be optimized to be more energy efficient.”
This centralized data warehouse proved especially valuable when planning for Perlmutter, a pre-exascale system slated for later this year. It will be three times faster than Cori, NERSC's current supercomputer, which is ranked #13 on the Top500 list of the world's fastest supercomputers.
The NERSC facility currently has several electrical substations dedicated to compute resources, and a mechanical substation dedicated to handling air and water environments. Initial estimates suggested that Perlmutter would require an additional mechanical substation.
“But when we looked at the operational data over the last two years for our other HPC machines, we saw that we were using less than 1 megawatt of capacity from the existing mechanical substation,” says Romanus.
“So we found out that we don’t actually need to get another whole dedicated mechanical substation for Perlmutter,” says Romanus. “Which resulted in a savings of about $2.5 million for Berkeley Lab.”
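The capacity check the team describes can be sketched in a few lines: scan the operational power readings for the peak draw and compare it against the substation's rated capacity. The capacity figure and readings below are invented for illustration, not actual NERSC values.

```python
# Hypothetical sketch of a substation headroom check, in the spirit of
# the analysis NERSC ran over two years of OMNI power data.
# All numbers here are made up for illustration.

SUBSTATION_CAPACITY_MW = 3.0  # assumed rated capacity (illustrative)

# Pretend these are hourly mechanical-substation power readings, in MW,
# pulled from the operational data warehouse.
readings_mw = [0.62, 0.71, 0.58, 0.85, 0.77, 0.90, 0.66]

peak_mw = max(readings_mw)
headroom_mw = SUBSTATION_CAPACITY_MW - peak_mw

print(f"Peak draw: {peak_mw:.2f} MW, headroom: {headroom_mw:.2f} MW")
if peak_mw < 1.0:
    print("Existing substation has capacity to spare; no new build needed.")
```

In practice a team would look at years of fine-grained time series rather than a handful of samples, but the decision logic is the same: measured peak demand, not worst-case estimates, drives the procurement call.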
Making new connections
Though energy efficiency was one of the data warehouse's early successes, its scope has grown to encompass much more.
“Energy efficiency is extremely important,” says Romanus. “But over time we added more data sources as people found them valuable, and it has really opened up new questions.”
Romanus is particularly interested in the new insights that can be realized by having so many different kinds of data available in a centralized location. She points to potential correlations between environmental and application factors.
“You can look at what was the state of the rack, or the temperature, or the air moving through the rack at the time a specific job was running,” says Romanus. “We’re trying to help people make those connections.”
OMNI also collects data on network traffic—who's logging in, which ports they're targeting, how much data is being transferred, and more. The network team uses that data to find bottlenecks and improve the user experience.
For example, a climate change researcher may be transferring data from an instrument thousands of miles away into a storage area at NERSC before performing computation on Cori. OMNI lets managers see which piece of the workflow could cause setbacks.
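A minimal version of that bottleneck hunt is to aggregate transfer volume by network link and flag the busiest one. The record format, link names, and byte counts below are invented for illustration; the article does not describe OMNI's actual schema.

```python
from collections import defaultdict

# Hypothetical sketch: summing transfer volume per link to spot
# where a workflow might be slowing down. Data is illustrative.
transfers = [
    {"link": "border-router-1", "bytes": 4_000_000_000},
    {"link": "border-router-1", "bytes": 6_500_000_000},
    {"link": "storage-switch-2", "bytes": 1_200_000_000},
]

bytes_per_link = defaultdict(int)
for t in transfers:
    bytes_per_link[t["link"]] += t["bytes"]

# The most heavily loaded link is the first place to look for a bottleneck.
busiest = max(bytes_per_link, key=bytes_per_link.get)
print(busiest, bytes_per_link[busiest])
```

With all the traffic data in one warehouse, the same grouping can be cut by user, port, or time window to localize a slowdown anywhere along a workflow like the one above.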
The holistic data view is also valuable from a cybersecurity perspective. “Lots of data provides us with the potential to see more patterns,” says Romanus. “For example, if we’re seeing a lot of connections from somewhere we’re not expecting, we can observe it in real time and log into that switch and see, ‘Hey, what’s going on here?’”
The current moment is especially challenging, she adds. “People are working from home and that means our cybersecurity team sees more unusual types of patterns now as people have more idle time.”
Visualizing the rewards
“I also love the visual aspect to it,” she continues. “It’s very rewarding to perform a query and have the system return an interesting graph. You can ask questions and get visual feedback within minutes. It’s been tremendously valuable to many groups at the lab.”
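The query-to-graph workflow Romanus describes boils down to aggregating a metric over time and handing the result to a plotting tool. The sketch below averages a (made-up) temperature metric by hour; the sample values and labels are invented, and the real system's query language and dashboards are not specified in the article.

```python
from statistics import mean

# Hypothetical sketch: aggregate a metric by hour, producing a series
# ready to plot. Samples are (hour, temperature in Celsius), invented.
samples = [
    ("02:00", 21.5), ("02:00", 22.1),
    ("03:00", 24.8), ("03:00", 25.2),
]

hours = sorted({h for h, _ in samples})
series = [(h, round(mean(v for hh, v in samples if hh == h), 2))
          for h in hours]
print(series)
```

A dashboard tool then turns a series like this into the “interesting graph” within minutes, which is what makes exploratory questions cheap to ask.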
Another bonus is that this work won't just benefit NERSC and other DOE facilities. OMNI keeps licensing costs to a minimum by using as much open-source software as possible for its administration, data collection, and visualization.
“Our goal is to have this methodology be applicable to other data centers,” says Romanus. So maybe one day, humanity’s insatiable appetite for all things data will have a lighter environmental impact. We can hope.