Modern research tools like high-performance computers and particle colliders are generating so much data, so quickly, that many scientists fear they will not be able to keep up with the deluge. Now, for the first time, Berkeley researchers have designed strategies for extracting interesting data from massive scientific datasets, querying a 32-terabyte, trillion-particle dataset in just three seconds.
"These instruments are capable of answering some of our most fundamental scientific questions, but it is all for nothing if we can't get a handle on the data and make sense of it," said Surendra Byna of the Lawrence Berkeley National Laboratory's (Berkeley Lab's) Computational Research Division (CRD).
That's why researchers from Berkeley Lab's CRD, the University of California, San Diego (UCSD), Los Alamos National Laboratory, Tsinghua University, and Brown University teamed up to develop software strategies for storing, mining, and analyzing massive datasets - specifically, for data generated by a state-of-the-art plasma physics code called VPIC.
When the team ran VPIC on the Department of Energy's National Energy Research Scientific Computing Center's (NERSC's) Cray XE6 'Hopper' high-performance computer, they generated a 3D dataset of a trillion particles to better understand magnetic reconnection in plasmas. Magnetic reconnection is a physical process in which magnetic topology is rearranged and magnetic energy is converted into kinetic energy, thermal energy, and particle acceleration.
VPIC simulated the process in thousands of time-steps, periodically writing a 32-terabyte file - roughly five times the data held by the world's largest library, the US Library of Congress - to disk at specified times. Each time-step was a frame in the larger simulation.
Every 32-terabyte file was written to disk in about 20 minutes, at a sustained rate of 27 gigabytes per second. By applying an enhanced version of FastQuery, a tool for indexing and querying large, complex datasets, the team indexed this massive dataset in about 10 minutes, and then queried it in three seconds for interesting features to visualize.
Trillions of particles require exascale computing
According to Homa Karimabadi, who leads the space physics group at UCSD, one of the major unsolved mysteries in magnetic reconnection is the conditions and details of how energetic particles are generated. Until recently, the closest that anybody had come to studying this was by looking at 2D simulations.
"To answer these questions we need to take into full account additional effects such as flux-rope interactions and resulting turbulence that occur in 3D simulations," Karimabadi said. "But, as we add another dimension, the number of particles in our simulations grows from billions to trillions. And it is impossible to pull up a trillion-particle dataset on your computer screen."
To address these challenges, Karimabadi joined forces with the ExaHDF5 team, a DOE-funded collaboration developing high-performance input/output (I/O) and analysis strategies for future exascale computers.
A scalable storage approach for a successful search
According to Byna, VPIC models magnetic reconnection by breaking down the 'big picture' into pieces, each of which is assigned, using the Message Passing Interface (MPI), to a group of processors to compute.
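The partitioning idea can be sketched in a few lines. This is an illustrative toy, not VPIC code: it splits a 3D grid into rectangular sub-domains, one per MPI rank, with all names and grid sizes hypothetical.

```python
# Illustrative sketch (not VPIC code): how a 3D simulation grid might be
# split into MPI domains, one per group of processors.

def split_grid(nx, ny, nz, px, py, pz):
    """Partition an nx x ny x nz grid among px*py*pz MPI ranks.

    Returns a dict mapping each rank to the [start, end) index
    ranges of the sub-grid it owns along each axis.
    """
    domains = {}
    rank = 0
    for i in range(px):
        for j in range(py):
            for k in range(pz):
                domains[rank] = (
                    (i * nx // px, (i + 1) * nx // px),
                    (j * ny // py, (j + 1) * ny // py),
                    (k * nz // pz, (k + 1) * nz // pz),
                )
                rank += 1
    return domains

# Example: a hypothetical 2000^3 grid split among 20,000 ranks (50 x 20 x 20).
domains = split_grid(2000, 2000, 2000, 50, 20, 20)
print(len(domains))   # 20000 domains, one per MPI rank
print(domains[0])     # ((0, 40), (0, 100), (0, 100))
```

Each rank then advances only the particles inside its own sub-grid, exchanging boundary information with its neighbors.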
In the original implementation of VPIC, each MPI domain generates a binary file once it finishes processing its piece. One major limitation of this approach is that the number of files generated for large simulations becomes unwieldy. The team's largest VPIC run contained about 20,000 MPI domains, producing 20,000 binary files per time-step.
"It takes a really long time to perform a simple Linux search of a 20,000-file directory. Ultimately, these limitations become a bottleneck to scientific analysis and discovery," Byna said.
By incorporating H5Part, a high-performance parallel data interface to HDF5, into the VPIC codebase, the team overcame these challenges. This modification creates one shared HDF5 file per time-step instead of 20,000 independent binary files. And because most visualization tools read HDF5 files, it also eliminates the need to re-format the data.
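The core idea behind the shared file can be shown with a stdlib-only sketch (not H5Part or HDF5): each domain writes its particle block at a computed offset in one file, so a reader opens a single file rather than scanning a 20,000-file directory. All sizes and names here are hypothetical stand-ins.

```python
# Conceptual sketch: one shared file per time-step instead of one file per
# MPI domain. Each domain's block lands at a known offset -- the same idea
# that lets H5Part ranks write a single HDF5 file collectively.

import os
import struct
import tempfile

NUM_DOMAINS = 4            # stand-in for ~20,000 MPI domains
PARTICLES_PER_DOMAIN = 3   # stand-in for millions of particles each
RECORD = struct.Struct("d")  # one double per particle (e.g., an energy value)

shared = tempfile.NamedTemporaryFile(delete=False)
with shared as f:
    for domain in range(NUM_DOMAINS):
        # Seek to this domain's slot in the shared file and write its block.
        f.seek(domain * PARTICLES_PER_DOMAIN * RECORD.size)
        for p in range(PARTICLES_PER_DOMAIN):
            f.write(RECORD.pack(float(domain * 10 + p)))

# A reader needs only one open() and simple offset math to reach any domain.
with open(shared.name, "rb") as f:
    f.seek(2 * PARTICLES_PER_DOMAIN * RECORD.size)  # jump to domain 2's block
    first = RECORD.unpack(f.read(RECORD.size))[0]
print(first)  # 20.0 -- the first particle written by domain 2
os.remove(shared.name)
```

In the real system, HDF5 additionally stores the layout metadata inside the file itself, which is why downstream visualization tools can read it directly.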
Data mining made easier with FastQuery
Once the information had been stored, the next challenge was making sense of it. On this front, team members implemented an enhanced version of FastQuery. Using this tool, they indexed the 32-terabyte dataset in about 10 minutes, and queried it in three seconds. This was the first time a trillion-particle dataset had been queried this quickly.
The team accelerated FastQuery by implementing a hierarchical load-balancing strategy. Because FastQuery is built on FastBit indexing technology, researchers can search their data based on an arbitrary range of conditions defined over the available data values. This means researchers can search a trillion-particle dataset and sift out electrons by their energy values.
This capability also helps with visualization. Because most computer displays contain only a few million pixels, it's impossible to render a dataset with trillions of particles. Now, researchers can use FastQuery to identify particles of interest to render.
Karimabadi said, "Although our VPIC runs typically generate two types of data - grid and particle - we never did a whole lot with the particle data because it was really hard to extract information from a trillion particle dataset."
A version of this story first appeared on Berkeley Lab's Computational Research Division website.