Researchers worldwide have access to a growing library of quantum chromodynamics (QCD) software that has been ported to run on graphics processing units, thanks to the efforts of a handful of researchers.
"Lattice QCD is in many ways ideal for GPUs, since the bulk of QCD computations can be broken down into many parallel independent calculations that can be mapped directly onto the threaded model of computation that GPUs utilize," said Mike Clark, a physicist at Harvard.
A great deal of LQCD research involves running simulations based on theoretical models, followed by the analysis of the data generated by those simulations. In most fields, the computational cost of analyzing the simulated data is negligible; LQCD is among the exceptions. In fact, in LQCD the two phases have very different characteristics.
"The 'analysis' phase consists of hundreds of independent jobs which can be farmed out," said Ronald Babich, a Boston University physicist. "Each such job generally requires tens of conventional cluster nodes - or a handful of GPUs."
At the moment, the simulation phase requires the use of high-performance computing systems at the very top of the TOP500 list. That may not always be the case, however.
"We are working to develop the software components and algorithmic improvements necessary to offload some of this workload to large GPU clusters," Babich said.
The United States QCD collaboration (USQCD), of which Babich and Clark are members, already has access to a 500-GPU cluster at Jefferson Laboratory (JLab) and a 16-GPU cluster at Fermi National Accelerator Laboratory (Fermilab). The JLab facility has four GPUs per host machine, with two generations of GPUs spread between the 2009 and 2010 clusters. The Fermilab cluster has two GPUs per host machine, but those numbers will be increasing soon: another 128 GPUs are scheduled for installation at Fermilab in September 2011, said Don Holmgren, a Fermilab-based computer scientist.
A seed takes root
When Clark and Babich first teamed up with colleagues to port some QCD code to GPUs, their plans were on a very small scale. But the poster the team presented at a conference in the spring of 2008 (Lattice 2008) was so well-received that several team members resolved to continue working on it, turning it into a supported library for other QCD researchers to use.
One year later, in the spring of 2009, they distributed the first public release of the library. They named it QUDA, a combination of QCD and CUDA (NVIDIA's proprietary programming language for GPUs); that first release of QUDA was designed to interface with the three main application suites maintained by the USQCD (Chroma, Columbia Physics System, and MILC).
This was the first time Babich and Clark had worked with GPUs.
"I had thought about using GPUs for QCD calculations as far back as 2004, but it wasn't until CUDA was first released that I really became interested in it," Clark said.
Before CUDA, you had to fool the GPU into believing that you were doing graphics. CUDA eliminates that overhead while providing direct access to the hardware. Not only is it easier to use, said Babich, but it is also usually possible to achieve higher performance using CUDA instead of the old approach.
CUDA, which is based on C, supports a number of C++ features. There is an open-standard alternative to CUDA called OpenCL, but OpenCL does not support C++ code, and QUDA makes extensive use of CUDA's C++ features. The QUDA developers have also made extensive use of automated code generation with the C preprocessor and Python.
Tips from the GPU experts
- favor structures of arrays over arrays of structures
- use mixed-precision methods, including half precision (16-bit)
- in all cases, whether single-GPU or multi-GPU, the following mantra is a good rule of thumb: flops are free, bandwidth is expensive. Algorithms that reduce bandwidth, even at the expense of more computation, almost always win.
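The first tip can be illustrated with a small sketch. It is not taken from QUDA: the lattice size, the number of values per site, and all function names are hypothetical, and plain Python stands in for GPU code. It shows only how the two layouts place the same data in memory. When neighboring threads each read the same component of neighboring sites, the structure-of-arrays layout lets them touch adjacent memory locations, while the array-of-structures layout makes them skip over the other components.

```python
from array import array

N_SITES = 8   # hypothetical number of lattice sites
N_COMP = 3    # hypothetical number of values stored per site


def aos_index(site, comp):
    """Array of structures: all components of one site sit together."""
    return site * N_COMP + comp


def soa_index(site, comp):
    """Structure of arrays: all sites of one component sit together."""
    return comp * N_SITES + site


def to_soa(aos):
    """Repack a flat AoS buffer into SoA order."""
    soa = array("f", [0.0]) * (N_SITES * N_COMP)
    for site in range(N_SITES):
        for comp in range(N_COMP):
            soa[soa_index(site, comp)] = aos[aos_index(site, comp)]
    return soa


def addresses(index_fn, comp, n_threads=4):
    """Offsets touched when thread t reads component `comp` of site t."""
    return [index_fn(t, comp) for t in range(n_threads)]


# The value stored for (site, comp) is 10*site + comp, so the layout is easy to read.
aos = array("f", [float(10 * s + c) for s in range(N_SITES) for c in range(N_COMP)])
soa = to_soa(aos)

print(addresses(aos_index, comp=0))  # [0, 3, 6, 9]  strided accesses
print(addresses(soa_index, comp=0))  # [0, 1, 2, 3]  contiguous accesses
```

In the contiguous case a GPU can typically serve the whole group of threads with fewer memory transactions, which is one way of spending less of the expensive bandwidth.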
Despite the advantages conferred by the advent of CUDA and the use of automated code generation, the project still faces a number of challenges.
"Porting a legacy application to GPUs is not something that can be simply undertaken," Clark said. "Our approach with the QUDA library is to provide an interface such that other applications can call it."
So far, most of their work has focused on optimized linear solvers, which account for somewhere between 50 and 99% of a typical QCD workload. But the project keeps growing.
"Over the last year we have attracted more developers (as well as users)," Clark said. "We both put an appreciable and unquantifiable number of hours in each week, whether it be sitting in front of our terminals coding, proselytizing our work at presentations, or thinking up new algorithms over coffee."
Today, QUDA has four developers (including Babich and Clark) and six contributors, and boasts 44,000 lines of code (including comments and blank lines, but not including code generated automatically). They also have a growing number of users doing real physics.
Ringing the bell
Access to GPU resources has already led to a number of papers. Robert Edwards, a physicist based at Jefferson Laboratory, recently co-authored one of those papers with five other researchers.
"QCD describes how quarks are bound together [into a larger particle such as a proton]. However, the theory predicts that quarks can not be pulled apart into isolation," Edwards said.
This limitation makes it difficult to study the bonds between quarks.
"What we can do is something like ringing a bell," Edwards explained. "We 'tap' or 'zap' the bound quarks - the proton - and make it ring."
That metaphorical "ringing" is akin to the colored light atoms emit when an excited electron drops to a lower discrete energy state. In this case, however, it is the quarks inside a proton that are excited, rather than an electron orbiting an atomic nucleus at one of several discrete energy levels.
"The observation of these discrete [electron] energy jumps was a big discovery 100 years ago, and prompted Niels Bohr to come up with the first version of quantum mechanics - a development that revolutionized modern physics," Edwards said. "Now, 100 years later, we - the practitioners of QCD - are in a similar position. We are observing the discrete excited states of protons and making predictions as to how the quarks are bound and what are the fundamental ways they interact."
Using a cluster of 200 GPUs, they were able to compute for the first time the highly 'excited' spectra generated by a type of bound quark states - also known as sub-atomic particles - called "exotic isoscalar mesons." In particular, they predicted the mass of a particle that, if confirmed in experiments, suggests that quarks and the "glue" that holds them together are bound in a way that has never before been seen.
More to come
Clark and Babich plan to continue to port QCD code to GPUs and publish the resulting modules to the web. But their plans don't stop there.
"Clearly, using only a modular library approach is not a long term solution to this problem," said Clark. "We envisage that we will have to port the underlying domain-specific languages used by the USQCD applications ... to the GPUs to allow us to surmount this difficulty."
USQCD uses two programming languages specific to QCD: QDP (QCD Data Parallel), based on C, and QDP++, based on C++.
"The ideal would be for code that relies on the existing QDP framework to be able to use a GPU-enabled version without modification or with only trivial modifications that could be carried out mechanically," Babich said. "This appears to be possible, and work in this direction for QDP++ is already underway."
Clark cautions, however, that this port of QDP++ does not negate the value of a highly hand-optimized library of key routines, since a domain-specific language will never achieve as high a level of performance.
As for the code that has already been ported to GPUs, there is a new bottleneck decreasing the performance researchers can expect: bandwidth.
It turns out that GPUs are designed to execute approximately seven flops for every byte of data transferred. QCD applications, however, need to transfer one byte for every flop. That means the GPU finishes its calculations faster than it can receive the data for the next calculation or return the results of the previous one. On a single GPU, the card's internal memory bandwidth is the bottleneck. In a cluster with many GPUs, the bandwidth between the GPUs becomes a factor as well.
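A back-of-the-envelope sketch makes the size of that mismatch concrete. It uses only the two ratios quoted above (roughly seven flops per byte for the hardware, one flop per byte for QCD) and a simple bandwidth-bound model. The function name and the "halved traffic" scenario are illustrative assumptions, not measurements.

```python
def attainable_fraction(app_flops_per_byte, machine_flops_per_byte):
    """Fraction of peak arithmetic rate a kernel can reach if it is limited
    only by how fast data arrives (a simple bandwidth-bound model)."""
    if app_flops_per_byte <= 0 or machine_flops_per_byte <= 0:
        raise ValueError("flops-per-byte ratios must be positive")
    return min(1.0, app_flops_per_byte / machine_flops_per_byte)


MACHINE_BALANCE = 7.0  # flops the GPU can execute per byte moved (from the text)
QCD_INTENSITY = 1.0    # flops QCD performs per byte moved (from the text)

base = attainable_fraction(QCD_INTENSITY, MACHINE_BALANCE)

# Hypothetical scenario: the same fields stored in half as many bytes
# (for example, 16-bit instead of 32-bit values), with the flop count unchanged.
halved_traffic = attainable_fraction(2 * QCD_INTENSITY, MACHINE_BALANCE)

print(f"{base:.0%}")            # 14%
print(f"{halved_traffic:.0%}")  # 29%
```

Under this model the arithmetic units sit idle most of the time, which is why moving fewer bytes tends to pay off even when it costs extra computation.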
Industry advances may help to solve those problems. The widespread adoption of PCI-Express 3.0 over the next few years should increase GPU-to-host and inter-GPU bandwidth. CUDA 4.0 has enabled direct communication between the newest Tesla GPUs (NVIDIA's professional GPU model line) on the same host, as well as direct transfers from GPU memory across certain high-speed networks without involvement of the host CPU. Support for those features is in turn being added to the QUDA library, which should help to reduce communication time, Babich said.
GPUs such as the Teslas will play an increasing role in the QCD toolkit, Clark and Babich said. In addition to the communications enhancements enabled on these GPUs with CUDA 4.0, these GPUs offer several features intended to increase reliability.
"Going forward, I think a typical LQCD job will increasingly involve at least some routines that demand such features, and so future deployments will probably favor the professional cards more heavily," Babich said.
If it sounds like they have their work cut out for them, they do.
"Algorithms that run well on CPUs often need to be reimagined for GPUs," Clark said. "This is an ongoing research effort, so in effect, it will never finish."