High-performance computing is the workhorse for large-scale scientific analyses of DNA sequences. In many cases, the more programming knowledge a researcher has, the more they can benefit: they can run faster or more complex analyses, for example. But in genomics, some researchers hit a steep learning curve when they take on next-generation sequencing data analysis. This year, the University of Illinois, US, launched a High-Performance Biological Computing group (HPCBio) to help the biological research community climb this learning curve. Although HPCBio's approach is not unique, the service is vitally important given that it now costs more to analyze a genome than to sequence it.
"HPCBio is not a high-performance computer, but uses computing facilities at the Institute for Genomic Biology and the National Center for Supercomputing Applications based at the University of Illinois," said Victor Jongeneel, director of the HPCBio group. "So far, user feedback is positive."
Their strategy is to understand a customer's needs from the outset. "We always sit down with researchers first to understand their problem," said Jongeneel. "We agree on a sensible scientific approach. It's very important to do that. Then we give a quote on the cost which is reasonable. As we're not part of XSEDE and don't get National Science Foundation funding we have to be self-funded and charge user fees."
A researcher can request a full analysis of their data or they can access HPCBio's computers through a command line or web interface. These services are available to bioinformaticians and genomics researchers outside of the University of Illinois too. "At this time, these researchers have access to our consulting services," said Jongeneel. "We perform the analysis of datasets produced at many institutions, including outside the US."
Jongeneel was one of the founding members of a similar high-performance computing facility at the Swiss Institute of Bioinformatics, which successfully serves the majority of research centers and universities in western Switzerland. Based on that experience, he said, this model can work for the University of Illinois.
Is this a model that works?
With HPCBio, researchers currently have access to 5,000 processor cores, 10 terabytes of RAM and 500 terabytes of storage - and these resources continue to grow. "We don't know how much we'll grow by, but what we've seen in this field is that we need to grow by 50% per year to keep up with data produced by genomic sequencing," said Jongeneel.
What makes this computing environment useful to biology researchers is not just its scalable computing resources but the software services that come with it. HPCBio offers a full-service, pay-per-use model for small or large projects. It is not a cloud service, as the computing environments are not separated from the hardware. But the full-service model means a researcher does not need to be fully versed in a programming language to access HPCBio's vast resources, and can spend more time deciding on the type of analysis they require.
According to Jongeneel, HPCBio goes beyond current high-performance computing centers and grid computing resources. "We're a full service facility. We offer a much wider range of services than a traditional high-performance facility. Users can even give us their raw data. With the high-performance computing environments offered by XSEDE, researchers have to have a certain level of competence in programming. With us they do not. We provide extensive domain knowledge in computational genomics and a full bioinformatics software stack."
However, high-performance computing centers in the XSEDE network, such as the Texas Advanced Computing Center (TACC), also have a fully fledged life sciences and computational biology group and released a suite of 30 new and updated applications last year.
"The Texas Advanced Computing Center's staff of four computational biologists maintain over 85 (and counting) biology software packages," said Matthew Vaughn, manager of the Life Sciences Computing Group at TACC.
"We update them on a quarterly basis to take advantage of new features and optimizations, and we actively train and support the biology community in using these programs via a combination of workshops, online tutorials, and individual consulting. TACC provides over seven petabytes of high-speed, geographically replicated disk storage and 100 petabytes of tape capacity. We're also working with high performance computing specialists to get all of these codes ready for Stampede, TACC's upcoming petaflop high-performance computer, which comes online in January 2013. In summary, TACC offers comprehensive support for biology researchers working at the leading edge of genome research, and we are looking ahead to be ready for new technologies as they emerge."