Keeping on top of the data deluge

DNA strands. Image courtesy iStock.

High-performance computing is the workhorse for large-scale scientific analyses of DNA sequences. In many cases, the more programming knowledge a researcher has, the more they can benefit: they can run faster or more complex analyses, for example. But some genomics researchers face a steep learning curve when they take on next-generation sequencing data analysis. This year, the University of Illinois, US, launched a High-Performance Biological Computing group (HPCBio) to help the biological research community climb this learning curve. Although HPCBio's approach is not unique, the service is vitally important given that it now costs more to analyze a genome than to sequence it.

"HPCBio is not a high-performance computer, but uses computing facilities at the Institute for Genomic Biology and the National Center for Supercomputing Applications based at the University of Illinois," said Victor Jongeneel, director of the HPCBio group. "So far, user feedback is positive."

Their strategy is to understand a customer's needs from the outset. "We always sit down with researchers first to understand their problem," said Jongeneel. "We agree on a sensible scientific approach. It's very important to do that. Then we give a quote on the cost which is reasonable. As we're not part of XSEDE and don't get National Science Foundation funding we have to be self-funded and charge user fees."

A researcher can request a full analysis of their data or they can access HPCBio's computers through a command line or web interface. These services are available to bioinformaticians and genomics researchers outside of the University of Illinois too. "At this time, these researchers have access to our consulting services," said Jongeneel. "We perform the analysis of datasets produced at many institutions, including outside the US."

Jongeneel was a founding member of a similar high-performance computing facility at the Swiss Institute of Bioinformatics, which successfully serves the majority of research centers and universities in western Switzerland. Based on that experience, he said, this model can work for the University of Illinois.

Is this a model that works?

Crop plant studies are among the genomic projects carried out at the University of Illinois. To help understand grass evolution, researchers such as Stephen Moose (pictured) and his colleagues mapped the Miscanthus sinensis genome. Miscanthus grasses (also pictured) are used in gardens, burned for heat and energy, and converted into liquid fuels. They also belong to a major grass family that includes corn and sugarcane. The chromosome maps produced are a first step in sequencing the M. sinensis genome. Image courtesy Institute for Genomic Biology, University of Illinois.

The facility handles a variety of genomic biology projects including crop plants and their pests, environmental microbiomes, cancer transcriptomes, and gene expression in insects.

With HPCBio, researchers currently have access to 5,000 processor cores, 10 terabytes of RAM and 500 terabytes of storage - and these resources continue to grow. "We don't know how much we'll grow by, but what we've seen in this field is that we need to grow by 50% per year to keep up with data produced by genomic sequencing," said Jongeneel.
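To put the 50% figure in perspective, the short Python sketch below projects the storage number quoted above forward in time. It is an illustration only: it assumes growth compounds at a constant 50% per year and applies to storage alone, which goes beyond what Jongeneel stated, and the function name `project_capacity` is hypothetical.

```python
import math


def project_capacity(initial, annual_growth, years):
    """Return capacity for year 0 through `years` under constant compound growth."""
    return [initial * (1 + annual_growth) ** year for year in range(years + 1)]


# Assumption: start from the 500 terabytes of storage quoted in the article
# and grow by a constant 50% each year.
storage_tb = project_capacity(500, 0.50, 5)

for year, capacity in enumerate(storage_tb):
    print(f"year {year}: {capacity:,.0f} TB")

# Years needed for capacity to double at this growth rate.
doubling_years = math.log(2) / math.log(1.5)
print(f"doubling time: {doubling_years:.1f} years")
```

Under these assumptions, capacity doubles roughly every 1.7 years, and 500 terabytes grows to about 3,800 terabytes after five years.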

What makes this computing environment useful to biology researchers is not just its scalable computing resources but the software services that come with them. HPCBio offers a full-service, pay-per-use model for small or large projects. It is not a cloud service, as the computing environments are not separated from the hardware. But the full-service model means a researcher does not need to be fully versed in a programming language to access HPCBio's vast resources and can spend more time deciding on the type of analysis they require.

According to Jongeneel, HPCBio goes beyond current high-performance computing centers and grid computing resources. "We're a full service facility. We offer a much wider range of services than a traditional high-performance facility. Users can even give us their raw data. With the high-performance computing environments offered by XSEDE, researchers have to have a certain level of competence in programming. With us they do not. We provide extensive domain knowledge in computational genomics and a full bioinformatics software stack."

However, high-performance computing centers in the XSEDE network, such as the Texas Advanced Computing Center (TACC), also have fully fledged life sciences and computational biology groups. TACC's group released a suite of 30 new and updated applications last year.

"The Texas Advanced Computing Center's staff of four computational biologists maintain over 85 (and counting) biology software packages," said Matthew Vaughn, manager of the Life Sciences Computing Group at TACC.

"We update them on a quarterly basis to take advantage of new features and optimizations, and we actively train and support the biology community in using these programs via a combination of workshops, online tutorials, and individual consulting. TACC provides over seven petabytes of high-speed, geographically replicated disk storage and 100 petabytes of tape capacity. We're also working with high performance computing specialists to get all of these codes ready for Stampede, TACC's upcoming petaflop high-performance computer, which comes online in January 2013. In summary, TACC offers comprehensive support for biology researchers working at the leading edge of genome research, and we are looking ahead to be ready for new technologies as they emerge."

Copyright © 2019 Science Node ™
