
Simplifying and accelerating genome assembly

Genome assembly process improved from months to minutes

To extract meaning from a genome, scientists must reconstruct portions of it, a time-consuming process akin to rebuilding the sentences and paragraphs of a book from snippets of text. But by applying novel algorithms and high-performance computational techniques to the cutting-edge de novo genome assembly tool Meraculous, a team of scientists has simplified and accelerated genome assembly, reducing a months-long process to mere minutes.

Speed read
Novel algorithms and high-performance computational techniques have simplified and accelerated genome assembly. Scientists from Berkeley Lab have used the Edison supercomputer to reduce a months-long process to mere minutes.
“The new parallelized version of Meraculous shows unprecedented performance and efficient scaling up to 15,360 processor cores for the human and wheat genomes on NERSC’s Edison supercomputer,” says Evangelos Georganas. “This performance improvement sped up the assembly workflow from days to seconds.” Courtesy NERSC.

Researchers from the Lawrence Berkeley National Laboratory (Berkeley Lab) and UC Berkeley have made this gain by 'parallelizing' the DNA code (sometimes billions of bases long) to harness the processing power of supercomputers, such as the Edison system at the US Department of Energy's National Energy Research Scientific Computing Center (NERSC). (Parallelizing means splitting up tasks to run on the many nodes of a supercomputer at once.)

“Using the parallelized version of Meraculous, we can now assemble the entire human genome in about eight minutes,” says Evangelos Georganas, a UC Berkeley graduate student. “With this tool, we estimate that the output from the world’s biomedical sequencing capacity could be assembled using just a portion of the Berkeley-managed NERSC’s Edison supercomputer.”

Supercomputers: A game changer for assembly

High-throughput next-generation DNA sequencers allow researchers to look for biological solutions — and for the most part, these machines are very accurate at recording the sequence of DNA bases. Sometimes errors do occur, however. These errors complicate analysis by making it harder to assemble genomes and identify genetic mutations. They can also lead researchers to misinterpret the function of a gene.

Researchers use a technique called shotgun sequencing to identify these errors. This involves taking numerous copies of a DNA strand, breaking them up into random smaller pieces, and then sequencing each piece separately. For a particularly complex genome, this process can generate several terabytes of data.
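To make the idea concrete, here is a minimal Python sketch of shotgun-style sampling. It is a toy model, not a description of any real sequencer: the function name `shotgun_reads`, the tiny example genome, the fixed read length, and the choice to draw two rounds of random windows per copy are all assumptions made for illustration.

```python
import random

def shotgun_reads(genome, copies, read_len, seed=0):
    """Toy model: draw random fixed-length pieces from several copies of a genome."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Hypothetical choice: sample enough pieces to cover each copy about twice.
    pieces_per_copy = len(genome) // read_len * 2
    reads = []
    for _ in range(copies):
        for _ in range(pieces_per_copy):
            start = rng.randrange(0, len(genome) - read_len + 1)
            reads.append(genome[start:start + read_len])
    return reads

genome = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"  # made-up 31-base sequence
reads = shotgun_reads(genome, copies=5, read_len=8)
# Average number of reads covering each base of the genome.
coverage = len(reads) * 8 / len(genome)
```

Because every base ends up covered by several independent reads, a mistake in one read can be outvoted by the others, which is why redundancy helps in spotting errors.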

To identify data errors quickly and effectively, the Berkeley Lab and UC Berkeley team use 'Bloom filters' and massively parallel supercomputers. “Applying Bloom filters has been done before, but what we have done differently is to get Bloom filters to work with distributed memory systems,” says Aydin Buluç, a research scientist in Berkeley Lab's Computational Research Division (CRD). “This task was not trivial; it required some computing expertise to accomplish.”
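For readers unfamiliar with Bloom filters, the following Python sketch shows the basic idea on a single machine. It is not the team's distributed-memory implementation. The class `BloomFilter`, the helper `repeated_kmers`, the filter size, and the rule "trust a k-mer only if it is seen at least twice" are illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Compact set membership test: no false negatives, rare false positives."""
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions per item by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(read, k):
    """Yield every substring of length k in a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))

def repeated_kmers(reads, k):
    """Keep only k-mers seen at least twice; one-off k-mers are likely errors."""
    seen_once = BloomFilter()
    seen_again = set()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in seen_once:
                seen_again.add(kmer)   # second sighting: store it exactly
            else:
                seen_once.add(kmer)    # first sighting: costs only a few bits
    return seen_again

reads = ["ACGTAC", "ACGTAC", "ACGTTC"]  # third read carries a simulated error
trusted = repeated_kmers(reads, k=4)
```

The appeal of this pattern is memory: k-mers that appear only once, which are often sequencing errors, never get a full table entry. A false positive can occasionally let such a k-mer through, so a real pipeline would still need a later check.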

The team also developed solutions for parallelizing data input and output (I/O). “When you have several terabytes of data, just getting the computer to read your data and output results can be a huge bottleneck,” says Steven Hofmeyr, a research scientist in CRD who developed these solutions. “By allowing the computer to download the data in multiple threads, we were able to speed up the I/O process from hours to minutes.”
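The general pattern behind parallel input can be sketched in a few lines of Python. This is a single-machine illustration under stated assumptions, not the team's solution: the function names, the one-read-per-line file layout, and the use of a thread pool are all hypothetical stand-ins for what would be many nodes reading from a parallel file system.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_bases_in_range(path, start, end):
    """Count bases on the lines whose first byte lies in [start, end)."""
    total = 0
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and skip to the next line start, so a line
            # cut by the boundary is read only by the previous worker.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            total += len(line.strip())
    return total

def parallel_count(path, workers=4):
    """Split the file into byte ranges and process them concurrently."""
    size = os.path.getsize(path)
    step = max(1, -(-size // workers))  # ceiling division
    ranges = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: count_bases_in_range(path, *r), ranges))

# Demo on a small temporary file with one made-up read per line.
reads = ["ACGTACGT", "TTGACA", "GGC", "ATATATATAT"] * 50
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(reads) + "\n")
path = tmp.name
try:
    total = parallel_count(path, workers=4)
finally:
    os.remove(path)
expected = sum(len(r) for r in reads)
```

The key design choice is that each worker owns a byte range rather than a line range, so no one has to scan the whole file first to find where lines begin.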

The assembly process

Once errors are removed, researchers can begin the genome assembly. This process relies on computer programs to join k-mers (short DNA sequences consisting of a fixed number, k, of bases) at overlapping regions, so they form a continuous sequence, or contig. If the genome has previously been sequenced, scientists can use the recorded gene annotations as a reference to align the reads. If not, they need to create a whole new catalog of contigs through de novo assembly.
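Joining k-mers at their overlaps can be illustrated with a short Python sketch. It is a heavily simplified stand-in for what an assembler such as Meraculous does: the function names, the made-up reads, and the rule "extend only while exactly one unused k-mer overlaps" are assumptions for illustration, and the sketch ignores quality scores, reverse complements, and circular sequences.

```python
def kmers_of(seq, k):
    """Return every substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_contig(kmer_set, k):
    """Chain k-mers that overlap by k-1 bases into one contig (requires k >= 2)."""
    by_prefix = {}
    for kmer in kmer_set:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    suffixes = {kmer[1:] for kmer in kmer_set}
    # A starting k-mer is one that no other k-mer overlaps into.
    starts = sorted(kmer for kmer in kmer_set if kmer[:-1] not in suffixes)
    if not starts:
        return ""  # empty or circular input; not handled in this sketch
    contig = starts[0]
    used = {starts[0]}
    while True:
        candidates = by_prefix.get(contig[-(k - 1):], [])
        # Stop at a dead end, a fork, or a k-mer that was already used.
        if len(candidates) != 1 or candidates[0] in used:
            break
        contig += candidates[0][-1]  # the overlap adds one new base
        used.add(candidates[0])
    return contig

k = 5
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAG"]  # made-up overlapping reads
kmer_set = {kmer for read in reads for kmer in kmers_of(read, k)}
contig = build_contig(kmer_set, k)
print(contig)  # prints ATGGCGTACGTTAG
```

Each step looks up the last k-1 bases of the growing contig, which is why a fast lookup table of k-mers sits at the heart of this kind of assembly.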

“If assembling a single genome is like piecing together one novel, then assembling metagenomic data is like rebuilding the Library of Congress,” says Jarrod Chapman. Pictured: Human chromosomes. Courtesy Jane Ades, National Human Genome Research Institute.

De novo assembly is memory-intensive, and until recently was resistant to parallelization in distributed memory. Many researchers turned to specialized large memory nodes, several terabytes in size, to do this work, but even the largest commercially available memory nodes are not big enough to assemble massive genomes. Even with supercomputers, it still took several hours, days, or even months to assemble a single genome.

To make efficient use of massively parallel systems, Georganas created a novel algorithm for de novo assembly that takes advantage of the one-sided communication and Partitioned Global Address Space (PGAS) capabilities of the UPC (Unified Parallel C) programming language. PGAS lets researchers treat the physically separate memories of each supercomputer node as one address space, reducing the time and energy spent swapping information between nodes.
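The PGAS idea can be modeled in a few lines, with the caveat that this Python sketch runs in a single process and only imitates the concept. It is not UPC and not Georganas's algorithm. The class `PartitionedTable`, the use of a CRC32 checksum to pick an owner, and the example k-mers are all assumptions for illustration.

```python
import zlib

class PartitionedTable:
    """Toy model of one hash table spread across the memories of several nodes."""
    def __init__(self, num_nodes):
        # One dictionary per simulated node stands in for that node's local memory.
        self.partitions = [dict() for _ in range(num_nodes)]

    def owner(self, key):
        # A stable checksum decides which node's memory holds a given key.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def put(self, key, value):
        # "One-sided" in spirit: the caller writes straight into the owner's
        # partition without the owner taking part in the exchange.
        self.partitions[self.owner(key)][key] = value

    def get(self, key, default=None):
        return self.partitions[self.owner(key)].get(key, default)

table = PartitionedTable(num_nodes=4)
# Hypothetical entries: a k-mer mapped to the base that follows it.
for kmer, next_base in [("ATGG", "C"), ("TGGC", "G"), ("GGCG", "T")]:
    table.put(kmer, next_base)
```

Callers use `put` and `get` as if there were a single table, while the data actually lives in separate partitions. On a real machine, the lookup of a key owned by another node would become a remote memory access rather than a dictionary read.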

Tackling the metagenome

Now that computation is no longer a bottleneck, scientists can try a number of different parameters and run as many analyses as necessary to produce very accurate results. This breakthrough means that Meraculous could also be used to analyze metagenomes, the genetic material of microbial communities recovered directly from environmental samples. This work is important because many microbes exist only in nature and cannot be grown in a laboratory. These organisms may be the key to finding new medicines or viable energy sources.

“Analyzing metagenomes is a tremendous effort,” says Jarrod Chapman, who developed Meraculous at the US Department of Energy's Joint Genome Institute (managed by the Berkeley Lab). “If assembling a single genome is like piecing together one novel, then assembling metagenomic data is like rebuilding the Library of Congress. Using Meraculous to effectively do this analysis would be a game changer.”

--iSGTW is becoming the Science Node. Watch for our new branding and website this September.


Copyright © 2015 Science Node ™  |  Privacy Notice  |  Sitemap
