Opening the spigot at XSEDE

Speed read
  • Genetic sequencing research gets boost from computational breakthrough.
  • Transcriptomes clue researchers to cellular changes.
  • XSEDE supercomputers shoulder massive data loads.

A boost from sequencing technologies and computational tools is in store for scientists studying how cells change which of their genes are active.

Researchers using the Extreme Science and Engineering Discovery Environment (XSEDE) collaboration of supercomputing centers have reported advances in reconstructing cells’ transcriptomes — the genes activated by 'transcribing' them from DNA into RNA.

The work aims to clarify the best practices in assembling transcriptomes, which ultimately can aid researchers throughout the biomedical sciences.

Digital detectives. Researchers from Texas A&M are using XSEDE resources to manage the data from transcriptome assembly. Studying transcriptomes will offer critical clues to how cells change their behavior in response to disease processes.

“It’s crucial to determine the important factors that affect transcriptome reconstruction,” says Noushin Ghaffari of AgriLife Genomics and Bioinformatics, at Texas A&M University. “This work will particularly help generate more reliable resources for scientists studying non-model species” — species not previously well studied.

Ghaffari is principal investigator in an ongoing project whose preliminary findings and computational aspects were presented at the XSEDE16 conference in Miami in July. She is leading a team of students and supercomputing experts from Texas A&M, Indiana University, and the Pittsburgh Supercomputing Center (PSC).

The scientists sought to improve the quality and efficiency of assembling transcriptomes, and they tested their work on two real RNA-Seq data sets from the Sequencing Quality Control Consortium (SEQC): one of cancer cell lines and one of brain tissue from 23 human donors.

What's in a transcriptome?

The transcriptome of a cell at a given moment changes as it reacts to its environment. Transcriptomes offer critical clues to how cells change their behavior in response to disease processes like cancer, or to normal bodily signals like hormones.

Assembling a transcriptome is a big undertaking with current technology, though. Scientists must start with samples containing tens or hundreds of thousands of RNA molecules that are each thousands of RNA 'base units' long. Trouble is, most of the current high-speed sequencing technologies can only read a couple hundred bases at one time.

So researchers must first chemically cut the RNA into small pieces, sequence it, remove RNA not directing cell activity, and then match the overlapping fragments to reassemble the original RNA molecules.

Harder still, they must identify and correct sequencing mistakes, and deal with repetitive sequences that make the origin and number of repetitions of a given RNA sequence unclear.
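To give a feel for the reassembly step, the toy Python sketch below merges short fragments by their overlapping ends. It is only an illustration of the idea, not the assembler the team used: the sequences and the minimum overlap length are made up, and the sketch ignores the sequencing errors and repeats described above.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a.endswith(b[:k]):
            best = k
    return best


def greedy_merge(reads):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return reads                      # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three made-up fragments of one short RNA sequence:
print(greedy_merge(["AUGGCUAAC", "CUAACGGAU", "CGGAUUAG"]))
# -> ['AUGGCUAACGGAUUAG']
```

Real assemblers work with millions of reads, tolerate inexact overlaps caused by sequencing errors, and must decide how to handle repeated sequence, which is what makes the problem so memory-hungry.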

While software tools exist to undertake all of these tasks, Ghaffari's report was the most comprehensive yet to examine a variety of factors that affect assembly speed and accuracy when these tools are combined in a start-to-finish workflow.

Heavy lifting

The report used SEQC data to assemble a transcriptome, incorporating many quality control steps to ensure results were accurate. The process required vast amounts of computer memory, supplied by PSC’s high-memory supercomputers Blacklight, Greenfield, and now the new Bridges system’s 3-terabyte 'large memory nodes.'

Building bridges. Bridges, a new PSC supercomputer, is designed for unprecedented flexibility and ease of use. It will include database and web servers to support gateways, collaboration, and powerful data management functions. Courtesy Pittsburgh Supercomputing Center.

“As part of this work, we are running some of the largest transcriptome assemblies ever done,” says coauthor Philip Blood of PSC, an expert in XSEDE’s Extended Collaborative Support Service. “Our effort focused on running all these big data sets many different ways to see what factors are important in getting the best quality. Doing this required the large memory nodes on Bridges, and a lot of technical expertise to manage the complexities of the workflow.”

During the study, the team concentrated on optimizing the speed of data movement from storage to memory to the processors and back.
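The article does not detail how that data movement was tuned, but one common tactic on HPC systems, sketched below with placeholder paths, is to stage the read files from shared storage onto node-local scratch space before the memory- and I/O-heavy steps run.

```python
# Hypothetical staging step: copy input reads from shared storage onto
# node-local scratch before a memory- and I/O-heavy assembly run.
# The directory names are placeholders, not paths from the study.
import os
import shutil
import tempfile

SHARED_INPUT = "/path/to/shared/reads"                        # placeholder
LOCAL_SCRATCH = os.environ.get("TMPDIR", tempfile.gettempdir())


def stage_reads(shared_dir, scratch_dir):
    """Copy read files to node-local storage and return the local path."""
    dest = os.path.join(scratch_dir, "reads")
    os.makedirs(dest, exist_ok=True)
    for name in os.listdir(shared_dir):
        if name.endswith((".fastq", ".fastq.gz")):
            shutil.copy2(os.path.join(shared_dir, name), dest)
    return dest


# local_reads = stage_reads(SHARED_INPUT, LOCAL_SCRATCH)
# ...point the assembler at local_reads instead of the shared path...
```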

They also incorporated new verification steps to avoid perplexing errors that arise when wrangling big data through complex pipelines. Future work will include the incorporation of 'checkpoints' — storing the computations regularly so that work is not lost if a software error happens.
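A checkpoint can be as simple as writing each stage's result to disk and skipping any stage whose result already exists. The Python sketch below shows that pattern; the stage names and the pickle-based format are illustrative assumptions, not details from the project.

```python
# Minimal checkpointing sketch: each finished stage writes its result to
# disk, so a crashed run can resume without redoing completed work.
import os
import pickle

CHECKPOINT_DIR = "checkpoints"


def run_stage(name, func, *args):
    """Run func unless a checkpoint for this stage already exists."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)        # resume from the saved result
    result = func(*args)
    with open(path, "wb") as f:
        pickle.dump(result, f)           # save so a rerun can skip this stage
    return result


# Example with placeholder stage functions:
# trimmed  = run_stage("trim", trim_reads, raw_reads)
# filtered = run_stage("filter", remove_noncoding, trimmed)
# contigs  = run_stage("assemble", assemble, filtered)
```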

Ultimately, Blood adds, the scientists would like to put all the steps of the process into an automated workflow that will make it easy for other biomedical researchers to replicate.
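The article does not say what form that workflow will take. As a rough illustration, a driver script that lists every step explicitly and stops at the first failure is one simple way to make a pipeline reproducible; every command name below is a placeholder, not a tool the team named.

```python
# Bare-bones pipeline driver: every step is listed explicitly and run in
# order, stopping at the first failure so partial results are never
# mistaken for finished ones. All command names are placeholders.
import subprocess

STEPS = [
    ["trim_reads", "raw.fastq", "trimmed.fastq"],
    ["filter_noncoding", "trimmed.fastq", "filtered.fastq"],
    ["assemble_rna", "filtered.fastq", "contigs.fasta"],
    ["check_assembly", "contigs.fasta"],
]


def run_pipeline():
    for cmd in STEPS:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)   # raises if the step fails


if __name__ == "__main__":
    run_pipeline()
```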

The work will provide tools to help other scientists improve our understanding of how living organisms respond to disease, environmental, and evolutionary changes, the team reported.
