Bioconductor: harmonizing distributed genomic analysis

Worldwide activity of the Bioconductor project. The map shows the spatial distribution of 1.1 million total sessions on the Bioconductor website in the year up to October 2014. Image courtesy Wolfgang Huber.

Since the advent of the Human Genome Project at the turn of the century, big data has had researchers looking for analytical tools to bring order out of the chaos. Creating a maestro to orchestrate the cacophony is the aim of genomic analysis projects like Bioconductor.

Bioconductor is an open source array of software packages that enable worldwide dissemination and analysis of genomic data. Launched in the fall of 2001, Bioconductor provides a common set of tools to facilitate bioinformatic statistical analysis, including high-throughput sequencing, , , and other genetic data.

The software suite, , is developed in the statistical computing language R by scientists around the world; a team based primarily at the in the US leads the effort.

In some ways, the story of Bioconductor is the story of big data. Prior to tools like Bioconductor, genomic research relied on for processing sequence and text-like data. But there were few tools for working with quantitative data types like microarray data, says Wolfgang Huber, senior scientist at the in Heidelberg, Germany.

“The working mode was that computation-savvy academic labs would build up analysis systems from scratch, with little regard for interoperability or code re-use,” says Huber. “Lab biologists lived with the expectation that they could satisfy their needs with point-and-click software. Code tended to be closed, obscure, non-portable, and insular.”

The result was a discordant score for deciphering the human genetic code. In contrast, analysis projects like Bioconductor have created a transparent, reproducible code source as a common language for users, and built a worldwide community to foster developers from among subject matter experts.

To say that anything accomplished with Bioconductor could not have been accomplished without it would be arrogant, Huber admits. Scientists, after all, are an inventive lot. Nevertheless, , , and the US are examples of projects in which Bioconductor was heavily used and helped speed progress, Huber says.

More generally, Bioconductor is responsible for the rapid adoption of microarrays, and of other functional assays like ChIP-Seq and RNA-Seq, in many research hospitals around the world, Huber says.

Bioconductor partner NHGRI hosts the Encyclopedia of DNA Elements (ENCODE) database, housing in excess of 1.1 petabytes of data. Bioconductor's global reach indicates it is responsive to an even larger scale of throughput.

The volume of medical data processed by the Bioconductor project over the last 14 years, though difficult to measure, is likely on the petabyte scale at this point. “More important than the number of bytes,” Huber notes, “is the complexity and richness of the data, the number of cross-relationships, or the importance of a certain disease to society.”

To hold and transfer these large datasets, Bioconductor looks to resources such as the , TCGA, or . In the US, other repositories for storing, cataloging, and accessing cancer genome sequences include those held by at UC Santa Cruz, and institutes like the and the at Memorial Sloan-Kettering Cancer Center.

Big data successes

As outlined in Huber's article, Bioconductor is a resounding success. Success for big data echoes in many registers, however. Under the model of precision medicine, recently prioritized in the US by , data flows from patient to an analysis engine like Bioconductor and finally to a clinic, where the happy end result is a pharmaceutical product for the patient's malady.

Image courtesy Wolfgang Huber.

Big data has shown the promise of precision medicine by enabling the production of drugs such as imatinib for leukemia, trastuzumab for breast cancer, and gefitinib for lung cancer. Often, the efficacy of a pharmaceutical remedy hinges on a tumor's molecular portrait, so analytical software like Bioconductor provides invaluable assistance to healthcare practitioners. Certain drugs are only effective in patients with a particular mutation, says Carolyn Hutter, program director in the division of genomic medicine at the NHGRI.

Hutter points to the as an example of how big data is exploring the potential of precision medicine. The trials look for changes in the ALK and EGFR genes, both thought to drive cancer growth. Once a patient's tumor is removed, they are given crizotinib or erlotinib to see how well the drugs prevent recurrence and improve cancer survival.

Beyond direct pharmaceutical applications, helping determine a cancer's severity is another measure of big data's success in oncology. Thanks to the increased access to genomic information provided by analytical tools like Bioconductor, scientists have learned that a tumor's molecular characterization may be more important than its tissue of origin.

Classifying tumors based on their molecular profile is informative to the patient and their physician; knowing what is happening with a cancer can yield a very different prognosis and course of treatment. “If you can separate people with benign cancer from those with malignant tumors, then you can limit more aggressive treatments – which may have more severe side effects – only to those patients who need them,” says Hutter.

But perhaps the most important metric of success for big data is the scaffolding it provides for future, yet unforeseen, advances in health practice and human knowledge. When placed in a larger epistemological context, it is hard to overstate the function big data and tools like Bioconductor serve as we come to understand the wealth of information under our fingertips.
