- New study turns to the Gordon supercomputer to manage big data challenge of the microbiome
- Machine learning algorithm trains supercomputer to differentiate between healthy and unhealthy bacterial proteins
- Study paves the way for promising non-invasive therapies
Imagine trying to specify all the animals and plants in a complex ecology like a rain forest or coral reef. Now, imagine trying to do this in the gut microbiome, where each creature is microscopic and identified only by its DNA sequence.
Determining the health of that ecology is a classic big data problem, in which the data come from a powerful combination of genetic sequencing techniques and supercomputing software tools.
The challenge then becomes how to mine this data to obtain new insights into the causes of diseases, as well as develop novel therapies to treat them.
A new proof-of-concept study by a research team from the University of California San Diego (UCSD), the California Institute for Telecommunications and Information Technology’s (Calit2) Qualcomm Institute, and the J. Craig Venter Institute (JCVI) suggests there is great promise for new non-invasive diagnostic tools.
Go with your gut
As recent advances in scientific understanding of Parkinson’s disease and cancer immunotherapy have shown, our gut microbiomes – the trillions of bacteria, viruses, and other microbes that live within our large intestine – are emerging as one of the richest untapped sources of insight into human health.
The problem is that these microbes live in a very dense ecology of up to 1 billion microbes per gram.
Research began with a genetic sequencing technique known as ‘metagenomics,’ which considers the DNA of the hundreds of species of microbes that live in the gut.
Scientists applied the technique to samples from 30 healthy people (using sequencing data from the National Institutes of Health’s Human Microbiome Project), together with 30 samples from people suffering from Inflammatory Bowel Disease (IBD), including those with ulcerative colitis and with ileal or colonic Crohn’s disease.
Since each bacterium contains thousands of genes and each gene can express a protein, this technique made it possible to translate the reconstructed DNA of the microbial community into hundreds of thousands of proteins, which were then grouped into about 10,000 protein families.
In total, the team sequenced around 600 billion DNA bases and then fed them into a supercomputer to reconstruct the relative abundance of these microbial species.
The team conducted their analysis on the Gordon supercomputer at the San Diego Supercomputer Center (SDSC) using 180,000 core-hours. (This is roughly equivalent to running a single-core desktop computer 24 hours a day for 20 years.)
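The desktop comparison is simple arithmetic, and a short sketch makes the assumption behind it visible: the 20-year figure only holds if the desktop contributes one core around the clock. The function name and the `cores` parameter below are illustrative and not taken from the study.

```python
# Convert a core-hour budget into years of continuous desktop time.
# Assumption: the desktop runs nonstop and every core is fully used.

CORE_HOURS = 180_000          # figure reported for the Gordon analysis
HOURS_PER_YEAR = 24 * 365     # ignoring leap years


def desktop_years(core_hours, cores=1):
    """Years of 24-hour operation needed on a machine with `cores` cores."""
    return core_hours / (cores * HOURS_PER_YEAR)


single_core = desktop_years(CORE_HOURS)
quad_core = desktop_years(CORE_HOURS, cores=4)

print(f"1 core:  {single_core:.1f} years")   # 20.5 years
print(f"4 cores: {quad_core:.1f} years")     # 5.1 years
```

On a hypothetical four-core desktop, the same workload would still take about five years, which is why the analysis was run on a supercomputer.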
Metagenomics and machine learning
To discover the patterns hidden in this huge pile of numbers, researchers used machine learning algorithms to classify major changes in protein families found in the gut bacteria.
The team first identified the 100 most statistically significant protein families that differentiate health and disease states. They then used these families as a training set to build a machine learning classifier that could categorize the remaining 9,900 protein families in diseased versus healthy states. Researchers looked for a signature in which protein families were elevated or suppressed in disease vs. healthy states.
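The two-step procedure can be illustrated with a small sketch on synthetic data. The article does not say which statistical test or classifier the team used, so the Welch t-statistic and the nearest-centroid classifier below are stand-ins chosen for simplicity. The data, function names, and scaled-down sizes (200 families, 20 used for training) are all hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two samples with unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)


def zscore(profile):
    """Standardize one family's abundance profile across all samples."""
    mu, sd = statistics.mean(profile), statistics.pstdev(profile)
    return [(x - mu) / sd for x in profile]


def centroid(profiles):
    """Element-wise mean of several equal-length profiles."""
    return [statistics.mean(col) for col in zip(*profiles)]


def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def classify_families(abundance, n_healthy, top_k):
    """abundance[f] holds family f's abundance per sample, healthy samples first.

    Returns (training labels, predicted labels), both keyed by family index.
    """
    # Step 1: rank families by how strongly they separate disease from health.
    t = {f: welch_t(row[n_healthy:], row[:n_healthy]) for f, row in enumerate(abundance)}
    ranked = sorted(t, key=lambda f: abs(t[f]), reverse=True)
    train, rest = ranked[:top_k], ranked[top_k:]

    # Step 2: label the training families by their direction of change in disease.
    labels = {f: "elevated" if t[f] > 0 else "suppressed" for f in train}

    # Step 3: build one centroid per label, then assign every remaining
    # family to the label whose centroid is closest to its profile.
    z = {f: zscore(row) for f, row in enumerate(abundance)}
    centroids = {
        lab: centroid([z[f] for f in train if labels[f] == lab])
        for lab in set(labels.values())
    }
    predicted = {
        f: min(centroids, key=lambda lab: sq_dist(z[f], centroids[lab]))
        for f in rest
    }
    return labels, predicted


def make_synthetic(n_families=200, n_healthy=30, n_disease=30, seed=0):
    """Hypothetical data: even-numbered families rise in disease, odd ones fall."""
    rng = random.Random(seed)
    abundance, truth = [], []
    for f in range(n_families):
        shift = 2.0 if f % 2 == 0 else -2.0
        healthy = [rng.gauss(10, 1) for _ in range(n_healthy)]
        disease = [rng.gauss(10 + shift, 1) for _ in range(n_disease)]
        abundance.append(healthy + disease)
        truth.append("elevated" if shift > 0 else "suppressed")
    return abundance, truth


abundance, truth = make_synthetic()
labels, predicted = classify_families(abundance, n_healthy=30, top_k=20)
accuracy = sum(predicted[f] == truth[f] for f in predicted) / len(predicted)
print(f"trained on {len(labels)} families, classified {len(predicted)}, accuracy {accuracy:.2f}")
```

The point of the design is that only the clearest cases are labeled directly from the statistics. Every other family is labeled by its similarity to those cases, which mirrors the idea of learning from past examples rather than fixed rules.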
Machine learning is akin to training a computer to recognize the different flavors of fruit juices – something a human toddler could do intuitively, albeit from a limited perspective.
“From your past experiences drinking juice, you know the difference between orange, apple, and cranberry juice,” notes Bryn C. Taylor, co-author of the study. “Your future decision about what juice you are drinking will be based on your past preferences. But it’s really hard to figure out what apple juice tastes like without experiencing it first.”
Likewise, the team had to train the computer to recognize what apple juice tastes like – or, in this case, what a healthy microbiome looks like – by clustering data according to bacteria.
“You can try to categorize healthy and sick people by looking at their intestinal bacterial composition,” explains Taylor, “but the differences are not always clear. Instead, when we categorize by the bacterial protein family levels, we see a distinct difference between healthy and sick people. This is because proteins are the workhorses of biology, and by analyzing the proteins produced by these bacteria, we can get an idea of what the bacteria are doing in your gut.”
The machine learning approach is effective precisely because it’s statistically based, says lead author Mehrdad Yazdani, a machine learning and data scientist at the Qualcomm Institute. “The rules are not set in stone. What you need is past data and past experiences from patients, and then based on statistics or distribution you make your decisions. You let the data speak for itself,” he adds.
In the future, the researchers hope to expand their analysis, using SDSC’s Comet supercomputer, from 10,000 protein families to one million individual genes, each of which codes for a protein that can be expressed in the gut microbiome.
“We want a fast turn-around,” says Yazdani. “That’s really important, especially for clinical data.”