
Machine learning and the microbiome

Speed read
  • New study turns to the Gordon supercomputer to manage big data challenge of the microbiome
  • Machine learning algorithm trains supercomputer to differentiate between healthy and unhealthy bacterial proteins
  • Study paves the way for promising non-invasive therapies

Imagine trying to specify all the animals and plants in a complex ecology like a rain forest or coral reef. Now, imagine trying to do this in the gut microbiome, where each creature is microscopic and identified only by its DNA sequence.

Determining the health of that ecology is a classic big data problem, where the big data is provided by a powerful combination of genetic sequencing techniques and supercomputing software tools.

The challenge then becomes how to mine this data to obtain new insights into the causes of diseases, as well as develop novel therapies to treat them.

Supercomputers spot sickness. A visualization of protein families is projected onto researchers Mehrdad Yazdani and Bryn Taylor in the UCSD Center for Microbiome Innovation (http://jacobsschool.ucsd.edu/microbiome/). Courtesy Bryn Taylor.

A new proof-of-concept study by a research team from the University of California San Diego (UCSD), the California Institute for Telecommunications and Information Technology’s (Calit2) Qualcomm Institute, and the J. Craig Venter Institute (JCVI) suggests there is great promise for new non-invasive diagnostic tools.

Go with your gut

As recent advances in scientific understanding of Parkinson’s disease and cancer immunotherapy have shown, our gut microbiomes – the trillions of bacteria, viruses, and other microbes that live within our large intestine – are emerging as one of the richest untapped sources of insight into human health. 

The problem is these microbes live in a very dense ecology of up to 1 billion microbes per gram.

The research began with a genetic sequencing technique known as ‘metagenomics,’ which considers the DNA of the hundreds of species of microbes that live in the gut.

Scientists applied the technique to samples from 30 healthy people (using sequencing data from the National Institutes of Health’s Human Microbiome Project), together with 30 samples from people suffering from Inflammatory Bowel Disease (IBD), including those with ulcerative colitis and with ileal or colonic Crohn’s disease.

Since each bacterium contains thousands of genes and each gene can express a protein, this technique made it possible to translate the reconstructed DNA of the microbial community into hundreds of thousands of proteins, which were then grouped into about 10,000 protein families.

Gutsy move. The top two graphs show a “training set” of 100 protein families. The lower two graphs are the results from the machine learning algorithm, which discovered the protein families that had similar patterns in the remaining 9,900 protein families. Courtesy Mehrdad Yazdani, et al.

In total, the team sequenced around 600 billion DNA bases and then fed them into a supercomputer to reconstruct the relative abundance of these microbial species.  

The team conducted their analysis on the Gordon supercomputer at the San Diego Supercomputer Center (SDSC) using 180,000 core-hours. (This is roughly equivalent to running a single desktop processor core 24 hours a day for 20 years.)
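The 20-year comparison can be checked with a quick calculation. This is only a back-of-the-envelope sketch; it assumes that one core-hour corresponds to one hour on a single desktop core running without interruption, which the article does not state explicitly.

```python
# Back-of-the-envelope check of the "20 years" comparison.
# Assumption: one core-hour equals one hour on a single desktop core.
CORE_HOURS = 180_000
HOURS_PER_YEAR = 24 * 365  # ignores leap years

years = CORE_HOURS / HOURS_PER_YEAR
print(f"{years:.1f} years")  # 20.5 years
```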

Metagenomics and machine learning 

To discover the patterns hidden in this huge pile of numbers, researchers used machine learning algorithms to classify major changes in protein families found in the gut bacteria. 

The team first identified the 100 most statistically significant protein families that differentiate health and disease states. They then used these families as a training set to build a machine learning classifier that could categorize the remaining 9,900 protein families in diseased versus healthy states. Researchers looked for a signature in which protein families were elevated or suppressed in disease vs. healthy states. 
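The two-step idea described above (rank protein families by statistical significance, then use the top-ranked ones to label the rest) can be sketched in a few lines of Python. This is an illustration only, not the study's actual pipeline: the article does not name the statistical test or the classifier, so the Welch t-statistic and the nearest-centroid rule below are stand-in assumptions, the abundance data is synthetic, and the family count is scaled down from about 10,000 to 1,000 to keep the run short.

```python
import random
import statistics

random.seed(0)
N_HEALTHY, N_DISEASE = 30, 30      # cohort sizes reported in the article
N_FAMILIES, N_TRAIN = 1000, 100    # scaled down from ~10,000 families

def synthetic_family(shift):
    """Return abundances for one family: healthy samples first, then disease."""
    healthy = [random.gauss(0.0, 1.0) for _ in range(N_HEALTHY)]
    disease = [random.gauss(shift, 1.0) for _ in range(N_DISEASE)]
    return healthy + disease

def welch_t(profile):
    """Welch t-statistic of disease vs. healthy abundance for one family."""
    healthy, disease = profile[:N_HEALTHY], profile[N_HEALTHY:]
    se = (statistics.variance(healthy) / N_HEALTHY
          + statistics.variance(disease) / N_DISEASE) ** 0.5
    return (statistics.mean(disease) - statistics.mean(healthy)) / se

def centroid(profiles):
    """Per-sample mean across a list of family profiles."""
    return [statistics.mean(column) for column in zip(*profiles)]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical ground truth: +1 = elevated in disease, -1 = suppressed.
truth = [random.choice([1, -1]) for _ in range(N_FAMILIES)]
profiles = [synthetic_family(1.0 * label) for label in truth]

# Step 1: rank families by |t| and keep the most significant as the training set.
t_stats = [welch_t(p) for p in profiles]
ranked = sorted(range(N_FAMILIES), key=lambda i: abs(t_stats[i]), reverse=True)
train, rest = ranked[:N_TRAIN], ranked[N_TRAIN:]

# Step 2: label training families by the sign of t; build one centroid per class.
elevated = centroid([profiles[i] for i in train if t_stats[i] > 0])
suppressed = centroid([profiles[i] for i in train if t_stats[i] <= 0])

# Step 3: assign each remaining family to the nearer centroid.
predicted = {
    i: 1 if sq_dist(profiles[i], elevated) < sq_dist(profiles[i], suppressed) else -1
    for i in rest
}

accuracy = sum(predicted[i] == truth[i] for i in rest) / len(rest)
print(f"classified {len(rest)} families, accuracy {accuracy:.2f}")
```

On real data the signal would be far noisier than in this synthetic example, and many families would show no change at all, so a practical classifier would also need a way to report "no clear difference" rather than forcing every family into one of two classes.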

Machine learning is akin to training a computer to recognize the different flavors of fruit juices – something a human toddler could do intuitively, albeit from a limited perspective.

“From your past experiences drinking juice, you know the difference between orange, apple, and cranberry juice,” notes Bryn C. Taylor, co-author of the study. “Your future decision about what juice you are drinking will be based on your past preferences. But it’s really hard to figure out what apple juice tastes like without experiencing it first.”

Likewise, the team had to train the computer to recognize what apple juice tastes like – or in this case, what a healthy microbiome looks like by clustering data according to bacteria.

“You can try to categorize healthy and sick people by looking at their intestinal bacterial composition,” explains Taylor, “but the differences are not always clear. Instead, when we categorize by the bacterial protein family levels, we see a distinct difference between healthy and sick people. This is because proteins are the workhorses of biology, and by analyzing the proteins produced by these bacteria, we can get an idea of what the bacteria are doing in your gut.”

Go, Gordon, go! 180,000 core-hours on the Gordon supercomputer enabled researchers to identify patterns among bacterial protein families. A new proof-of-concept offers a non-invasive method to identify the health of gut bacteria. Courtesy SDSC.

The machine learning approach is effective precisely because it’s statistically based, says lead author Mehrdad Yazdani, a machine learning and data scientist at the Qualcomm Institute. “The rules are not set in stone. What you need is past data and past experiences from patients, and then based on statistics or distribution you make your decisions. You let the data speak for itself,” he adds.

In the future, the researchers hope to expand their analysis, using SDSC’s Comet supercomputer, from 10,000 protein families to one million individual genes, each of which codes for a protein that can be expressed in the gut microbiome.

“We want a fast turn-around,” says Yazdani. “That’s really important, especially for clinical data.”

The research team included Mehrdad Yazdani, a machine learning and data scientist at the California Institute for Telecommunications and Information Technology’s (Calit2) Qualcomm Institute; Biomedical Sciences graduate student Bryn C. Taylor and Pediatrics Postdoctoral Scholar Justine Debelius; Rob Knight, a professor in the UC San Diego School of Medicine’s Pediatrics Department as well as the Computer Science and Engineering Department and director of the Center for Microbiome Innovation; and Larry Smarr, Director of Calit2 and a professor of Computer Science and Engineering. The UC San Diego team also collaborated with Weizhong Li, an associate professor at JCVI.


Copyright © 2023 Science Node ™  |  Privacy Notice  |  Sitemap
