You may think the tree of life was settled a long time ago (or maybe you don't believe in it at all), but scientists continue to refine, and sometimes radically alter, our understanding of how species are related to each other. Evolutionary history was once inferred from bones, skeletons and other morphological clues; today, DNA is the main source of evidence in the story of how the Earth became such a diverse place.
Phylogenetics is the branch of life science that studies the evolutionary relationships among organisms based on genetic evidence. By aligning the molecular sequences of different species (from genes, but also from transcriptomes and proteins), it is possible to see how organisms differ at the genetic level. One can then determine where species diverged and create branching trees of relationships based on the alignments.
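To make the idea of "differing at the genetic level" concrete, here is a minimal sketch in Python. It assumes the sequences have already been aligned, and it uses short made-up sequences and a hypothetical helper named p_distance; none of this comes from the researchers' software. It simply counts the fraction of aligned positions at which two species differ.

```python
from itertools import combinations

# Toy, already-aligned sequences (made-up data, not from any real genome).
aligned = {
    "human": "ACGTTGCA",
    "chimp": "ACGTTGCT",
    "croc":  "ATGATGGT",
}

def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ.

    Columns where either sequence has a gap ("-") are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

# Distance for every pair of species.
distances = {
    (s, t): p_distance(aligned[s], aligned[t])
    for s, t in combinations(aligned, 2)
}

# The pair with the smallest distance is the most similar in this toy data.
closest = min(distances, key=distances.get)
print(distances)
print("most similar pair:", closest)
```

Real tree-building methods use statistical models of sequence change rather than raw difference counts, but pairwise comparisons like these are the starting intuition.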
With the cost of gene sequencing in decline, researchers are performing more phylogenetic studies, helping them draw new conclusions about how organisms, or specific traits, have evolved. However, the process of lining up tens of thousands of sequences from hundreds or thousands of different species is incredibly complicated, even for a computer.
"The most accurate trees are estimated using methods that try to solve hard optimization problems," said Tandy Warnow, a computer scientist at The University of Texas at Austin (UT-Austin) and a Guggenheim Fellow. "While those solutions can be done on small datasets or moderate sized data sets, on large datasets, they can take a very long time - weeks to months to years of computational time. The Texas Advanced Computing Center ends up being essential for those problems."
The Texas Advanced Computing Center runs some of the biggest and most powerful systems in the world, but even its supercomputers can hardly keep up with the pace of genetic research. Moore's law holds that the number of transistors on a chip, and with it roughly the performance of computers, doubles about every two years; the ability of gene sequencers to create data, however, has grown at an even faster rate.
"It's a different kind of challenge," Warnow said. "It's not just how we run analyses on big data sets, but how do we access the data in a way that is sensible?"
Warnow, working with post-doc Kevin Liu (Rice University) and Siavash Mirarab, a PhD student at UT-Austin, has been addressing these problems by creating smarter, faster, and more accurate algorithms and applying them to some of the biggest datasets ever created. With support from the National Science Foundation (through the Assembling the Tree of Life project), she and her colleagues have developed software that allows computers to draw better evolutionary trees faster.
Divide and Conquer
The software Warnow's group developed over the course of several years is called SATé: Simultaneous Alignment and Tree Estimation. The method uses a novel divide-and-conquer approach.
"By dividing a really big data set that's hard to align into small data sets that are closely related, you can get good estimates on each subset and then get an alignment on the full data set," Warnow explained.
Massive supercomputers, like Ranger at TACC, align the sequences of each subset and combine the alignments into an alignment on the full set of sequences.
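The shape of that divide-and-conquer strategy can be sketched in a few lines of Python. This is an illustration only, not SATé's actual procedure: the function names are hypothetical, the "aligner" and the "merge" are placeholders that just pad sequences with gaps, and the input is assumed to be ordered so that close relatives sit next to each other.

```python
def split_into_subsets(names, max_size):
    """Divide: cut an ordered list of names into consecutive chunks.

    Assumes `names` is ordered so that close relatives are adjacent,
    for example by the leaf order of a rough guide tree.
    """
    return [names[i:i + max_size] for i in range(0, len(names), max_size)]

def align_subset(seqs):
    """Conquer: placeholder aligner that right-pads with gaps to equal length.

    A real pipeline would call a proper multiple-sequence aligner here.
    """
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}

def merge_alignments(sub_alignments):
    """Combine: placeholder merge that pads every row to a common width.

    A real merge aligns the sub-alignments to one another.
    """
    width = max(len(next(iter(a.values()))) for a in sub_alignments)
    merged = {}
    for alignment in sub_alignments:
        for name, row in alignment.items():
            merged[name] = row.ljust(width, "-")
    return merged

def divide_and_conquer_align(seqs, order, max_size=2):
    """Split the data, align each small subset, then merge the results."""
    subsets = split_into_subsets(order, max_size)
    sub_alignments = [
        align_subset({name: seqs[name] for name in subset})
        for subset in subsets
    ]
    return merge_alignments(sub_alignments)

# Made-up toy sequences, ordered so that presumed relatives are adjacent.
seqs = {"emu": "ACGTAC", "ostrich": "ACGTA",
        "tinamou": "ACGGTACA", "chicken": "ACGGTAC"}
order = ["emu", "ostrich", "tinamou", "chicken"]
full = divide_and_conquer_align(seqs, order)
for name, row in full.items():
    print(f"{name:8s} {row}")
```

The point of the structure is that each subset is small and closely related, so the hard step (alignment) only ever runs on easy inputs; the subsets are independent of one another, which is also what makes the work easy to spread across many processors.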
There's no way to know if the tree that emerges from these analyses is absolutely accurate. Some trees are obviously wrong - for example, those that show humans and crocodiles on the same branch, separated from chimps - but most are plausible.
For that reason, SATé uses a statistical method to assign each tree a maximum likelihood score: a measure by which a candidate tree can be compared against alternative answers. SATé repeats the process of alignment and tree-building many times, until it arrives at the tree with the highest likelihood score it can find.
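The repeat-and-score loop has a simple general form, sketched below in Python. The names here (iterate_until_no_improvement, propose, score) are hypothetical, and the toy example stands in a single number for an alignment-plus-tree and a made-up formula for the likelihood score; the stopping rule, too, is an assumption rather than a description of SATé's own.

```python
def iterate_until_no_improvement(start, propose, score, max_rounds=100):
    """Keep replacing the current answer with a better-scoring candidate.

    `propose` builds a new candidate from the current best answer and
    `score` rates a candidate (higher is better). The loop stops when a
    round fails to improve the score, or after `max_rounds` rounds.
    """
    best, best_score = start, score(start)
    for _ in range(max_rounds):
        candidate = propose(best)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break  # no improvement this round, so stop
        best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins: a "candidate" is just an integer, and the made-up score
# peaks at 7. In a real analysis the candidate would be an alignment and
# tree, and the score would be a likelihood computed from the data.
best, best_score = iterate_until_no_improvement(
    start=0,
    propose=lambda x: x + 1,
    score=lambda x: -(x - 7) ** 2,
)
print(best, best_score)
```

A loop like this can only promise the best answer it has seen, not the best answer that exists, which is one reason the comparison against other methods matters.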
In software development, it's not enough to invent a new product. One must also prove the product is better than the alternatives. To this end, Warnow and her team have been working as quality assurance and reliability testers, solving hard evolutionary tree problems multiple times, with different methods and parameters, to ensure that SATé produces the highest-quality result.
In results first reported in Science and later explored in PLoS Currents and Systematic Biology, the researchers have shown again and again that SATé either matches the accuracy of commonly used alignment and tree estimation methods while running far faster, or achieves greater accuracy in the same amount of time.
For the Birds
Warnow and her team's efforts go beyond algorithmic and software development. They also collaborate with evolutionary biologists on projects where their methodological improvements can lead to new insights.
Since Charles Darwin's day, scientists have debated the evolutionary history of flightless birds, known as ratites. How did so many similar species get to the far-flung corners of the Earth?
"The theory of continental drift provided a convenient answer," said Michael Braun, a curator in the department of systematic biology at the Smithsonian Institution. "These birds evolved from a common flightless ancestor and then drifted to their current distributions. For 40 years, this remained the textbook explanation of species dispersal."
That is, until Braun discovered through DNA analysis that an ancient family of birds found in South America, the tinamous, was one of the most closely related groups to emus and ostriches - and tinamous can fly!
This fact, combined with the lack of skeletal evidence for flightless birds before the time of continental breakup, led to a re-conceptualization of the ratite branch of the avian tree. Ratites were in fact descended from flying birds that travelled to places where flight was no longer an evolutionary advantage, and consequently lost their ability to fly.
As the quality of the avian tree of life improved, a new history emerged.
"It's hard to recognize the relationships among species using just morphology, but when we can use the molecules and appropriate analytical methods to find the relationships, it helps us understand better how that adaptive evolution has occurred," Braun said.
Recently, Warnow worked with Braun, using SATé, to reanalyze his controversial findings. Their study confirmed the evolutionary relationship that Braun found.
Beyond telling us about the family history of the dodo bird, better, faster, more accurate phylogenetic methods can have a life or death impact. When a new virus emerges, the Centers for Disease Control and Prevention uses sequence alignment and evolutionary tree-building tools to determine where it may have come from and how it differs from previous viruses. Plant scientists also use tree-building tools to determine which genes are associated with positive traits like hardiness and drought tolerance. This knowledge is enabling scientists to breed more productive crops, helping to feed the world. But none of these problems are easily solved.
"Many research groups are estimating trees containing anywhere from a few thousand to hundreds of thousands of species, towards the eventual goal of estimating a Tree of Life, containing perhaps as many as several million leaves," Warnow wrote in a recent article in Systematic Biology. "These phylogenetic estimations present enormous computational challenges, and current computational methods are likely to fail to run even on datasets in the low end of this range."
In other words, small problems may be within reach, but the big ones remain.
"It's not getting any easier, but it is getting more fun," Warnow said.
A version of this story first appeared on the TACC website.