Feature - Superlinks to identify genetic culprits

A graphic map of a particularly complex family tree. The squares represent males, while the circles represent females. Individuals affected by a genetic mutation are represented with red squares or circles. Yellow lines indicate a marriage between relatives. Image courtesy of Kwanghyuk (Danny) Lee, Baylor College of Medicine.

Once scientists know which mutation causes a disease, they can apply that knowledge in their search for a cure. Likewise, doctors can recommend lifestyle changes that will alter the course of the disease. But the computer analysis used to identify these mutations would take years to complete on a single computer.

Superlink-online, a distributed system developed at the Technion-Israel Institute of Technology, helps researchers perform their analyses in a matter of days by distributing the computations over thousands of computers worldwide. Geneticists submit their data through the web portal with a single click and get their results via email, ready to use. Behind the scenes, the system splits the computations into hundreds of thousands of independent jobs, runs them using the available resources, and assembles the results back into a single data set.
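The split, run, and assemble pattern described above can be illustrated with a minimal sketch. This is not Superlink-online's code: the function names are hypothetical, a sum of squares stands in for the real linkage computation, and local threads stand in for the remote, non-dedicated hosts.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_jobs(items, job_size):
    """Cut one large task into independent jobs of at most job_size items."""
    return [items[i:i + job_size] for i in range(0, len(items), job_size)]


def run_job(job):
    """Hypothetical stand-in for one job's computation (sum of squares)."""
    return sum(x * x for x in job)


def run_distributed(items, job_size=1000, workers=4):
    """Split the work, run the jobs concurrently, and assemble one result."""
    jobs = split_into_jobs(items, job_size)
    # Assumption: local threads play the role of remote hosts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(run_job, jobs))
    # Assemble the partial results back into a single answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(run_distributed(list(range(10_000))))
```

Because the jobs do not depend on one another, they can finish in any order on any machine. That independence is what makes opportunistic use of idle computers practical.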

Geneticists use Superlink-online to do statistical computations called genetic linkage analysis. The analysis maps a family's genealogy and the genetic makeup of its members onto a graphical model that represents the likelihood of a gene being linked to a disease. The goal is to determine the location of the genes that cause a disease.

When Dan Geiger and his students at the Technion first released Superlink in 2002, it ran on a single computer. By 2005, as more computer power was needed to perform increasingly complex analyses, then-doctoral student Mark Silberstein began working on a distributed version. Silberstein and his advisor, Assaf Schuster, realized that the only way to meet researchers' demand for increasingly powerful computing was to enable opportunistic use of non-dedicated computers.

"[The data] was too complex to analyze on one CPU...," said Silberstein. "It was impossible to provide 'service' with this quality of service." The opportunistic model was chosen, explained Silberstein, because "with literally zero budget for purchasing and maintaining dedicated hardware, and with the actual resource demand reaching thousands of CPUs, we could not afford any other model."

In early 2006, thanks to close collaboration with the Condor team at the University of Wisconsin-Madison, the first version of Superlink-online was released. At the time, the system used UW-Madison's Condor pool and the Technion's own home-brewed Condor pool, which had about 100 CPUs. Eventually, additional resources came from the Open Science Grid, EGEE, and the Superlink@Technion community grid, which uses the idle cycles on the home computers of volunteers.

It's a powerful combination. Over one 3-month period, more than 25,000 non-dedicated hosts from the grids Superlink-online uses actively participated in the computations, reaching a peak effective throughput roughly equal to that of a dedicated cluster of 8,000 cores.

With access to that much computing power, the system has enabled hundreds of geneticists worldwide to analyze much larger data sets, producing results 100 times faster than the serial version. As a result, several rare disease-causing mutations have been found, including the mutations that cause Hereditary Motor and Sensory Neuropathy, "Uncomplicated" Hereditary Spastic Paraplegia, and Ichthyosis.

What's next? Silberstein and his colleagues are finalizing a version of Superlink-online that will significantly extend its power. In addition to the aforementioned grids, the new system will access the Tokyo Institute of Technology's Tsubame supercomputer.

-Marcia Teckenbrock, for iSGTW and Open Science Grid