In 2008, molecular biologists around the world joined forces and set out on an ambitious three-year project: to sequence the genomes of 1,000 people. Through the effort, called the 1000 Genomes Project, they hoped to identify the gene variants common across nationalities and the genetic basis of susceptibility to many diseases.
What they didn't fully anticipate was just how rapidly sequencing technology would advance. "Our institute joined the project because we wanted to [perform] second generation sequencing technology. Now, our genome tool kit can analyze thousands more bytes of data - cheaper and faster - than three years ago," said Li Yingrui from the BGI (formerly the Beijing Genomics Institute, which dropped the name when the headquarters moved to Shenzhen).
So, when they finished sequencing the first 1,000 genomes in mid-2010, they raised the target and are now aiming for more than double the original number: 2,500 genomes. The new bottleneck in the project, though, is the efficient transfer and analysis of genetic data after a genome has been sequenced.
The data generated by the project, which is co-led by David Altshuler from the Broad Institute in Cambridge, USA, and Richard Durbin from the Sanger Institute near Cambridge in the UK, is held by and distributed from the European Bioinformatics Institute (EBI) and the US National Center for Biotechnology Information (NCBI), which is part of the US National Institutes of Health. There will also be a mirror website for data access in Shenzhen, China.
But for now, the largest sequence datasets are often shipped between sites by mail.
"I know this is absurd"
"One genetic sequencer can generate half a terabyte of nucleotides [basic structural unit of DNA] per run in one week. There are thousands of sequencers producing data throughout the world," said Li.
"Once genetic data is processed, it is copied to hard disks and sent via mail to another institute for analysis synchronization. I know this is absurd, but this is a fact," said Li.
It may take only a week to generate half a terabyte of data, but once it is generated, researchers can spend up to two weeks copying the data out, mailing it, and uploading it onto a new machine for analysis. This is because current Internet bandwidth is too low.
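To make the bandwidth problem concrete, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only: the half-terabyte run size comes from Li's figure above, while the sustained throughput values and the function name are assumptions chosen for illustration, not measurements from any of the institutes.

```python
def transfer_days(size_terabytes: float, throughput_mbit_per_s: float) -> float:
    """Days needed to move a dataset at a given sustained throughput."""
    bits = size_terabytes * 1e12 * 8                 # decimal terabytes -> bits
    seconds = bits / (throughput_mbit_per_s * 1e6)   # megabits/s -> bits/s
    return seconds / 86_400                          # 86,400 seconds per day


RUN_SIZE_TB = 0.5  # one sequencer run, per Li's figure

# Hypothetical sustained end-to-end throughputs, in megabits per second.
ASSUMED_RATES_MBIT = [10, 100, 1000]

for rate in ASSUMED_RATES_MBIT:
    days = transfer_days(RUN_SIZE_TB, rate)
    print(f"{rate:>5} Mbit/s: {days:.2f} days per run")
```

Under these assumptions, a single run takes about 4.6 days to move at a sustained 10 megabits per second, and under half a day at 100. The figures are for one run from one sequencer; an institute moving the output of many machines at once would multiply them accordingly.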
When cloud stops being cost effective
"The main issue for us is that our data sizes are so large, that the cost and difficulty of moving the data to the cloud stops it being cost effective for many jobs. We do use the cloud for the Ensembl genomes database, but only to provide [data] mirrors that are closer to users," said Phil Butcher, Head of IT at the Sanger Institute, one of the major research institutes involved in the project and located near Cambridge in the UK.
"We have looked at volunteer computing, but it has never seemed sensible because of data and network issues. We distressingly often resort to shipping hard disks around to transfer data between centers, rather than use the internet, or even via Aspera which is faster than ftp [file transfer protocol]," Richard Durbin said.
A team of scientists from labs around the world - a type of academic social network - is the way forward, Li said. Data could then be stored and analyzed in an academic computing cloud which researchers could access remotely. The issue is pressing enough that the BGI has an open access journal dedicated to the topic: GigaScience.
The show must go on
While data transfer issues continue to distress those in charge, the science coming out of the project nevertheless continues at a fast pace. From the first phase of the project - when the 1,000 genomes were sequenced - the teams found that each person carries approximately 250 to 300 loss-of-function variants, which leave a gene with reduced or no function, and 50 to 100 variants previously implicated in inherited disorders.
More fundamentally, though, the project plans to characterize over 95% of variants that have a frequency of 1% or higher in each of five major population groups (populations in or with ancestry from the Americas, East Asia, South Asia, Europe, and West Africa).
This will form a "high-resolution genetic map," said Li. The map will then serve as a baseline for future studies, such as the identification of genetic susceptibility to disease.
In fact, the leaps in sequencing technology have allowed the project to increase in scope. "The 1000 Genomes Project is now sampling from several more populations than were originally proposed," said Thomas Keane, researcher at the Sanger Institute.
"Now we can focus on individual ethnic groups," Li said. BGI contributes genomes from two Chinese ethnic groups to the 1000 Genomes Project: the Han (the largest ethnic group in the world), sampled as northern and southern Han, and the far less numerous Dai people.
The field of molecular biology won't stop here. Next week, iSGTW will carry a feature about a more focused project with even more genomes, the UK10K project, which will sequence parts of the genome of 10,000 people in the UK.