A new UK project aims to gather 10 times more genetic data than the 1000 Genomes Project to enable research into human genetic disease.
All human genes have already been identified - through the Human Genome Project. But, when our DNA was fully sequenced and counted, researchers were astounded to find that the genes themselves are not as numerous as they expected - there are only about 23,000 in total - which meant that genes alone didn't hold the answers to questions of inheritance and, perhaps more importantly, to disease.
The current challenge, now being tackled at the Wellcome Trust Sanger Institute in the UK, is to identify which variant of which gene does what.
The 1000 Genomes Project was the largest international research effort to establish the most detailed catalogue of human genetic variation. Now an even larger project, called UK10K and based at the Sanger Institute near Cambridge in the UK, aims to gather 10 times more genetic data. It aims to figure out susceptibility to human genetic diseases in the UK.
"The UK10K project aims to conduct a large-scale genome wide study of 10,000 individuals' genome sequences, to explore rare variants in different types of disease. We are keeping to schedule and the project is already larger than the 1000 Genomes Project in terms of data produced to date," said Thomas Keane, of the institute.
Rarest of them all
Scientists on the UK10K project are locating 'exomes' - the protein-coding parts of the DNA. Amazingly, exomes make up just 1% of the genome; the rest is non-coding DNA, sometimes referred to as 'junk DNA'. Understanding how individuals differ within this small percentage of DNA will help complete the jigsaw of genetic links to many diseases, including autism and congenital heart disease.
"[Exome sequencing] will allow us to identify novel genes associated with rare and common disorders and even identify variants within single individuals. We will assess the role they play in disease pathogenesis. With better understanding of the underlying genetics, we will have a better understanding of the biology behind these conditions," said Karola Rehnstrom, a UK10K researcher.
The UK10K project is scheduled to run for three years: it began in mid-2010 and is expected to finish by mid-2013. Results from the UK10K study will act as a large sample and control dataset that will then be used by genetic researchers to investigate the cause of susceptibilities to certain human diseases.
10,000 people to be sequenced
A total of 10,000 individual genomes are in the UK study. The first 4,000 will act as a healthy control sample that represents the general population. Approximately 2,000 of these come from the Twins UK study of King's College London's Department of Twin Research and Genetic Epidemiology, and 2,000 come from the Avon Longitudinal Study of Parents and Children (ALSPAC), based at Bristol University.
"Twins can help us answer many genetic and environmental questions. Both monozygotic [identical] and dizygotic [non-identical] twins share the same intra-uterine environment and in most cases will share the majority of their childhood through to adulthood. Monozygotic twins also share 100% of their DNA," said Kirsten Ward from King's College, London.
"This helps us to begin to understand whether influences on common diseases such as type 2 diabetes are environmental, for example, from an unhealthy fatty diet and lack of exercise, or are genetic, such as a polymorphism [when multiple variants exist] in a gene that increases blood lipid levels," said Ward.
"For UK10K, twins will act as controls against the disease groups when comparing the frequency of any rare variants found in the disease groups and whether the variants are also found within the controls - and vice versa," she said.
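The comparison Ward describes, checking whether a rare variant turns up more often in a disease group than in the controls, can be illustrated with a one-sided Fisher's exact test. This is only a minimal sketch: the article does not say which statistical method UK10K uses, the function name is hypothetical, and the carrier counts and the disease-group size are invented for illustration. Only the figure of 4,000 controls comes from the article.

```python
from math import comb


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """Probability of seeing at least `case_carriers` carriers among the cases
    if carriers were spread at random over cases and controls
    (one-sided Fisher's exact test, via the hypergeometric distribution)."""
    total_people = n_cases + n_controls
    total_carriers = case_carriers + control_carriers
    upper = min(total_carriers, n_cases)
    # Count the ways to place k carriers among the cases, for every k at
    # least as extreme as the observed value.
    favourable = sum(
        comb(total_carriers, k) * comb(total_people - total_carriers, n_cases - k)
        for k in range(case_carriers, upper + 1)
    )
    return favourable / comb(total_people, n_cases)


# Hypothetical counts: 8 carriers in a disease group of 1,000,
# 4 carriers among the 4,000 controls.
case_freq = 8 / 1000
control_freq = 4 / 4000
p_value = fisher_one_sided(8, 1000, 4, 4000)

print(f"carrier frequency in cases:    {case_freq:.4f}")
print(f"carrier frequency in controls: {control_freq:.4f}")
print(f"one-sided p-value:             {p_value:.3g}")
```

A small p-value would suggest the variant is enriched in the disease group. A real analysis would also need to correct for testing many variants at once, which this sketch does not attempt.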
The other 6,000 study participants will have their exomes sequenced. Each has one of a range of disorders with a genetic link: 1,000 have autism; 2,000 have schizophrenia; 2,000 have extreme obesity; and 1,000 have one of several rare monogenic disorders such as congenital heart disease, ciliopathies and severe insulin resistance. Researchers will compare these sequences with those of the first group to find changes in DNA that are responsible for particular diseases.
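As a quick check, the group sizes quoted in the article can be tallied to confirm that they add up to the 10,000 participants in the project's name. The figures below are taken from the article; the variable names are only illustrative.

```python
# Group sizes as stated in the article.
controls = {
    "Twins UK": 2000,
    "ALSPAC": 2000,
}
disease_groups = {
    "autism": 1000,
    "schizophrenia": 2000,
    "extreme obesity": 2000,
    "rare monogenic disorders": 1000,
}

total_controls = sum(controls.values())
total_disease = sum(disease_groups.values())
total_participants = total_controls + total_disease

print(f"controls:       {total_controls}")
print(f"disease groups: {total_disease}")
print(f"total:          {total_participants}")
```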
Planning for the future
The sequenced genomes of all these individuals will be stored and analyzed using the Sanger Institute's large-scale computing cluster. "We have more than 14,500 cores of computational capacity," said Phil Butcher, the head of IT at Sanger Institute.
"In total, the Sanger Institute has more than 12 petabytes [one petabyte is one million gigabytes] of storage capacity and it continues to grow. We have several large clusters or farms that sit on a common 10Gb/s [1.25 gigabytes per second] backbone. This allowed us to implement several Lustre file systems [a large-scale parallel distributed file system] that can be mounted on any farm if required, but can also be dedicated to specific projects," Butcher said.
"The UK10K project is budgeted to use 1.5 petabytes consisting of a mixture of high performance Lustre and standard NFS [Network File System]," said Keane.
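To put these figures in perspective, the short calculation below converts the 10Gb/s backbone speed into gigabytes per second and estimates how long it would take to move the full 1.5 petabyte UK10K budget across it. This is a rough sketch that assumes decimal units (one petabyte as one million gigabytes, as glossed above) and an ideal, fully used link with no overhead, so real transfers would take longer.

```python
BITS_PER_BYTE = 8
GB_PER_PB = 1_000_000      # decimal units: one petabyte = one million gigabytes
SECONDS_PER_DAY = 86_400

backbone_gbit_per_s = 10   # backbone speed quoted by Butcher
uk10k_budget_pb = 1.5      # UK10K storage budget quoted by Keane
institute_storage_pb = 12  # "more than 12 petabytes", so a lower bound

# Network speeds are quoted in bits, storage in bytes.
backbone_gbyte_per_s = backbone_gbit_per_s / BITS_PER_BYTE

uk10k_budget_gb = uk10k_budget_pb * GB_PER_PB
transfer_seconds = uk10k_budget_gb / backbone_gbyte_per_s
transfer_days = transfer_seconds / SECONDS_PER_DAY

# Upper bound on the share, because the institute has at least 12 PB.
uk10k_share = uk10k_budget_pb / institute_storage_pb

print(f"backbone speed: {backbone_gbyte_per_s} GB/s")
print(f"ideal time to move 1.5 PB: {transfer_days:.1f} days")
print(f"UK10K share of institute storage: at most {uk10k_share:.1%}")
```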
As the project collects more data, the Institute's computing infrastructure team is contemplating expansion. "In the future data sharing will be ever more important, so connections with other grid and network initiatives are always on the agenda. The institute is definitely at the multi-petabyte scale. In the coming five years we estimate we may acquire data volumes in the 30 to 40 petabyte range," said Butcher. Apparently, the human brain is capable of storing just over two petabytes of memory.
"Another area of intense research by several groups is more efficient data formats and deciding how much of the raw data we actually need to store in the long term," he said.
For further discussion, see the article on the 1000 Genomes Project, which covers the data-handling problems that molecular biologists face on these projects.