When the sequencing of the human genome was officially declared complete in 2003, it was the culmination of 13 years of work by dozens of research sites across 18 countries, in a partnership between the public and private sectors. The publicly funded portion alone cost an estimated $3 billion.
Nowadays, the complete genome of a simple organism can be sequenced in a matter of hours, at a tiny fraction of the cost, thanks to a new generation of analytical instruments. Determining the order of the nucleotides in a DNA molecule (DNA sequencing) now costs as little as a few dollars. Consequently, big sequencing projects have shifted to determining the specific sequences of small populations of individuals, giving us the ability to home in on the differences between them at the sequence level (variants) and how they relate to specific individual traits, such as those causing diseases like cancer.
But a bottleneck has appeared, in the form of the alignment and post-processing of the huge amount of data. To minimize both processing time and memory requirements, the Bioinformatics Unit, Structural Biology and Biocomputing Program at the Spanish National Cancer Research Center (CNIO) recently teamed up with The Server Labs in an experiment to see if they could develop a cloud-based solution that would meet their genomic processing needs.
With the pay-per-use concept of the cloud, CNIO stood to enjoy multiple benefits. The organization would spend less time and money on maintaining and upgrading its internal IT department. It would also spend less on purchases and upgrades of computational resources and software licenses, as well as on external resources and the salaries of expert administrators.
There were other potential benefits as well. Because the number of sequencing experiments that CNIO runs can vary greatly, the cloud removes the risk of over-investing in equipment and gives access to additional resources from the cloud provider on an as-needed basis.
In addition, in comparison to those working with supercomputers or "Big Iron," researchers on the cloud could more easily share results while controlling access. And by storing their experimental data in the cloud, researchers can ensure their data is safely replicated among data centers, as a form of backup.
To run the cloud case study, the team executed some typical genomic workflows. The first step was identifying a suitable computational environment, including hardware architecture, operating system and genomic processing tools. They identified five software packages for genomic testing, all of which were open source and freely available.
One of the requirements of these tools was that the underlying hardware architecture be 64-bit. For the initial proof of concept, they decided to run a base image with Ubuntu 9.10 for 64-bit on an Amazon EC2 large instance with 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each) and 850 GB of local instance storage. Once they had selected the base image and instance type to run on the cloud, the team proceeded to automate the installation and configuration of the software packages. (Automating this step ensured that no additional setup tasks would be required when launching new instances in the cloud, and provided a controlled and reproducible environment for genomic processing.)
By using a cloud-management platform called RightScale, they were able to separate the selection of instance type and base image from the installation and configuration of software specific to genomic processing. First, they created a server definition for the instance type and operating system specified above. The team then scripted the installation and configuration of the genomic processing tools, as well as any OS customizations, so that these steps could be executed automatically when new instances first booted.
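The idea of a scripted, repeatable setup can be sketched in a few lines of Python. This is only an illustration and not RightScale's own mechanism: the package names are hypothetical placeholders (the five packages are not named here), and the script simply assumes that the tools can be installed with Ubuntu's apt-get. By default it runs in dry-run mode and only prints what it would do.

```python
import subprocess

# Assumption: placeholder names; the article does not list the five packages.
HYPOTHETICAL_PACKAGES = ["genomic-tool-a", "genomic-tool-b"]


def build_install_commands(packages):
    """Return the commands a boot script would run, one argument list per command."""
    commands = [["apt-get", "update"]]
    for name in packages:
        commands.append(["apt-get", "install", "-y", name])
    return commands


def run_bootstrap(commands, dry_run=True):
    """Run each command in order; with dry_run=True, only print what would run."""
    for argv in commands:
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            subprocess.run(argv, check=True)  # stop at the first failing step
    return len(commands)


if __name__ == "__main__":
    run_bootstrap(build_install_commands(HYPOTHETICAL_PACKAGES))
```

Keeping the list of steps separate from the code that executes them is what makes the environment reproducible: every new instance runs exactly the same steps in the same order.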
Once the new instances were up and running and the software environment finalized, they executed some typical genomic workflows suggested by CNIO.
They found that for their typical workflow with a raw data input between 3 and 20 GB, the total processing time on the cloud would range between 1 and 4 hours, depending on the size of the raw data and the nature of the sequencing experiment. The cost of pure processing tasks totaled less than two dollars for a single experiment.
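As a rough check on that figure, the compute cost of one experiment is simply the billed instance-hours multiplied by the hourly rate. The sketch below makes two assumptions that are not stated above: a hypothetical on-demand rate of $0.38 per hour for a large instance, and billing in whole hours, so that partial hours are rounded up.

```python
import math

# Assumption: hypothetical hourly rate in USD; the article does not state it.
ASSUMED_HOURLY_RATE_USD = 0.38


def processing_cost(hours, hourly_rate=ASSUMED_HOURLY_RATE_USD, instances=1):
    """Estimate compute cost, assuming partial hours are billed as full hours."""
    billed_hours = math.ceil(hours)
    return billed_hours * hourly_rate * instances


if __name__ == "__main__":
    for hours in (1, 2.5, 4):
        print(f"{hours} h -> ${processing_cost(hours):.2f}")
```

Under these assumptions, even the longest four-hour run stays below two dollars, which is consistent with the cost the team reported. Storage and data transfer charges are not included.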
CNIO's genomic facilities were able to process up to 20-25 sequencing runs in a sequencer. They expect to analyze about 150 sequencing lanes per year, each generating about 30 gigabytes of input data on average, for a total of 3 to 4.5 terabytes in storage and processing requirements.
They found that processing times in the cloud were comparable to running the same workflow in-house on similar hardware. However, when processing in the cloud, transferring input data to Amazon could become a bottleneck. They said that they worked around this limitation by "processing our data on Amazon's European data center and avoiding data transfer during peak hours."
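To see why the upload can dominate, it helps to estimate the transfer time for a given input size and sustained bandwidth. The sketch below is back-of-the-envelope arithmetic only. The bandwidth figures are hypothetical, since no measured upload speeds are given here, and decimal units are assumed (1 GB = 8,000 megabits).

```python
def transfer_hours(size_gb, bandwidth_mbit_s):
    """Hours needed to upload size_gb gigabytes at a sustained bandwidth in Mbit/s."""
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / bandwidth_mbit_s / 3600


if __name__ == "__main__":
    # Assumption: illustrative bandwidths, not measurements from the case study.
    for bandwidth in (10, 100):
        print(f"20 GB at {bandwidth} Mbit/s: {transfer_hours(20, bandwidth):.1f} h")
```

With a slow link, uploading a 20 GB input can take longer than the 1 to 4 hours needed to process it, which is why the choice of data center and transfer window matters.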
Maximizing the Advantages of the Cloud
To truly realize the benefits of the cloud, however, the team found that they needed an architecture that allows tens or hundreds of experiment jobs to be processed in parallel. This would allow researchers, for instance, to run algorithms with slightly different parameters and analyze the impact on their experiment results. At the same time, they wanted a framework that incorporates all of the strengths of the cloud, in particular data durability, publishing mechanisms and audit trails to make experiment results reproducible.
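Running the same analysis with slightly different parameters amounts to generating one job per parameter combination. A minimal sketch follows, in which the parameter names (max_mismatches, seed_length) and the input file name are hypothetical examples rather than settings from the CNIO workflows.

```python
from itertools import product


def build_jobs(input_file, parameter_grid):
    """Create one job description per combination of parameter values."""
    names = sorted(parameter_grid)
    jobs = []
    for values in product(*(parameter_grid[name] for name in names)):
        jobs.append({"input": input_file, "params": dict(zip(names, values))})
    return jobs


if __name__ == "__main__":
    # Assumption: hypothetical parameter names and values, for illustration only.
    grid = {"max_mismatches": [1, 2, 3], "seed_length": [28, 32]}
    for job in build_jobs("lane_001.fastq", grid):
        print(job)
```

Because each job is independent of the others, the resulting list can be handed to as many worker instances as the demand justifies.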
To meet these goals, The Server Labs is developing a genomic processing platform that builds on top of RightScale's RightGrid batch processing system. They expect that the platform will facilitate the processing of a large number of jobs by leveraging Amazon's EC2, SQS, and S3 web services in a scalable and cost-efficient manner to match demand. The framework could also take care of scheduling, load management, and data transport, so that the genomic workflow can be executed locally on experiment data available to the EC2 worker instance.
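The general pattern behind such a platform is a queue of jobs drained by worker instances: each worker takes a job, fetches the input, runs the workflow locally and stores the result. The sketch below illustrates only that pattern and is not RightGrid's implementation. An in-memory queue stands in for SQS, a plain dictionary stands in for S3, and the toy workflow is a hypothetical placeholder for a real genomic tool.

```python
import queue


def run_worker(job_queue, storage, workflow):
    """Drain the queue: fetch input, run the workflow locally, store the result."""
    processed = 0
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return processed
        data = storage[job["input_key"]]          # stand-in for a download from S3
        result = workflow(data, **job["params"])  # local processing on the worker
        storage[job["output_key"]] = result       # stand-in for an upload to S3
        job_queue.task_done()
        processed += 1


def toy_workflow(reads, min_length=0):
    """Hypothetical placeholder workflow: keep reads of at least min_length bases."""
    return [read for read in reads if len(read) >= min_length]


if __name__ == "__main__":
    storage = {"raw/lane1": ["ACGT", "AC", "ACGTAC"]}
    jobs = queue.Queue()
    for n in (2, 4):
        jobs.put({"input_key": "raw/lane1", "output_key": f"out/min{n}",
                  "params": {"min_length": n}})
    print(run_worker(jobs, storage, toy_workflow), "jobs processed")
```

Because workers share nothing except the queue and the storage, adding capacity means starting more workers, and an idle queue means instances can be shut down.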
To make genomic processing on the cloud even simpler, the on-demand model could be taken one step further by providing pay-as-you-go software as a service.
-Gerardo Viedma, Alfonso Olias and Paul Parsons