Feature - Grid in a cloud: Processing the astronomically large
(Editor's note: Alfonso Olias is part of a European Space Agency team working on a project that involves processing data from one billion stars - with some individual stars surveyed multiple times. Their solution? To run a grid inside a cloud. Here, he gives a first-hand account on their effort.)
We recently experimented with running a grid inside a cloud in order to process massive datasets, using test data drawn from something astronomically large: data from the Gaia project.
Gaia is a European Space Agency mission that will conduct a survey of one billion stars in our Galaxy - approximately 1% of the Milky Way galaxy. Over a five-year period, it will monitor each of its target stars about 70 times, precisely charting their positions, distances, movements and changes in brightness. Gaia is expected to discover hundreds of thousands of new celestial objects, such as extra-solar planets and failed stars called brown dwarfs, as well as identify tens of thousands of asteroids within our own solar system.
There are other sky survey projects, such as the Sloan Digital Sky Survey. Gaia is different from the others in that the primary objective of Gaia is astrometry: the precise measurement of positions and motions, using a satellite designed specifically for that purpose. For various environmental reasons - such as distortions caused by the Earth's atmosphere - high-precision, absolute astrometry can not be performed on ground. Gaia will "zoom in" to extract cutouts around sources of interest, while ground-based projects such as Sloan will make complete images of the entire sky.
Gaia will detect and characterize tens of thousands of extra-solar planetary systems and provide a comprehensive survey of objects ranging from the huge number of minor bodies in our solar system, through galaxies in the nearby universe, to distant quasars. Furthermore, the data will provide stringent new tests of Albert Einstein's theory of general relativity.
The in-house processing environment was replicated by configuring an Oracle instance (an image of Oracle running on a virtual machine) to contain the scientific data, and a whiteboard was used to publish computed jobs.
In order to execute the jobs and process the data, an in-house distributed computing framework was configured to run the Astrometric Global Iterative Solution (AGIS), which runs a number of iterations over the data until it converges.
"Only one line of code was changed in order to run it on the cloud."
The system works as follows: Working nodes get a job description from the database, retrieve the data, process it and send the results to intermediate servers. These intermediate servers run dedicated algorithms and update the data for the following iteration. The process continues until the data converges.
* The amount of data increases over the 5-year mission.
In order to port to the cloud, Amazon Machine Images (AMIs) were configured for the Oracle database, the grid and the AGIS software. The result is an Oracle grid running inside an Amazon cloud; a full relational database rather than a cloud database service. Five cloud storage volumes of 100 gigabytes each were attached to the database virtual machine (VM). Another VM was configured with the grid and the AGIS software. Only 1 line of code was changed in order to run it on the cloud.
To process 5 years of data for 2 million stars, 24 iterations of 100 minutes each were done, which translates into 40 hours of running a grid of 20 Amazon Elastic Compute Cloud (EC2) high-CPU instances.
For the full billion-star project, 100 million primary stars will be analyzed, plus 6 years of data, which will require a total of 16,200 hours on a 20-node EC2 cluster. The estimated cost calculated for the cloud-based solution is less than half the cost of an in-house solution, even when the additional electricity and system administration costs of the in-house solution are not taken into account.
A second test was done by running 120 High CPU Extra Large Virtual Machines (VMs). Each VM was running 12 threads, so there were 1440 processes working in parallel. Performance problems associated with SQL queries and lock contention at the database were detected and resolved, which could not have been found with the current cluster. Thus, the cloud allowed us to find and solve performance and scalability problems before going to production.
Using cloud computing we can scale up to massive capacities (both processing and storage) in a matter of minutes without having to invest in new infrastructure, train new personnel or license new software. We can have a peak load capacity without incurring the higher costs of building larger data centers and maintaining the servers and networks.
-Alfonso Olias - The Server Labs, at the European Space Agency (ESA).