
We are living in the golden age of exoplanets - over 800 are known, and new discoveries are announced weekly. The Kepler mission, launched in 2009, has discovered over 100 of them, and has reported over 2,700 exoplanet candidates that are under active investigation by astronomers. Impressive as Kepler's achievements are, they only scratch the surface of the rich data set the mission has produced so far.
This scientific bounty is the result of Kepler's simple strategy: taking rapid snapshots of more than 150,000 stars in a patch of sky in Cygnus for as long as eight years, seeking the tiny periodic dips in a host star's light as an exoplanet transits across its disk. By the end of the extended mission, the mission expects to release over 1 million light curves of these stars, many with more than 200,000 individual data points.
One of the ways in which astronomers study light curves is to calculate periodograms, which find statistically significant periodic variations through brute-force analysis of every frequency present in the data. The schematic below shows how this process works. These variations do not by themselves reveal the presence of an exoplanet, but they are a starting point for more detailed analyses that rule out other sources of variation, often important in their own right, that can mimic the signal of an exoplanet. Examples are a stellar companion that grazes the disk of the host star, or starspots on the host star. Moreover, the periodicities found depend on the underlying assumptions about the shape of the variations.
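
As a rough illustration of the brute-force search described above, the sketch below evaluates a periodogram statistic independently at every trial frequency of a grid and reports the strongest one. It uses the Lomb-Scargle statistic, which assumes sinusoidal variations; that choice is an assumption made for illustration, since the article does not say which algorithm the NExScI code applies. The function names, the synthetic light curve and the frequency grid are all hypothetical.

```python
import math


def lomb_scargle_power(times, fluxes, frequency):
    """Normalized Lomb-Scargle power at one trial frequency (cycles per day)."""
    n = len(times)
    mean = sum(fluxes) / n
    variance = sum((f - mean) ** 2 for f in fluxes) / (n - 1)
    omega = 2.0 * math.pi * frequency

    # The time offset tau makes the sine and cosine terms orthogonal.
    sin_2wt = sum(math.sin(2.0 * omega * t) for t in times)
    cos_2wt = sum(math.cos(2.0 * omega * t) for t in times)
    tau = math.atan2(sin_2wt, cos_2wt) / (2.0 * omega)

    cos_sum = sin_sum = cos_sq = sin_sq = 0.0
    for t, f in zip(times, fluxes):
        c = math.cos(omega * (t - tau))
        s = math.sin(omega * (t - tau))
        cos_sum += (f - mean) * c
        sin_sum += (f - mean) * s
        cos_sq += c * c
        sin_sq += s * s
    return (cos_sum ** 2 / cos_sq + sin_sum ** 2 / sin_sq) / (2.0 * variance)


def periodogram(times, fluxes, frequencies):
    """Brute force: each trial frequency is evaluated on its own."""
    return [lomb_scargle_power(times, fluxes, f) for f in frequencies]


# Hypothetical light curve: 10 days of data with a 1% sinusoidal variation
# that repeats every 2.5 days.
true_period = 2.5
times = [0.02 * i for i in range(500)]
fluxes = [1.0 + 0.01 * math.sin(2.0 * math.pi * t / true_period) for t in times]

# Hypothetical trial grid, in cycles per day.
frequencies = [0.05 + 0.005 * k for k in range(400)]

powers = periodogram(times, fluxes, frequencies)
best_frequency = frequencies[powers.index(max(powers))]
best_period = 1.0 / best_frequency
print("strongest period (days):", best_period)
```

The cost scales as the number of data points times the number of trial frequencies, which is why light curves with hundreds of thousands of points become expensive on a single machine.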

Computing periodograms in bulk on a desktop machine isn't feasible: a single periodogram on a 3-GHz processor can take several hours for light curves having more than 100,000 points. Fortunately, each frequency can be sampled independently of the others, so the frequencies can be computed in parallel. We have therefore set out to investigate how we can use high-performance computing platforms to process Kepler data. Our ultimate goal is to compute an atlas of the periodicities present in the entire Kepler data set and deliver it as a resource for astronomers to mine and analyze. Here, we describe the results of the first step in this enterprise: a pilot project that uses the ANSI C-based periodogram code developed at the NASA Exoplanet Science Institute (NExScI) to understand how to process the quarterly data sets released by Kepler on high-performance platforms such as Amazon EC2, Open Science Grid and others.
The table shows the results from this pilot project. All the platforms are able to support the calculations: the differences in performance are in fact mainly due to differences in the parameters used in the calculations. The Pegasus Workflow Management System proved invaluable in setting up a user-friendly environment. By planning the workflows across the compute resources, ensuring that the workflows run as efficiently as possible without excessive stress on the cyberinfrastructure, and managing the transfer of data, Pegasus frees the astronomer from handling the details of running the applications. It proved especially valuable in the final run (see row 6 of the table), which processed 1.1 million light curves on the SDSC Trestles cluster: Pegasus clustered the 2.2 million workflow tasks into 372 executable jobs.
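
To give a feel for what clustering 2.2 million tasks into 372 jobs means, here is a minimal sketch that splits a list of task identifiers into near-equal batches. It is not Pegasus's clustering algorithm or API; the function `cluster_tasks` is hypothetical, and the task count is the rounded figure quoted above.

```python
def cluster_tasks(task_ids, n_jobs):
    """Split task_ids into n_jobs contiguous batches whose sizes differ by at most one."""
    if n_jobs <= 0:
        raise ValueError("n_jobs must be positive")
    base, extra = divmod(len(task_ids), n_jobs)
    jobs = []
    start = 0
    for job_index in range(n_jobs):
        # The first `extra` jobs each take one additional task.
        size = base + (1 if job_index < extra else 0)
        jobs.append(task_ids[start:start + size])
        start += size
    return jobs


# Rounded figures from the final run: about 2.2 million tasks, 372 jobs.
tasks = range(2_200_000)
jobs = cluster_tasks(tasks, 372)
sizes = [len(job) for job in jobs]
print("jobs:", len(jobs), "tasks per job:", min(sizes), "to", max(sizes))
```

With these numbers each job carries roughly 5,900 tasks, so the scheduler handles a few hundred submissions instead of millions of very short ones.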

We now have in place the infrastructure and tools required to complete the project, which will involve the much larger-scale task of processing the data again, this time after combining data from multiple quarters. Combining data in this manner is necessary because the spacecraft must rotate every three months to keep its solar panels pointed towards the Sun, after which the starlight falls on a different part of the telescope camera. Since Kepler is ultimately looking for transiting Earth-like planets with periods of up to a year, for which a transit occurs only once a year, combined data from multiple 'quarters' are needed to find these rare signals.
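
One simple way to picture combining quarters is sketched below: because the same star lands on a different part of the camera after each rotation, its recorded brightness level can shift from quarter to quarter, so each quarter is divided by its own median before the segments are joined. This is an illustrative assumption, not the procedure used in the project; the function `stitch_quarters` and the sample numbers are hypothetical.

```python
from statistics import median


def stitch_quarters(quarters):
    """Join per-quarter (times, fluxes) pairs into one time-ordered relative light curve."""
    combined = []
    for times, fluxes in quarters:
        # Normalize each quarter to its own median so that level shifts
        # between detector positions do not appear as false variations.
        level = median(fluxes)
        combined.extend((t, f / level) for t, f in zip(times, fluxes))
    combined.sort()
    stitched_times = [t for t, _ in combined]
    stitched_fluxes = [f for _, f in combined]
    return stitched_times, stitched_fluxes


# Two hypothetical quarters of the same star, recorded at different levels.
quarter_1 = ([0.0, 1.0, 2.0], [1000.0, 1010.0, 990.0])
quarter_2 = ([90.0, 91.0, 92.0], [2000.0, 2020.0, 1980.0])

times, fluxes = stitch_quarters([quarter_1, quarter_2])
print(times)
print(fluxes)
```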
Find out more about the Pegasus Workflow Management System: 'Sharing science in the collaboratory'.