A cosmic scavenger hunt is on, as NASA's Kepler Mission scours the sky for Earth-like planets.
Since it began taking data nearly two years ago, the space-based instrument has already been successful beyond expectations. With last week's announcement of six confirmed planets and several hundred candidate planets, it has identified a total of 15 confirmed and 1235 candidate planets. Among those, 68 are Earth-sized and 54 are located in their star's habitable zone (close enough to the star to keep water from freezing, but far enough that it will not boil away into steam - necessary characteristics for a planet to support life like ours). The two categories overlap in the case of five candidates that are both Earth-sized and in the habitable zone.
All those candidates mean a lot of data to analyze and store. So it shouldn't come as a surprise that the Kepler Mission is supported by varied and innovative cyberinfrastructure.
It came from space!
Nearly 30 million kilometers away, in the cold vacuum of space, Kepler orbits the Sun as it gathers data from a patch of sky located in the constellations of Cygnus and Lyra. This particular region, which covers 1/400 of the entire sky, was chosen for several reasons.
First, to detect planets, Kepler has to observe several transits, in which a planet passes in front of its own star and causes a flicker in the star's brightness. If that orbit is at all like Earth's, viewing even three such transits could take more than three years of observation. And since different planets in different solar systems will transit at different times in our year, the patch of sky Kepler observes must be visible year-round. In the northern hemisphere, where most of the NASA installations capable of following up on Kepler's findings are located, the region Kepler will observe for its 3.5-year mission is the only one densely filled with stars that is never blocked by the Sun, Moon, or Earth.
Second, while Kepler is seeking planets of all kinds, its special goal is to find Earth-like planets. An Earth-like planet must orbit a star similar to our own. Furthermore, Kepler can only see a planetary transit if the planetary system is nearly perfectly aligned to face us. Yet all orientations are equally likely, so the probability that a randomly oriented planetary system will be aligned such that Kepler can detect the transit of an Earth-like planet is only 0.5 percent. To increase the likelihood of finding this scavenger hunt's top prize, scientists determined that Kepler has to observe 100,000 stars. The patch of sky Kepler is observing is sufficiently rich in stars to meet that requirement.
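That 0.5 percent figure follows from simple geometry: for a circular orbit, the chance that a randomly oriented system is aligned well enough to transit is roughly the star's radius divided by the orbital radius. Here is a quick back-of-the-envelope check, using standard values for the Sun's radius and the Earth-Sun distance:

```python
# Back-of-the-envelope check of the ~0.5 percent alignment probability.
# For a circular orbit, a randomly oriented planetary system shows
# transits with probability ~ R_star / a (stellar radius over orbital
# radius). Values below are for a Sun-like star and an Earth-like orbit.

R_SUN_KM = 6.957e5      # solar radius in kilometers
AU_KM = 1.496e8         # one astronomical unit in kilometers

def transit_probability(star_radius_km, orbit_radius_km):
    """Probability that a randomly inclined orbit transits as seen by us."""
    return star_radius_km / orbit_radius_km

p = transit_probability(R_SUN_KM, AU_KM)
print(f"{p:.2%}")  # roughly half a percent
```

Observing 100,000 stars at those odds yields a few hundred well-aligned Earth-like systems, which is why the star-rich field matters so much.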
Every 6.5 seconds, Kepler adds the last 6.5 seconds' worth of data to a running total, resulting in a single image; this process, called co-adding, is performed by a Field Programmable Gate Array that was programmed before Kepler was launched. After repeating this process for half an hour, Kepler's RAD750 computer compresses and stores the accumulated image while simultaneously beginning to record a new one. Because scientists are interested only in the postage stamp-sized images of each target star and the related calibration pixels, most of the pixel data is discarded, leaving only six percent to be compressed and stored onboard.
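The co-adding cycle described above can be sketched as follows. Onboard, it runs in an FPGA rather than software, and the frame contents and "postage stamp" positions below are invented for illustration:

```python
# Sketch of Kepler's onboard co-adding cycle (illustrative only; onboard
# this runs in a Field Programmable Gate Array, not Python). The toy
# 6-pixel frames and stamp positions are made up for the example.

FRAME_SECONDS = 6.5
CADENCE_SECONDS = 30 * 60          # one stored image per half hour
FRAMES_PER_CADENCE = int(CADENCE_SECONDS / FRAME_SECONDS)  # ~276 frames

def coadd(frames):
    """Sum a list of equally sized frames pixel by pixel into one image."""
    total = [0] * len(frames[0])
    for frame in frames:
        for i, px in enumerate(frame):
            total[i] += px
    return total

def extract_stamps(image, stamps):
    """Keep only the pixel indices of interest (the 'postage stamps');
    onboard, this step discards roughly 94 percent of the pixels."""
    return {name: [image[i] for i in idxs] for name, idxs in stamps.items()}

frames = [[1, 2, 3, 4, 5, 6]] * FRAMES_PER_CADENCE   # toy 6-pixel frames
image = coadd(frames)
stamps = extract_stamps(image, {"star_A": [1, 2], "star_B": [4]})
```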
To avoid introducing errors into the data during compression, SETI Institute scientist and Kepler Analysis Lead Jon Jenkins had to design a lossless compression algorithm that would allow the team to store as much data as possible - approximately 66 days' worth, in fact, requiring only 4.6 bits per pixel.
"We do, however, manage the quantization noise of the measurements as an element of the error budget," Jenkins said. "So that allows us to compress from 23 bits per pixel measurement to 16 bits per pixel in the first stage of the compression."
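Jenkins's numbers suggest a two-stage scheme: a noise-budgeted requantization from 23 bits down to 16, followed by lossless encoding of the requantized values. The sketch below illustrates that general idea with a stand-in step size and a simple delta encoder; it is not Kepler's actual algorithm:

```python
# Illustrative two-stage compression: first requantize 23-bit pixel
# measurements down to 16 bits (a controlled, noise-budgeted loss), then
# losslessly encode the result. The step size and the delta encoding are
# stand-ins for the example, not Kepler's real scheme.

def requantize(value_23bit, step=128):
    """Map a 23-bit measurement into 16 bits by dividing by a step size
    chosen so quantization noise stays within the error budget."""
    q = value_23bit // step
    assert q < 2**16
    return q

def delta_encode(values):
    """Losslessly encode a sequence as its first value plus differences;
    slowly varying pixel streams yield small, highly compressible deltas."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

pixels = [requantize(v) for v in [8_000_000, 8_000_128, 8_000_256]]
encoded = delta_encode(pixels)
assert delta_decode(encoded) == pixels   # the second stage is lossless
```

Only the first stage loses information, and that loss is budgeted against the measurement noise; everything after it round-trips exactly.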
Once each month, the data are stored as CCSDS packets, with Reed-Solomon encoding to protect against bit errors during transmission. Then Kepler's High Gain Antenna is re-oriented to point at Earth so that Kepler can downlink at Ka-band to one of the Deep Space Network stations here on Earth. This entire process generally causes a break in data collection of up to two days. So far, their approach has been successful in avoiding error.
"We've lost very little engineering data and no science data to bit errors," Jenkins said. "We do retransmit data that failed to come down the first time, mostly due to weather-canceled contacts and equipment outages."
Because there is enough space onboard to store 66 days of data, a corrupted transmission can simply be added to the next month's downlink. Kepler can also be commanded in real time from its Mission Operations Center at the Laboratory for Atmospheric and Space Physics in Boulder, Colorado, USA. That way, if the twice-weekly transmission of engineering and science data suggests that something is wrong, the team can do something about it.
Back on Earth...
The Science Operations Center (SOC) uses four clusters consisting of 584 CPUs, 2.9 TB of RAM, and 148 TB of raw storage; currently, 30 TB of that storage is occupied. The infrastructure and pipeline control are written in Java, and all the science algorithms are written in MATLAB.
"I'm to blame for MATLAB: It's been a mainstay for me in terms of developing my own algorithms and pipelines for Kepler and for my previous work on remote sensing of Venus' atmosphere. MATLAB is very flexible and is very efficient with respect to allowing you to quickly identify bugs and issues and resolve them quickly - and it is very efficient so long as you take advantage of the built-in capability of vectorizing your code.
The software engineers we hired to develop the pipeline infrastructure chose Java because it's become the standard in the industry for transaction-based processing." -Jon Jenkins, Analysis Lead, Kepler Mission
Analyzing the data Kepler sends home each month is a five-step process, beginning with pixel calibration. This step puts the data on a linear scale with a well-defined zero point, corrects for the fact that Kepler operates without a shutter, removes on-chip artifacts, and applies corrections standard in CCD astronomy.
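The core of that first step - putting raw counts on a linear scale with a well-defined zero point - amounts to subtracting a bias and dividing by a gain. A minimal sketch, with invented bias and gain values (the real calibration also handles shutterless smear and other on-chip artifacts):

```python
# Minimal sketch of the idea behind pixel calibration: subtract a bias
# (the zero point) and divide by a gain so raw detector counts land on a
# linear scale. The bias and gain values here are invented; Kepler's real
# calibration also corrects smear from shutterless operation and other
# on-chip artifacts.

def calibrate(raw_counts, bias=400.0, gain=2.0):
    """Convert raw ADC counts to calibrated values: zero light maps to
    zero, and equal increments of light give equal increments of output."""
    return [(c - bias) / gain for c in raw_counts]

calibrated = calibrate([400.0, 600.0, 800.0])  # -> [0.0, 100.0, 200.0]
```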
The second step is to measure the brightness and location of each target star in each frame. Third is identifying and removing instrumental signatures and other sources of systematic error.
The fourth step is when things get interesting. This is when the Kepler team searches for planetary transits, in the form of periodic drops in brightness, each lasting between one and 15 hours. This is the step that has identified over a thousand planet candidates, including the five Earth-sized candidates located in the habitable zone.
Finally, the fifth step consists of a suite of diagnostic tests applied to candidates from step four in order to confirm that the periodic drops in brightness are caused by a planet transiting in front of a star.
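The heart of step four - finding periodic drops in brightness - can be illustrated with a toy phase-folding search. Real transit searches use far more sophisticated matched filters; this sketch only shows the basic idea that a dip repeating at the right period reinforces itself when the light curve is folded:

```python
# Toy transit search: fold a light curve at a trial period and look for
# a repeated drop in brightness. The synthetic light curve below is flat
# except for a 1 percent dip every 10 samples.

def fold(flux, period):
    """Group samples by phase (index modulo period) and average them, so
    a dip that repeats every `period` samples stands out above the noise."""
    phases = [[] for _ in range(period)]
    for i, f in enumerate(flux):
        phases[i % period].append(f)
    return [sum(p) / len(p) for p in phases]

def deepest_dip(folded):
    """Return (phase, depth) of the largest drop below the median level."""
    level = sorted(folded)[len(folded) // 2]
    phase = min(range(len(folded)), key=lambda i: folded[i])
    return phase, level - folded[phase]

flux = [0.99 if i % 10 == 3 else 1.0 for i in range(100)]
phase, depth = deepest_dip(fold(flux, 10))   # finds the dip at phase 3
```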
Normally, these steps are part of a pipeline, as each step needs to be complete before the next step can start. But the work from each step can often be broken into smaller tasks that can be completed in parallel. For instance, the first step - pixel calibration - can be distributed to 84 cores, as Kepler has 42 CCDs with two readout channels each. The same can then be done with the second step - photometric analysis.
With steps four and five, the team can break the work down into small batches of target stars and treat it as a massively parallel problem. Even so, it takes two weeks to process three months of Kepler data on their 512-core cluster. And as Kepler's data accumulates, the analysis time will increase as the square of the data accumulated.
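The batching strategy described above can be sketched with a worker pool. The batch size and the per-star "analysis" below are placeholders, and a thread pool stands in for the cluster's cores:

```python
# Sketch of batching target stars for parallel processing: split the star
# list into small batches and hand each batch to a worker. The per-star
# "analysis" is a placeholder, and a thread pool stands in for the cores
# (or cluster nodes) that would run each batch in the real pipeline.

from concurrent.futures import ThreadPoolExecutor

def analyze_star(star_id):
    """Stand-in for the per-star transit search."""
    return star_id, star_id % 7 == 0   # pretend every 7th star shows a dip

def batches(items, size):
    """Split the list of targets into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def analyze_batch(batch):
    return [analyze_star(s) for s in batch]

stars = list(range(100))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [r for chunk in pool.map(analyze_batch, batches(stars, 10))
               for r in chunk]
candidates = [s for s, hit in results if hit]
```

Because each batch is independent, the same decomposition scales from a local cluster to a machine like Pleiades simply by adding workers.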
"That's why we're developing software to allow us to run the transit search and the subsequent data validation jobs on the Pleiades supercomputer here at Ames Research Center," Jenkins explained. "The processing takes just as long on Pleiades per task but there are so many more cores and the data processing is so easily parallelizable that we can take advantage of the much higher number of cores that Pleiades has to offer."
In some special cases, individual candidates require significant computational resources all on their own. When that's the case, they are outsourced to Guillermo Torres at the Harvard-Smithsonian Center for Astrophysics. That's what happened with Kepler-11g, the outermost of the six planets in the system illustrated above. Its distance from the star made it particularly difficult to determine whether it was a real planet and not some other phenomenon - a false positive.
"We needed to explore roughly 700 million different false positive scenarios," Torres said. "The software, which I wrote, is plain old Fortran 77, plus a few shell scripts to distribute the 1024 jobs among the processors."
Torres ran the jobs on 1024 Xeon E5472 Harpertown processors for a total of 10,000 CPU hours, ultimately concluding that Kepler-11g is, indeed, a planet. (For more about how he validated Kepler-11g, see the team's recent paper in Nature.)
To the people and beyond
With all the analysis done, the data are stored in the Multimission Archive at STScI (MAST), where they will remain for at least a decade beyond the Kepler Mission's lifetime. The Kepler Data Analysis Working Group reviews each quarter's data and annotates it so that when the astronomical community and the public gain access, they will understand what they are looking at.
"Kepler is a really sensitive instrument and there are instrumental signatures that aren't always perfectly identified and removed from the data," Jenkins explained, adding, "And there are limitations of Kepler's ability to monitor certain classes of stars, such as really, really bright ones."
The data will become publicly accessible according to a schedule the Kepler Project and NASA HQ have agreed upon. For example, the first 44 days of science data were released in June 2010. Likewise, 2 February 2011 marked the early release of 90 days of data, originally scheduled for release in June 2011.
"We wanted to make it available earlier to the astronomical community and the public so that they could have access to it sooner to make proposals for NASA's data analysis programs, and so that we could enlist the community's help in vetting the 1235 planetary candidates we've identified in just the first 120 days of Kepler observations," Jenkins said.
The move to the Pleiades supercomputer and the effort to enlist the community's help are both driven by one fact: our galaxy appears to have a much more plentiful supply of planets than expected, and no one knows why.
"Everybody's scratching their head," Jenkins said, adding, "Nature simply appears to love to make planetary systems in a wide variety of configurations."