Feature - Case Study: Einstein@OSG
For over five years, volunteers have been lending their computers' spare cycles to the Laser Interferometer Gravitational-Wave Observatory (LIGO) and GEO-600 projects via the BOINC application Einstein@Home. Now a new application wrapper, dubbed "Einstein@OSG," brings the application to the Open Science Grid.
Today, although Einstein@OSG has been running for only six months, it is already the top contributor to Einstein@Home, processing about 10 percent of jobs.
"The Grid was perfectly suitable to run an application of this type," said Robert Engel, lead developer and production coordinator for the Einstein@OSG project. "BOINC would benefit from every single CPU that we would provide for it. Increasing the number of CPUs by 1000 really results in 1000 times more science getting done."
Getting Einstein@Home to run on a grid was not without difficulties. Normally, a volunteer would download and install the application. The application would constantly download data, analyze it, and then return the results. In short, each instance of Einstein@Home has a permanent home on a volunteer's computer.
The same process would not work on the Grid. Grid jobs cannot run indefinitely, so each instance of Einstein@OSG was given a time limit.
"Once the time limit is up, the Einstein@Home application exits, followed by the Einstein@OSG application, which will save all results to an external storage location," Engel explained. "The next time Einstein@OSG starts, it likely starts on a different cluster node which may use a different architecture."
Next, the Einstein@OSG application detects changes in the environment, such as the architecture, location, version of software, or network connectivity, and then compiles any missing software 'on-the-fly.' After a final check to verify that all requirements for Einstein@Home are met, the wrapper starts Einstein@Home. The results from the previous run are loaded from the remote storage location, and Einstein@Home picks up where it left off.
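The run, save, and resume cycle described above can be illustrated with a minimal Python sketch. It is not the Einstein@OSG code: the function names, the JSON checkpoint file, and the local directory standing in for the external storage location are all assumptions made for illustration.

```python
import json
import platform
import tempfile
import time
from pathlib import Path


def detect_environment() -> dict:
    """Collect a few facts about the node this run landed on (illustrative only)."""
    return {"arch": platform.machine(), "python": platform.python_version()}


def run_slice(storage: Path, time_limit_s: float, total_units: int) -> dict:
    """Run one time-limited slice of work, resuming from a saved checkpoint."""
    checkpoint = storage / "checkpoint.json"  # hypothetical checkpoint location

    # Load the results of the previous run, if there was one.
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": 0, "last_env": None}

    # Notice whether this run started in a different environment.
    env = detect_environment()
    state["env_changed"] = state["last_env"] is not None and state["last_env"] != env
    state["last_env"] = env

    # Work until the job is finished or the time limit is up.
    deadline = time.monotonic() + time_limit_s
    while state["done"] < total_units and time.monotonic() < deadline:
        time.sleep(0.01)  # stand-in for analyzing one unit of data
        state["done"] += 1

    # Save all results before exiting so the next run can pick up from here.
    checkpoint.write_text(json.dumps(state))
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        storage = Path(tmp)
        while run_slice(storage, time_limit_s=0.05, total_units=20)["done"] < 20:
            pass  # each call stands in for a fresh grid job on some node
        print("work finished across several time-limited runs")
```

In a real deployment the checkpoint would live on storage reachable from every cluster, and a detected environment change would trigger the rebuild step rather than just setting a flag.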
An application on a grid will encounter software and hardware issues much more frequently than a desktop application such as Einstein@Home, according to Engel. This is because grids are much more complex and handle an extremely high volume of jobs.
Because the average Einstein@Home user will only encounter an error every couple of months, it's practical for her to handle the error manually. With Einstein@OSG running on up to 10,000 cores, however, there are errors every couple of minutes. Fixing these manually simply isn't practical, so the Einstein@OSG team eventually automated the process.
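The article does not describe how that automation works internally. As a rough sketch of the general idea, assuming errors can be classified by kind and each known kind has a scripted fix, a supervisor might retry a job and apply the matching fix after each failure. The class, function, and error names below are hypothetical.

```python
from typing import Callable, Dict, List


class JobError(Exception):
    """A job failure tagged with a short, machine-readable kind (hypothetical)."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


def supervise(job: Callable[[], None],
              remedies: Dict[str, Callable[[], None]],
              max_attempts: int = 5) -> List[str]:
    """Run a job, applying a known fix after each recognized failure."""
    log: List[str] = []
    for _ in range(max_attempts):
        try:
            job()
            log.append("ok")
            return log
        except JobError as err:
            fix = remedies.get(err.kind)
            if fix is None:
                # Unknown failure: stop and leave it for a person to look at.
                log.append(f"escalate:{err.kind}")
                return log
            fix()
            log.append(f"fixed:{err.kind}")
    log.append("gave-up")
    return log


def make_flaky_job(failures: List[str]) -> Callable[[], None]:
    """Build a demo job that fails once per listed kind, then succeeds."""
    pending = list(failures)

    def job() -> None:
        if pending:
            raise JobError(pending.pop(0))

    return job


if __name__ == "__main__":
    remedies = {"missing-library": lambda: None, "full-disk": lambda: None}
    print(supervise(make_flaky_job(["missing-library", "full-disk"]), remedies))
```

The point of the design is that only unrecognized failures reach a human; everything with a known remedy is handled without anyone being paged.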
"It was only because of that mechanism that we were able to scale up," Engel said. "A computer never gets tired looking for errors and fixing them, unlike me, who likes to sleep at night and spend time with his family."
Before Engel began work on Einstein@OSG, he was a member of a team led by Thomas Radke at the Max Planck Institute for Gravitational Physics. Radke's team created a wrapper for Einstein@Home compatible with the German Grid Initiative (D-Grid) in 2006. Part of Engel's contribution was the design of a user interface that allows one person to effectively monitor and control thousands of Einstein@Home applications.
"Back then it consisted of a command line tool that would summarize all activities on the Grid on a single terminal page," Engel said. Now the tool records activities and uses that historical data to create error statistics. Those and other statistics are displayed on an internal webpage.
The wrapper created by Radke's team could not simply be repurposed to run on OSG, unfortunately.
"OSG and the German grid are different," Engel said. For example, "in Germany the entire grid depends on Globus."
Engel and his team examined their options for getting Einstein@Home onto OSG, and concluded that the best option was Condor-G, a sort of hybrid of Condor and Globus. But implementing Condor-G would have required a great deal of work, delaying the launch of Einstein@Home on OSG.
That's why Engel's team opted to implement Globus' GRAM first, which took only two weeks of work, before they began work on a Condor-G solution. It's a good thing they kept going, too, because they soon discovered a serious issue with GRAM.
"It doesn't go up in scale very well," Engel said. "If you try to run more than 100 jobs on a given resource, you'll bring down that resource."
Still, given a chance to do things over, Engel would implement GRAM first again, he said. "It meant that for a year, we could run jobs on OSG."
The Condor-G version went live in September 2009, and it has rapidly picked up steam. "On a typical day, we are running between 5000 and 8000 jobs at any time," Engel said. "Before that we were running approximately 500."
-Miriam Boon, iSGTW