Feature - A solid production demonstration of the LCG

Feature - A solid production demonstration of the CMS role in WLCG

Last fall's unplanned shutdown of the Large Hadron Collider was a disappointment for physicists around the world. But for organizers of the computing grid supporting the collider's detectors, it was also an opportunity to keep working hard. For the first two weeks of June, instead of flooding the grid with data from actual particle collisions, experiment collaborators at CERN and remote computing sites in Europe, Asia, and North America joined up to test the ability of the collider's Worldwide LHC Computing Grid (WLCG) to record, transfer and analyze simulated data in a step-by-step "production demonstration."

Scientists conducted a series of challenges, collectively called the Scale Test of the Experimental Program 2009 (STEP09). All four LHC experiments participated in the test. At the CMS experiment, for example (see earlier iSGTW story, "CMS readies network links for LHC data"), scientists first tested the archiving of previously recorded data from CERN to CMS' seven Tier 1 computing sites. There, they checked the Tier 1 central processing power as they shuttled data to Tier 2 sites. Finally, they challenged the full physics analysis capacity of the Tier 2 sites. On 15 June, as the curtains closed on STEP09, Oliver Gutsche, a Fermilab physicist who took part in the effort for the CMS experiment, declared the overall performance "very good."

While the CMS portion of this grid - like the rest of the WLCG - was ready to take data last September, says Gutsche, the test "gave us an opportunity to test parts that could not be tested on the previous schedule." It also showed how the system will function under simultaneous demands from the LHC's three other detectors.

A primary STEP09 goal was testing the tape systems at CERN and Tier 1 computing centers. When the LHC is operating, computers at CERN will need to record - "write to tape" - at least 15 petabytes of data per year. Thanks to this run-through, Gutsche says, "We are confident that CERN could write to tape at the speeds needed" when data from collisions begin pouring in.
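
For a sense of scale, the 15-petabyte figure can be turned into an average write rate. The sketch below is only a back-of-the-envelope illustration, not a CERN or CMS calculation: it assumes decimal petabytes, a 365-day year and writing spread evenly around the clock, and the function name is invented for the example.

```python
# Back-of-the-envelope conversion: yearly data volume -> average write rate.
# Assumptions (not from the article): SI petabytes, 365-day year, even load.
PETABYTE = 10**15                      # bytes in one decimal petabyte
MEGABYTE = 10**6                       # bytes in one decimal megabyte
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def average_rate_mb_per_s(petabytes_per_year: float) -> float:
    """Return the average rate in MB/s needed to write the given yearly volume."""
    bytes_per_second = petabytes_per_year * PETABYTE / SECONDS_PER_YEAR
    return bytes_per_second / MEGABYTE


rate = average_rate_mb_per_s(15)
print(f"Average write rate: {rate:.0f} MB/s")  # roughly 476 MB/s
```

Because the LHC does not run continuously, the rate needed while data are actually being taken would be higher than this yearly average.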

Another key goal was gauging the analysis capabilities of Tier 2 computing centers. CMS aimed to employ 50 percent of the grid's analytical power, and while only an ongoing study can prove it succeeded, Gutsche says the prognosis looks good. During STEP09's 13-day run, Tier 2 centers performed over 900,000 successful analysis jobs.

Still, the test revealed room for improvement.

Remote Operations Center at Fermilab. Image courtesy Fermilab

Previous page: One of the first images from CMS, showing the debris of particles picked up in the detector's calorimeters and muon chambers after the beam was steered into the collimator (tungsten block) at point 5. Image courtesy CERN

Making a good thing better

Operators at CERN and the remote computing sites were forced to log long hours, particularly in the pre-staging process. But their efforts revealed principles that will ease the future automation of those procedures. "Sites are happy because we stressed them, and they learned how to run more efficiently," says Gutsche. "Now they have ideas for what they can improve."

Echoing this observation was Ian Fisk, a CMS collaborator at Fermilab. "We wanted to show we could run on 'non-hero-mode,' " he said. "We want to finish a test saying, 'That was easy. We could run for a year at that level.'"

Jamie Shiers of CERN, who organized the computing tests, including STEP09 (see earlier iSGTW article, "People behind the LHC grid"), said, "Many of the Tier1s, and the Tier0, sustained a load that was artificially high - certainly higher than early data taking - with generally smooth and sustainable operations. But a few sites did not, and this has triggered us to undertake a perhaps overdue analysis of the root causes with a clear desire to fix and retest. We saw significant progress since a year ago."

Shiers added, "For Tier2s, the results were more variable: Monte Carlo production is clearly a largely solved problem. As for analysis, some sites - even very large ones - did extremely well, while others did not. Once again, we need to understand the root causes and fix them. In some cases, this may be hard: there has been a feeling for quite some time that the external network bandwidth for at least some sites is not large enough and that the internal bandwidth all the way to the data is also too small. Most likely they will need major configuration changes."

Beginning in July, CMS scientists will use the grid to analyze cosmic-ray data, which stream into the detector even when the accelerator is off. Then, when the LHC turns on - about mid-autumn, according to CERN's director-general - the real challenge will begin.

Editor's Note: Jamie Shiers will chair a post-analysis of STEP09, to be conducted at CERN on 9-10 July. Registration is still available. More information on STEP09 is available in the coordinated press release.

-Rachel Carr, Fermilab, for iSGTW