Feature - Jamie Shiers on the STEP'09 postmortem
Recently, CERN conducted the STEP'09 test, a full-scale assessment of all the computing resources which will be used to process experiment data from the Large Hadron Collider. Those computing resources include sites all over the world which are organized into different levels, or "tiers."
Jamie Shiers leads the grid support group in the IT department at CERN, and is responsible for the coordination of the Worldwide LHC Computing Grid (WLCG) service. He organized the postmortem for STEP'09 at CERN. iSGTW caught up with him to find out how it went.
iSGTW: What is the STEP'09 postmortem?
Shiers: It's the workshop to review the results from the STEP (Scale Testing for the Experiment Program) test. The big difference between STEP and other tests we've done is that it's supposed to emphasize that we are in full production. It's a test of readiness, oriented at what we need for this year at CERN.
It's not the first postmortem workshop we've held, but this was specifically about STEP'09 and it was bigger in terms of the number of attendees.
iSGTW: What were the aims?
Shiers: What we were trying to do was demonstrate that, from a computing point of view, we are ready. And if we aren't, to understand exactly where there were problems so we can come up with a plan to correct them. It was a very thorough postmortem - we looked at things from the experiment point of view, from a service point of view and from a site point of view.
iSGTW: What were the main findings?
Shiers: We've come up with a list of things to do, which fortunately is rather short and certainly shorter than it could have been. In some areas it looks like we are more ready than a pessimist might have feared, particularly in the Tier-1 layer where reprocessing of the data occurs. The networking for Tier-1 is well under control.
The analysis of the data is mainly done at the Tier-2s, and the results here were less positive. But this is to be expected because the analysis involves many more people and is much more chaotic. The other things you can organize and schedule much more easily, but in Tier-2s you don't know how many users there are, which datasets they have, which order they'll read them. This is something we need to work on.
As we are learning more about grid computing we understand what can be provided sensibly at the generic level and what can only be provided when you are deep inside the computing model.
iSGTW: What was most surprising about it?
Shiers: Probably the most surprising thing is how well we did! Many things could have gone wrong and in fact many things did. Sites had both scheduled and unscheduled downtime. For example, motorway construction in Switzerland took out a large fraction of the dedicated optical network - you don't foresee motorway construction doing this. But despite these things we were able to continue and recover.
Maybe surprising isn't the right word, I would say "pleasing," because we've spent many, many years preparing for the unexpected. I think we've shown we can handle it.
iSGTW: What have you done since the postmortem?
Shiers: We've been working through our list. There are a small number of sites with some issues which are being retested, with largely positive results. Most of the problems with the Tier-1 sites have been addressed, but we have to do more work at the level of Tier-2s.
The target is to have a follow-up by the end of September. At the EGEE'09 conference we will have a session to work through the main improvements since the STEP'09 postmortem workshop, so hopefully by then the main issues will have been proven to be solved in production.
iSGTW: How confident are you about the LHC restart?
Shiers: In terms of computing readiness, we demonstrated much more than we've ever done before and in a more sustainable fashion. Most sites said that for them this was business as usual. I would say for everything we know, we are well prepared, but we are also well prepared for problems we guess might happen. We found that, even for problems which are seemingly insurmountable at first, a work-around solution to keep us going can be developed over a couple of hours or days.
iSGTW: What excites you most about it all?
Shiers: We're looking forward to doing something as challenging with real data.
Bonus - Jamie Shiers also talks about the importance of grid computing, people, collaboration, and barbecues for the LHC. An audio clip of him is available on the Gridcast website.
-Seth Bell, iSGTW