iSGTW Feature - Grids don't take vacation

A big thanks to those who are giving up their time over the end of year break to keep grids all over the world on-line.
Image adapted from

While the rest of us are on vacation, enjoying a snowy (or scorching) end of year break, the grids we depend on keep crunching through jobs, sending a stream of data around the world.

In fact, the end of year break sees many grid services working harder than ever. How does it happen?

iSGTW took time off from our own hectic holiday schedule to see what goes on while the users are away.

Nicholas Thackray - CERN grid services

Our users rely on continual service, even when they're away on vacation. If this stops working we're going to have a queue of people at the door, all wanting to beat us over the head with a stick.

Previously, things were so new that it was difficult to have an emergency presence for everything. Now we can operate with 24-hour coverage over the entire Christmas break.

The first time we tried remote service we thought that Murphy's Law would play up and the entire center would go offline. We always try to plan for the worst-case scenario, but unfortunately you can't do a dry run of your response to these kinds of failures in advance. You can't just turn the computer center off as part of a test.

We do get fewer problems over the break: all the people who normally fiddle with things are away on holiday.

Coverage is provided by a rotation of on-duty managers, responsible for checking for problems in their domain and delivering trouble tickets to the relevant people. Sometimes two or three groups can be involved.

Our users just want to submit data or move data, which means that individual bits can fail for a short period, or be slightly degraded. Some services are more crucial than others; some are more reliable than others; some are used more often than others. The important thing is that the results are there in the end.

People tend to save their compute-intensive jobs for the holiday period. We saw an initial spike in the number of jobs, which trailed off as the holiday progressed. The larger Virtual Organizations also have job submission robots, which keep things going over Christmas.

Through the efforts and good will of our volunteers we can continue to offer this service. Thank you to everyone who is giving up their time over the festive season.


Frank Würthwein: "We're now at the point that things are robust enough to work. You can go on vacation for a week, and if you're lucky, when you come back, things are still working."
Image courtesy of Graham Ramsay

Frank Würthwein - Open Science Grid applications coordinator

Our essential grid services are generally available 24/7. Otherwise it's all very heterogeneous. Some places have guaranteed operations; others go home for different amounts of time.

Whether services stay up over the break depends a lot on what fails, and on whether disaster strikes, such as a power outage or a typhoon. But at most sites things go smoothly and nothing happens. If you left the infrastructure by itself, any week of the year, it would by and large just work.

Many people go beyond the call of duty to keep their grid services running over the break. People are sufficiently excited about what they do that they continue to keep an eye on it, even when they're on vacation. We see this all the time, across many, many sites. People are willing to check on things and, if needed, to put in the extra time to fix them. We stay up either because we're cruising, or because people are fixing problems out of their own good will.

A lot of sites around the world are small enough that when one person goes on vacation, some part of the service will lose its expert. It's an all-year problem. Thanksgiving is an even bigger deal. We have four days when virtually nothing happens. But despite the fact that everyone goes on holiday, things work.

The vacation does provide an opportunity for people to access much more computing time than they would otherwise get. We have a winter conference season which starts around February, so you tend to get a few people who feel the pressure to get results out in time. People who have extraordinarily large computing needs tend to use the vacation time to get their jobs through the queues.