iSGTW Feature - "CIC on duty": smooth operations behind the scenes

Hélène Cordier presents at the Regional Operations Center Managers' Meeting during EGEE '07, speaking on developments in "CIC on Duty" operations.
Image courtesy of Toth Csilla

As a grid user, you want to send jobs to your grid, and you want results back.

Hélène Cordier is one of the hidden coordinators working behind the scenes to ensure this happens as smoothly and as often as possible.

An integral part of operations in the Enabling Grids for E-sciencE project, Cordier is deputy of the French Regional Operations Center and also deals with French operations for the Large Hadron Collider Computing Grid.


Cordier has been the driving force behind "CIC-On-Duty," where CIC stands for Core Infrastructure Center.

Dubbed COD, Cordier's scheme helps ensure smooth operation of the EGEE grid. All EGEE federations contribute one COD team of two to three people.

"Every week a different COD team is responsible for the daily monitoring of project-wide core grid services," explains Cordier. "This includes monitoring a repertoire of site information, including test results sent to sites every day."

The COD scheme has not only been recognized as essential in stabilizing EGEE operations but has also proven scalable: while the number of monitored sites has tripled in the three years since COD was created, the number of trouble tickets opened against monitoring failures has not changed.

"Communication is crucial"

"COD is more than just a simple rotation of operators," Cordier says.

"We've set up a number of working groups focused on important issues, such as integrating monitoring tools into a COD dashboard; developing failover mechanisms; improving our tests and tools; and standardizing work procedures. This latter issue is reflected in the constant update of the relevant Operations Procedure Manual."

One of the top three demos at EGEE '07, the "CIC-On-Duty" dashboard and operations portal drew a lot of attention. They integrate several monitoring and communication tools to provide a seamless workflow between sites, and were developed and are hosted by the operations portal team in Lyon.
Images courtesy of Toth Csilla

"We organize quarterly meetings involving 25 to 35 operators, administrators and developers from all federations, which means we've been able to build an adaptive scheme that is still continuously evolving."

"Even if we do report and write proposals in the aftermath, the meetings are more like brainstorming sessions and we try to keep the management out of the way," she laughs.

For Cordier, regular meetings are also indispensable for sound relationships between the partners.

"You still need a lot of goodwill to cope with 'extraordinary' situations, and you only get this when people do actually know each other."

"Yes, there are differences in the nationalities and work practices of people in different federations, but these differences are tiny compared to the challenge of just getting people to work together," she says.

- Hannelore Hämmerle, EGEE