The message "Grid Job Failed" may soon become a distant memory if developers working to improve the system have their way. Grid job errors can be a nuisance for particle physicists. Luckily, the average job failure rate is only 5% (excluding human submission errors), and there are ways a user can decrease this further.
Improving the grid job success rate is essential to particle physicists, who use grids to analyze proton-proton collisions at the Large Hadron Collider (LHC). The LHC will double its energy level in the next two years to 14 TeV (teraelectronvolts), and the luminosity will be increased by more than a factor of a hundred. This means that there will be a lot more data for computers on the Worldwide LHC Computing Grid (WLCG) to rummage through when searching for interesting events such as top quarks, which only exist for a billionth of a second. Top quarks are one of six types, or "flavors," of quark, and quarks are the building blocks of all known matter. With so much data to sift and such short-lived targets, finding exactly what you are looking for can be difficult.
"Everything that we see around us, on a daily basis, is mostly made of up and down quarks," commented Sarah Livermore, a PhD researcher from Oxford University at CERN. Her job is akin to that of a traffic-accident investigator; she analyzes the aftermath of collisions to find out more about the particles involved.
Step by step
In order to find quarks in the mass of information, a user must submit the raw data to the WLCG for processing. There are three main steps to submitting a grid job:
(1) Identify your input files. The ATLAS Distributed Data Management system (DDM) logs the location of a user's input files, so one just has to locate the file names and include them in the job submission.
(2) Select which files are needed to enable the job to run. Users must ensure that a job has the correct shared libraries and executable files. Occasionally, it may be necessary to check that the software the job requires is installed at the remote site where the job will run. "This stage is the trickiest," stated Livermore.
(3) Determine the output file name and ensure the output is stored at the same location as the original data files. As long as the username is included in the output file name, there should be no errors. Importantly, these output files are only kept on temporary scratch disks for a month and are then deleted.
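The three steps above can be pictured as a simple pre-submission checklist. The Python sketch below is only an illustration: the GridJob class, its field names, and the check_job function are hypothetical and do not correspond to any real ATLAS or WLCG submission tool. It assumes, as step (3) suggests, that the output file name should contain the username.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GridJob:
    """Hypothetical job description; not the schema of any real grid tool."""
    username: str
    input_files: List[str]    # step (1): input file names
    support_files: List[str]  # step (2): executables and shared libraries
    output_name: str          # step (3): name of the output file


def check_job(job: GridJob) -> List[str]:
    """Return a list of problems found; an empty list means the job looks ready."""
    problems = []
    if not job.input_files:
        problems.append("no input files listed")
    if not job.support_files:
        problems.append("no executables or shared libraries listed")
    # Assumption taken from step (3): the username must appear in the output name.
    if job.username not in job.output_name:
        problems.append("output name does not contain the username")
    return problems
```

A user could run such a check before submitting and fix whatever it reports. It does not cover the harder part of step (2), confirming that the required software is installed at the remote site.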
Where do these errors come from? All of the grid sites are regional computer centers, such as those at universities. Although it happens infrequently, regional managers do take computer sites temporarily offline for maintenance, causing the dreaded "Grid Job Failed" message. If this happens, the only options for users are to wait for the site to come back online or to resend their job to another site.
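Resending a job to another site when the first one is offline can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the SiteOffline exception, the submit_with_fallback function, and the submit callable passed into it are hypothetical stand-ins, not part of any real grid client.

```python
from typing import Callable, Sequence


class SiteOffline(Exception):
    """Hypothetical error raised when a site is down for maintenance."""


def submit_with_fallback(job: str, sites: Sequence[str],
                         submit: Callable[[str, str], str]) -> str:
    """Try each site in order and return the first successful result."""
    for site in sites:
        try:
            return submit(job, site)
        except SiteOffline:
            continue  # this site is offline, so move on to the next one
    # Every listed site was offline; the caller has to wait and try again later.
    raise RuntimeError("Grid Job Failed: no listed site is online")
```

The order of the site list matters, because the job goes to the first site that accepts it. Waiting for a site to come back online, the other option mentioned above, is left to the caller.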
It's how you pick yourself up
Ironically, an individual's grid job problems can actually be beneficial in the long run to the user community. For example, the ATLAS grid support email system can tell users a lot about grid job nuances and how problems are solved, by allowing them to see how others have dealt with a situation.
The team is also looking into whether a Graphical User Interface (GUI) would allow users to submit their jobs more easily. There are pluses and minuses to this approach, according to Graeme Stewart, ATLAS Distributed Development Coordinator. He has found that users prefer command-line interfaces to GUIs because they can tailor them to their needs more easily. The support structure is "well-manned," said Mark Slater, a research fellow with GridPP. "After all, the whole point of the grid is to get physics done," he said.