
All in good time

The message "Grid Job Failed" may soon become a distant memory if developers working to improve the system have their way. Grid job errors can be a nuisance to particle physicists. Luckily, the average job failure rate is only 5% (excluding human submission error), and there are ways a user can decrease this further.

ATLAS detector: Installing the ATLAS calorimeter. The calorimeter measures the energies of particles produced when protons collide at the center of the detector. Here the calorimeter is shown before it was moved into the middle of the detector. The eight toroidal magnets can be seen within ATLAS. A scientist stands at the bottom of the photo for scale. November 2005. Image courtesy ATLAS CERN.

Improving the grid job success rate is essential to particle physicists, who use grids to analyze proton-proton collisions at the Large Hadron Collider (LHC). The LHC will double its energy level in the next two years to 14 TeV (teraelectronvolts), and the luminosity will be increased by more than a factor of a hundred. This means that there will be far more data for computers on the Worldwide LHC Computing Grid (WLCG) to sift through when searching for interesting events such as those involving top quarks, which exist for far less than a billionth of a second. Top quarks are one of six types, or "flavors," of quarks, and quarks are among the building blocks of all known matter. With so much data to search, finding exactly what you are looking for can be difficult.

"Everything that we see around us, on a daily basis, is mostly made of up and down quarks," commented Sarah Livermore, a PhD researcher from Oxford University at CERN. Her job is akin to that of a traffic-accident investigator; she analyzes the aftermath of collisions to find out more about the particles involved.

Step by step

In order to find quarks in the mass of information, a user must submit the raw data to the WLCG for processing. There are three main steps to submitting a grid job:

(1) Identify your input files. The ATLAS Distributed Data Management system (DDM) logs the location of a user's input files, so one just has to locate the file names and include them in the job submission.

(2) Select which files are needed to enable the job to run. Users must ensure that a job has the correct shared libraries and executable files. Occasionally, it may be necessary to check that the software the job requires is installed at the remote site where the job will be going. "This stage is the trickiest," stated Livermore.

(3) Determine the output file name and ensure the output is stored at the same location as the original data files. As long as the username is included in the output file name, there should be no errors. Importantly, these output files are only placed on temporary scratch disks for a month and then deleted.
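The three steps above can be sketched as a pre-submission checklist. This is only an illustration in Python: the `GridJobSpec` class, its field names, and the 30-day reading of "a month" are assumptions made for this example and do not reflect the format of any real ATLAS or WLCG tool.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List

# Assumption: "a month" on the scratch disk is approximated as 30 days.
SCRATCH_RETENTION_DAYS = 30


@dataclass
class GridJobSpec:
    """Hypothetical job description, invented for this sketch."""
    username: str
    input_files: List[str] = field(default_factory=list)    # step 1
    support_files: List[str] = field(default_factory=list)  # step 2: libraries, executables
    executable: str = ""                                     # step 2
    output_name: str = ""                                    # step 3

    def problems(self) -> List[str]:
        """Return a list of problems; an empty list means the checks passed."""
        found = []
        if not self.input_files:
            found.append("no input files listed")
        if not self.executable:
            found.append("no executable specified")
        elif self.executable not in self.support_files:
            found.append("executable is not among the files shipped with the job")
        if self.username not in self.output_name:
            found.append("output file name does not include the username")
        return found


def scratch_expiry(created: date) -> date:
    """Estimate when an output file on scratch disk would be deleted."""
    return created + timedelta(days=SCRATCH_RETENTION_DAYS)
```

A user could call `problems()` before submitting and copy the output elsewhere before the date returned by `scratch_expiry()`. Checking that software is installed at the remote site, which Livermore describes as the trickiest part, is not covered here.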

Where do these errors come from? All of the grid sites are regional computer centers, such as those at universities. Although it happens infrequently, regional managers do take computer sites temporarily offline for maintenance, causing the dreaded "Grid Job Failed" message. If this happens, the only solution for users is to wait for the site to come back online or resubmit their job to another site.
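The "resend to another site" option can be pictured as a simple failover loop. The sketch below is hypothetical: `SiteOffline`, `submit_with_failover`, and the `submit` callback are names invented for this example, and a real submission tool may handle site selection quite differently.

```python
from typing import Callable, Iterable, Optional


class SiteOffline(Exception):
    """Raised by the hypothetical submit callback when a site is down."""


def submit_with_failover(
    job: str,
    sites: Iterable[str],
    submit: Callable[[str, str], str],
) -> Optional[str]:
    """Try each site in order and return the first successful job ID.

    Returns None if every site is offline, in which case the user
    has to wait for a site to come back online.
    """
    for site in sites:
        try:
            return submit(job, site)
        except SiteOffline:
            continue  # this site is under maintenance; try the next one
    return None
```

Only the offline case is caught, so any other error from the callback still surfaces to the user instead of being silently retried elsewhere.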

It's how you pick yourself up

GridPP - Example of a grid site. In this case it is the UK particle physics computing grid: GridPP. Image courtesy GridPP.

Ironically, an individual's grid job problems can actually be beneficial in the long run to the user community. For example, the ATLAS grid support email system can tell users a lot about grid job nuances and how problems are solved, by allowing them to see how others have dealt with a situation.

The team is also looking into whether a Graphical User Interface (GUI) would allow users to submit their jobs more easily. There are pluses and minuses to this approach, according to Graeme Stewart, ATLAS Distributed Development Coordinator. He has found that users prefer command line interfaces to GUIs because they can tailor them to their needs more easily. The support structure is "well-manned," said Mark Slater, a research fellow with GridPP. "After all, the whole point of the grid is to get physics done," he said.


Copyright © 2022 Science Node ™  |  Privacy Notice  |  Sitemap
