iSGTW Opinion - Avalanche warning: the new challenges of the data grid

Available disk space at Canada's TRIUMF Tier-1 center was rapidly consumed, with users filling a year's allocation in just three months. More storage was again made available in November to meet demand.
Image courtesy of TRIUMF and WLCG

The Large Hadron Collider at CERN will produce many petabytes of data every year; other applications in astronomy and genomics will also generate, and rely on, massive amounts of data.

How will we manage these data? Where will we store them? The new "data grid" brings with it certain responsibilities that must be borne by the users.

Managing storage for a large user base is not a new problem, but grids greatly amplify the scale.

Allocating storage resources is more complicated than managing compute clusters. When a compute job finishes, its CPUs are simply returned to the common pool; the job's output, however, must be stored and cannot be deleted indiscriminately.

Laws, limits, allocations and restraint

Even small storage quotas, allocated to thousands of users, add up over time to cause serious problems.

One solution is to put time limits on storage allocation, but this does not help users who want to store data indefinitely.

Some of the staff at the TRIUMF Tier-1; the author is second from the right.
Image courtesy of TRIUMF

Another solution is to charge for storage space, but that is not in the spirit of grids. Note that this is more a management issue (of people, not of systems) than a technical one.

Take a look at your desktop

Many tools exist to manage large pools of data, but none of them indicates when the data can safely be deleted. What is a facility manager to do when faced with hundreds of thousands (if not millions) of files owned by thousands of users, files to which they have no immediate access? Users are much less likely to worry about disk usage at a site halfway around the world than at their own university IT centre. We have all been guilty of not cleaning up our desktop; we just go out and buy a larger disk!

Policies must be put in place to keep the explosion of data in check. I fear that this can only be done by making most storage space of the "scratch" or "temporary" variety, which will put restrictions on users. Exceptions can be made for large projects such as the LHC, which aggregate the needs of a multitude of users, but the number of these exceptions must be kept reasonable, or system administrators risk spending all of their time contacting users to clear space.
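A "scratch" or "temporary" policy of this kind is typically enforced by a periodic sweep that flags files older than a fixed retention period. The sketch below is illustrative only; the 30-day threshold and the idea of a single scratch directory are assumptions for the example, not a description of any particular site's policy.

```python
import os
import time

MAX_AGE_DAYS = 30  # illustrative retention period for scratch space


def expired_files(root, max_age_days=MAX_AGE_DAYS, now=None):
    """Return paths under `root` not modified within `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale


# A real sweep would then delete (or archive) the stale files,
# ideally after warning their owners:
# for path in expired_files("/scratch"):
#     os.remove(path)
```

In practice sites would layer notifications and grace periods on top of such a sweep, precisely so that administrators do not spend all their time contacting users by hand.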

One year filled in three months

An example of the problem can be found at TRIUMF, Canada's national laboratory for particle and nuclear physics, which houses one of ten Tier-1 centers for the ATLAS experiment at the LHC. After an expansion of the centre in August 2007, one year's storage allocation was filled in about three months!

The main point of this comment is to convince users that it is their responsibility to control their use of storage on the grid. Resource providers and system administrators supply the tools for data management (directory structures, file movement and so on), but users must act responsibly when managing their storage allocation. It is not reasonable for users to wait for an email asking them to delete files because the center is running out of space.

Again, the scale of the problem is amplified by the grid. Whatever solution is adopted, it will inevitably require all of us to exercise restraint.

- Michel C. Vetterli, Simon Fraser University/TRIUMF