
Feature: Back to Basics - Data Management

Tracey Wilson is a program manager for Avetec's HPC Research Division, DICE, the Data Intensive Computing Environment. DICE is a non-profit program that serves HPC and IT data centers in commerce, government and academia by providing independent third-party evaluations of data management practices on its geographically distributed test bed.

From smartphones to weather forecasts, data drives our world.

In the simplest terms, data is information, which can be found in many forms. Back when computer scientists used punch cards to store data, keeping them in order was a way of controlling or managing data.

We used to talk about how incredibly large a gigabyte of data was. Now terabytes and petabytes are the norm in scientific circles, and we will soon be talking about exabytes just as casually. And as the amount of data we generate grows, so does our need for data management.

Data management - the effective control of data throughout its entire lifecycle - is getting more important every day as systems become increasingly complex, software demands increase, and the size and number of data files generated continues to grow exponentially.

File Systems in Data Management
When choosing a storage system for electronic data, you have to choose both a physical medium and a logical medium. The physical medium could be a disk, a hard drive, or tape. The logical medium is the file system itself, and there are many types. The most common types of file systems are:

  • Local file systems used on workstations or servers - Windows NTFS or Linux GFS
  • Shared file systems for large data repositories or archives - Sun SAM-QFS, Quantum StorNext, or SGI DMF
  • Distributed file systems - SGI CXFS
  • Parallel file systems - Lustre, IBM GPFS, or Panasas PanFS

Different mediums lend themselves to different purposes, based on their characteristics. For example, a very fast parallel file system is something you'd use for fast access by several clients at once and is typically seen on high-end or high performance computing systems.

On the other end of the spectrum, archive systems are used for longer-term storage and vary in performance. Today, they are typically divided into tiers by storage duration. For example, tiers may include:

  • Tier 0 - Solid state disks. These provide the highest level of performance and are most appropriate for data that needs immediate and regular access.
  • Tier 1 - Hard drive disks (SAS) and Fibre Channel disks. Data typically resides here for a limited period - around 45 days - before migrating to a lower tier.
  • Tier 2 - Higher performance serial ATA (SATA) drives. This is a cheaper disk storage pool where data may stay for several months.
  • Tier 3 - Lower performance SATA. Data may stay here longer than several months.
  • Tier 4 - Long-term storage on tape or a virtual tape library with high capacity. This is the longest-term option, where data is written once and rarely read.
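A tiering policy like the one above can be sketched as a simple rule that maps time-since-last-access to a tier. This is a minimal illustration, not a real hierarchical storage manager; the threshold values are hypothetical examples loosely based on the residence times described above.

```python
from datetime import datetime, timedelta

# Hypothetical residence thresholds, loosely following the tiers above;
# real policies are site-specific and usually far more elaborate.
TIER_THRESHOLDS = [
    (timedelta(days=1), 0),     # accessed within a day  -> solid state
    (timedelta(days=45), 1),    # within ~45 days        -> SAS / Fibre Channel
    (timedelta(days=180), 2),   # within ~6 months       -> fast SATA
    (timedelta(days=365), 3),   # within a year          -> slow SATA
]

def assign_tier(last_access: datetime, now: datetime) -> int:
    """Pick a storage tier from the time since a file was last accessed."""
    age = now - last_access
    for threshold, tier in TIER_THRESHOLDS:
        if age <= threshold:
            return tier
    return 4  # older than a year -> tape / virtual tape library
```

A production system would weigh more than recency (file size, project policy, cost per gigabyte), but the structure of the decision is the same.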

Locality, Movement, Integrity, and Manipulation
There are four main areas within data management that are of major concern:

Data Locality - Locality is about the ability to archive and access data from different locations even though data may be stored at one primary location; this makes large remote scientific data sets as easy to access as if they were in a local file system. Questions that administrators have regarding locality include: Is data stored in a way that is logical and accessible? How is the data viewed from a local system?

Data Movement - This is the ability to move data efficiently and reliably to geographically dispersed systems and locations. What path or mechanism can effectively move data between locations? Can the data be sent in one large transfer stream, or should it be broken up and sent in several streams in parallel? Do I need to worry about encryption, and how do I verify the integrity of the data once it arrives?
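The "several streams in parallel" idea can be sketched with a local file copy that splits the data into byte ranges and copies each range on its own worker. This is only an illustration of the chunking pattern; real wide-area tools (GridFTP-style transfers, for example) add retries, encryption, and network protocols on top of it. The function names here are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def copy_chunk(src_path, dst_path, offset, length):
    """Copy one byte range; each worker handles an independent slice."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        dst.write(src.read(length))

def parallel_copy(src_path, dst_path, streams=4):
    """Split a file into `streams` ranges and copy them concurrently."""
    size = os.path.getsize(src_path)
    # Pre-size the destination so workers can write at any offset.
    with open(dst_path, "wb") as dst:
        dst.truncate(size)
    chunk = -(-size // streams)  # ceiling division
    with ThreadPoolExecutor(max_workers=streams) as pool:
        for offset in range(0, size, chunk):
            pool.submit(copy_chunk, src_path, dst_path, offset,
                        min(chunk, size - offset))
```

Whether parallel streams actually help depends on where the bottleneck is: they can hide per-stream latency on a long network path, but on a single local disk they may just add seek overhead.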

Data Integrity - The ability to maintain the quality and security of data during transfer, access, or storage is paramount. Once you move, access, or copy data, is it the same as before? Has it been corrupted or changed beyond your ability to recover it?

Data Manipulation - The capability to change, search and manage data in local and distributed environments allows users to get more use out of their data. There may be better ways to search and modify data using metadata (the information that describes the data itself).
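Searching by metadata rather than by reading files can be sketched with a small in-memory catalogue. In a real archive this role is played by the file system's metadata server or an external database; the entries, field names, and paths below are entirely hypothetical.

```python
# Hypothetical metadata catalogue: each entry describes a file without
# touching its contents.
catalog = [
    {"path": "/archive/run_001.h5", "owner": "twilson", "size_gb": 120,
     "tags": {"climate", "2009"}},
    {"path": "/archive/run_002.h5", "owner": "twilson", "size_gb": 87,
     "tags": {"climate", "2010"}},
    {"path": "/archive/cfd_mesh.dat", "owner": "hpcops", "size_gb": 410,
     "tags": {"cfd"}},
]

def find(catalog, **criteria):
    """Return the paths of entries whose metadata satisfies every criterion."""
    def matches(entry):
        for key, wanted in criteria.items():
            value = entry.get(key)
            if isinstance(value, set):       # tag sets: membership test
                if wanted not in value:
                    return False
            elif value != wanted:            # scalar fields: equality test
                return False
        return True
    return [e["path"] for e in catalog if matches(e)]

print(find(catalog, owner="twilson", tags="climate"))
# ['/archive/run_001.h5', '/archive/run_002.h5']
```

The point is that a query like "all of twilson's climate files" completes without opening a single data file, which is what makes metadata-driven management scale to petabyte archives.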

Day-to-Day Aspects of Data Management
The daily management of data may not be outwardly recognized as the key to an experiment's success, but effective control of this precious commodity requires careful administration.

Responsibilities vary greatly among data center employees. Some administer and maintain the integrity of file systems so that users will be able to access their data. Archive administrators do the same, and also have to determine how fast that data is growing and stay abreast of data locality, movement and storage trends. In addition, they must identify bottlenecks and decide how to purge data or expand infrastructure.

For example, an organization doing remote site transfers may be restricted to conducting backups at night. If local backups are the norm, the backup processes may affect others' ability to work by restricting access to files. Deciding how to balance these needs and concerns is part of an archive administrator's job.

Good Data Management Means Better Use of Information
Good data management helps data centers maintain better control. Users can access and move data more effectively when structure, data flow and size are correctly set up for performance. Benchmarking networks and file systems can help diagnose problems and provide keen insight into optimum performance tuning parameters.
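A very rough file-system benchmark can be as simple as timing a sequential write and reporting the rate. The sketch below is a minimal illustration, not a substitute for purpose-built tools (which repeat runs, defeat caches, and vary block sizes and concurrency); the function name and defaults are hypothetical.

```python
import os
import time

def write_throughput(path, size_mb=64, block_size=1 << 20):
    """Time a sequential write of `size_mb` megabytes and return MB/s.

    fsync is included so the OS page cache does not flatter the result,
    though a single run is still only a coarse indicator.
    """
    block = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb * (1 << 20) // block_size):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return size_mb / elapsed
```

Running this against different mount points (local disk vs. a parallel file system, for instance) gives a first, crude data point for the performance-tuning discussion above.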

As long as the race continues to increase computing power, data management will hold more and more of our data centers' attention as we seek to gain more control over the information we gather.

For more about data storage, see the iSGTW Nature Networks Forum discussion about magnetic tape.
