From improving seismology code to simulating supernovae, tornado-producing supercells, and the chemical structure of HIV, the Blue Waters supercomputer at the US National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign has been making history since its initial runs in late 2012 and early 2013. There was one thing, however, at which Blue Waters was not adept: sharing data. The science and engineering community made this point last summer at the 2014 Blue Waters Symposium.
Massive supercomputing projects involve massive amounts of data. Sharing this data with the broader scientific community is of critical importance not only for advancing science, but also for collaboration, innovation, and economic growth. For the Blue Waters project, the future has arrived with the new prototype Blue Waters Data Sharing Service (DSS). The DSS is now available for Blue Waters users to share data sets with their research community or the broader public.
There are two classes of sharing based on the needs of the partners and data: active data sharing for projects with current allocations on Blue Waters, and community sharing for data produced by prior projects. Users from each group can share data using Globus sharing capabilities or a web service interface. Each interface has specific requirements that determine which approach should be used, and each class of sharing also has unique requirements.
“Projects (PIs) can submit a service request, which is really just a means for us to help the teams better prepare their data for distribution,” says Jason Alt, a programmer at the NCSA who (along with colleague Mark Klein) implemented the DSS. The service will allow users to share their research data with colleagues who don't have access to Blue Waters. Former users can also share their data after their allocation ends; in this case, the data owner is required to obtain a Digital Object Identifier (DOI) for the data set.
Current and former partners have two options for sharing data: Globus for large data sets (larger than about 4 GB and/or more than 100 files), and a web browser interface for smaller data sets. Globus also lets researchers limit access to their data, whereas data shared through the web interface is open to anyone. Through the web interface, science teams can also create pages that explain the data, show results, or share other useful details, as on any other web page.
Both methods are read-only and require the data owner to provide documentation and metadata to make the data more useful and self-contained. The shared data will also count toward the science team's storage limit on the Blue Waters sub-storage systems.
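As a rough illustration, the guideline for choosing between the two interfaces can be written as a small decision rule. This is only a sketch: the `DataSet` type, the `choose_sharing_method` function, and the reading of "about 4 GB" as 4 × 10^9 bytes are assumptions made for the example, not part of the DSS itself.

```python
from dataclasses import dataclass

# Approximate cutoffs from the guideline; the exact values are assumptions.
GLOBUS_SIZE_THRESHOLD_BYTES = 4 * 10**9   # "about 4 GB", read here as decimal gigabytes
GLOBUS_FILE_COUNT_THRESHOLD = 100         # "more than 100 files"


@dataclass
class DataSet:
    """Hypothetical description of a data set a team wants to share."""
    name: str
    total_bytes: int
    file_count: int
    restrict_access: bool = False  # True if only selected people may read it


def choose_sharing_method(ds: DataSet) -> str:
    """Return "globus" or "web" following the article's rough guideline."""
    # Web sharing is open to anyone, so restricted data has to go through Globus.
    if ds.restrict_access:
        return "globus"
    # Large or many-file data sets are better served by Globus.
    if (ds.total_bytes > GLOBUS_SIZE_THRESHOLD_BYTES
            or ds.file_count > GLOBUS_FILE_COUNT_THRESHOLD):
        return "globus"
    return "web"
```

For example, a 5 GB public data set with ten files would be routed to Globus, while a few megabytes of public plots would go to the web interface.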
A bridge between GridFTP and high-performance storage systems
The NCSA has also developed new tools for transferring data between machines, specifically when source and target machines use different transfer protocols. “With this enabling piece of software, existing and new HPSS installations can join with grid infrastructure as first-class citizens and leverage emerging transfer solutions such as Globus,” says Alt, developer of the HPSS DSI (data storage interface).
The HPSS DSI translates communications between the GridFTP protocol and the HPSS API (application programming interface, the set of commands HPSS understands). The new Blue Waters DSS is only one of myriad possible implementations of the HPSS DSI. More information on the HPSS DSI and the code base is available on GitHub.
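The idea of a translation layer can be pictured with a toy adapter. The sketch below is not the HPSS DSI and does not use the real HPSS API: `InMemoryStorage` and `StorageInterfaceAdapter` are hypothetical names invented for illustration. Only the `STOR` and `RETR` verbs are borrowed from FTP, the protocol GridFTP extends.

```python
class InMemoryStorage:
    """Hypothetical stand-in for a storage system's native API."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put_object(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get_object(self, key: str) -> bytes:
        return self._objects[key]


class StorageInterfaceAdapter:
    """Toy translation layer: FTP-style verbs in, backend API calls out."""

    def __init__(self, backend: InMemoryStorage) -> None:
        self._backend = backend

    def handle(self, command: str, path: str, payload: bytes = b"") -> bytes:
        verb = command.upper()
        if verb == "STOR":   # client uploads a file
            self._backend.put_object(path, payload)
            return b""
        if verb == "RETR":   # client downloads a file
            return self._backend.get_object(path)
        raise ValueError(f"unsupported command: {command}")


adapter = StorageInterfaceAdapter(InMemoryStorage())
adapter.handle("STOR", "/project/results.dat", b"simulation output")
data = adapter.handle("RETR", "/project/results.dat")
```

The point of the pattern is that the transfer side never needs to know how the storage system is addressed; swapping the backend only requires a new adapter.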