The grid and the cloud are dead! Long live big, open data!
This is a sentiment which has come across strongly at a number of recent meetings. However, in moving from one paradigm to another, it is vital that we do not discard the experience and technology gained previously, says European Grid Infrastructure director Steven Newhouse.
At an exploratory meeting of the Research Data Alliance held earlier this month in Washington DC, US, some 120 people from around the world gathered to discuss the factors currently limiting the amount of data that can be shared by researchers across disciplines, institutions and countries. Technical interoperability issues were not the only factors discussed, however. The meeting also focused more broadly on the community-level barriers to a collaborative global data infrastructure. Much emphasis was placed on ensuring that researchers have open access to, and are thus able to exploit, big data sets generated by large national and multinational research institutions. A prime example was presented at the meeting by Chris Greer of the US National Institute of Standards and Technology, who explained how the 2008 release of NASA Landsat images for unrestricted use has since generated an estimated $935M per year in value for the environmental management industry.
However, the main challenge in building any infrastructure is to strike a balance between common needs and those specific to a particular scientific domain. Each research community will have developed its own vocabulary, its own metadata descriptions and its own data-access services to expose its underlying data models. Consequently, finding the common mechanisms needed to allow different research communities to collaborate and share data is of paramount importance. While a solution to this issue is still far from clear, some consensus is emerging. For instance, there is a widely acknowledged need for persistent data identifiers that enable individual data sets and data objects to be described, discovered, located and tracked in their usage. Authentication and authorization are also generally deemed important when discussing open science data: funding bodies like to know how the generated data is being used, and access controls make it possible to restrict certain data to members of a particular collaboration for limited periods of time.
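To make the identifier idea concrete, the minimal sketch below (in Python) resolves a persistent identifier, in this case a DOI, to machine-readable metadata using standard HTTP content negotiation against the doi.org resolver. The specific DOI shown is only a placeholder, and real data services may expose richer metadata than this illustration suggests.

```python
# Illustrative only: resolve a persistent identifier (here, a DOI) to
# machine-readable metadata via HTTP content negotiation at doi.org.
# The DOI passed in below is a placeholder, not a real data set.
import requests

def resolve_doi_metadata(doi: str) -> dict:
    """Fetch citation-style metadata for a DOI via content negotiation."""
    url = f"https://doi.org/{doi}"
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Replace the placeholder with the DOI of an actual data set.
    metadata = resolve_doi_metadata("10.1234/example-dataset")
    print(metadata.get("title"), metadata.get("publisher"))
```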
Leif Laaksonen from the Finnish IT Center for Science described how the European Data Infrastructure (EUDAT) project is examining some of these technical issues, with the recently started International Collaboration on Research Data Infrastructure (iCORDI) initiative now providing coordination and input for international activities, such as the Research Data Alliance (RDA). Equally, Andrew Treloar related how the Australian National Data Service is working to help scientists transform data (typically unmanaged, disconnected, invisible and single-use) into structured collections (managed, connected, findable and reusable) that can provide more value.
Big data - size matters… but so do accessibility, interoperability and reusability
The Microsoft e-Science Workshop was also held earlier this month in Chicago, US. Here, the focus on big data continued with sessions dedicated to 'open data for open science'. With examples drawn from the US National Science Foundation's EarthCube initiative, the issues of data interoperability and reuse were again prominent. The inherent coupling between processes within natural ecosystems, and their impact on society, makes the environment an excellent example of a field in which data must be reused across domains in order to maximize knowledge. For instance, how do you ensure that satellite data can be coupled to land, sea and air observations collected independently over many years? Equally, how should the data coming out of the many instruments that make up an ocean observatory be integrated, given the various manufacturers involved and their respective data formats?
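As a toy illustration of what such coupling involves in practice, the sketch below aligns two hypothetical observation records, a satellite-derived series and a buoy series, onto a common monthly grid after harmonizing their units. The file names, column names and unit assumptions are all invented for the example.

```python
# A toy illustration of the reuse problem described above: two observation
# records collected independently (different time stamps, different units)
# are aligned onto a common monthly grid before they can be compared.
# File names and column names here are hypothetical.
import pandas as pd

satellite = pd.read_csv("satellite_sst.csv", parse_dates=["timestamp"])
buoy = pd.read_csv("buoy_sst.csv", parse_dates=["obs_time"])

# Harmonize units (assume the buoy record is in Fahrenheit).
buoy["sst_c"] = (buoy["sst_f"] - 32.0) * 5.0 / 9.0

# Resample both records to monthly means on a shared time index.
sat_monthly = satellite.set_index("timestamp")["sst_c"].resample("MS").mean()
buoy_monthly = buoy.set_index("obs_time")["sst_c"].resample("MS").mean()

combined = pd.concat(
    {"satellite": sat_monthly, "buoy": buoy_monthly}, axis=1
).dropna()
print(combined.corr())
```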
E-education
The 8th IEEE International Conference on eScience 2012 was also held in Chicago, US, earlier this month. At this event, much emphasis was placed on the importance of education. The specialist nature of the algorithms widely used in modern scientific research means that graduates moving into computational research fields from undergraduate science courses often need to be taught advanced programming skills. Gregory Wilson from Software Carpentry expanded upon this theme in his keynote speech. He highlighted the need for 21st-century researchers not only to be educated in how to use the outputs of e-science (the data scientist), but also to be given the skills to manage the software they produce (the software engineer).
Clearly, training up all researchers to be software engineers is a challenging prospect, especially given the time pressure on most degree courses. But providing graduates with intensive software engineering courses lasting from two days to two weeks has been shown to have a significant impact. Nevertheless, even with this approach, Wilson argues, it is important to first teach students what they are likely to see value in, rather than immediately focusing on fundamental computing science concepts.
Such 'e-education' is of paramount importance in ensuring that research software is developed effectively by those in the community, so that it is both maintainable and sustainable. Another investment that would benefit the scientific programmer is access to an easily reusable library of software modules. This would allow researchers to build on the shoulders of those who have come before… without, of course, stepping on any toes.
Organizations such as the Open Geospatial Consortium (OGC) - composed of representatives from academia and industry - play an important role for the environmental community, given the geospatial nature of many of its datasets, and their standards form the basis of much of the interoperability work that now takes place between different geospatial systems. The focus of their work, and that of other similar organizations, is not just on standard service interfaces but also on standard data languages. The issue is about opening up your data for access and reuse, not just opening up your database for access.
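The sketch below shows what a standard service interface looks like in practice: the same OGC Web Feature Service (WFS) request pattern can be sent to any compliant server, which replies with data expressed in a standard language (GML). The endpoint URL and feature type name here are placeholders rather than a real service.

```python
# A minimal sketch of the "standard service interface" idea: the same
# OGC Web Feature Service (WFS) request pattern works against any
# compliant server. The endpoint URL and feature type name below are
# placeholders, not a real service.
import requests

WFS_ENDPOINT = "https://example.org/geoserver/wfs"  # hypothetical endpoint

def get_capabilities(endpoint: str) -> str:
    """Ask a WFS server to describe the feature types it offers."""
    params = {"service": "WFS", "version": "2.0.0", "request": "GetCapabilities"}
    response = requests.get(endpoint, params=params, timeout=30)
    response.raise_for_status()
    return response.text  # XML capabilities document

def get_features(endpoint: str, type_name: str, count: int = 10) -> str:
    """Fetch a handful of features as GML, the standard data language."""
    params = {
        "service": "WFS",
        "version": "2.0.0",
        "request": "GetFeature",
        "typeNames": type_name,
        "count": count,
    }
    response = requests.get(endpoint, params=params, timeout=30)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(get_capabilities(WFS_ENDPOINT)[:500])
```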
Given the size and number of the environmental data sets generated by instruments and simulations, converting this data into information and knowledge presents many challenges. High-performance computing can provide the raw simulation output, while high-throughput computing can support 'ensemble studies' that explore the sensitivity of the simulations. These local resources can be supplemented by capabilities accessed through grids and through commercial and publicly funded clouds.
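The ensemble idea can be illustrated with a small parameter sweep: the same model is run many times over a range of parameter values, an embarrassingly parallel workload well suited to high-throughput computing. The 'simulation' below is a deliberately trivial stand-in for a real model.

```python
# An illustrative ensemble study: run the same (toy) simulation over a
# sweep of parameter values in parallel. Real ensembles would farm these
# runs out to high-throughput computing resources rather than local cores.
from concurrent.futures import ProcessPoolExecutor
import math

def simulate(diffusion_coefficient: float) -> float:
    """Toy model: return a summary statistic for one parameter value."""
    return sum(math.exp(-diffusion_coefficient * t) for t in range(1000))

if __name__ == "__main__":
    parameters = [0.001 * i for i in range(1, 101)]  # 100 ensemble members
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, parameters))
    # Sensitivity: how the summary statistic varies across the sweep.
    print(min(results), max(results))
```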
So, has the hype around big data grown to the point where it has now swallowed up the cloud, which in turn is still bloated from having gobbled up the grid?
One of the primary challenges of big data is actually finding the infrastructure to analyse it. Here, the cloud's business model of flexible and rapid provisioning of resources demonstrates its strengths. Creating the storage to handle the intermediate and final results generated by a high-performance computing cluster provisioned on demand clearly demonstrates the need for a flexible resourcing model. And since the data used in the analysis will need to be retrieved from, or placed into, persistent data stores, issues such as authenticated and authorized access to these distributed resources become critical - a typical grid scenario. Consequently, in moving from one paradigm to another, it is important that we do not casually discard the experience and technology we have previously gained.
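As a rough illustration of that final step, the sketch below pushes a results file from a temporary compute resource into a persistent, access-controlled object store. It uses an S3-style interface via boto3 simply because it is widely available; grid or cloud deployments would substitute their own storage services and credential handling, and the bucket name, object key and file path here are placeholders.

```python
# A sketch of the final step described above: pushing results from an
# on-demand cluster into a persistent, access-controlled object store.
# Bucket name, key and file path are placeholders; boto3 resolves
# credentials from the environment or a configured profile.
import boto3

def archive_results(local_path: str, bucket: str, key: str) -> None:
    """Upload a results file to S3-compatible storage with authenticated access."""
    s3 = boto3.client("s3")  # authenticated client built from environment credentials
    s3.upload_file(local_path, bucket, key)

if __name__ == "__main__":
    archive_results("results/run_042.nc", "my-project-archive", "ensembles/run_042.nc")
```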