At the recent Internet2 Global Summit, iSGTW sat down with George Komatsoulis, senior bioinformatics specialist at the US National Center for Biotechnology Information (NCBI), part of the US National Library of Medicine, National Institutes of Health.
Komatsoulis discusses the state of distributed research and the NIH Commons, a scalable virtual environment that will provide high-performance computing and data storage for biomedical research. Early pilot activities are expected in 2015.
How would you characterize the state of big data usage in the bio-medical fields today?
The creation of the office of the associate director of data science at the NIH was a broad recognition that data science problems are a critical part of doing biomedical research. There have been many initiatives to come to grips with this reality. At the National Cancer Institute there are the Cancer Genomics Cloud Pilots. There's the genomic data commons at the Office of Cancer Genomics, and at the NIH level there is the Big Data to Knowledge (BD2K) program.
There's talk of a looming big data tsunami. How would you characterize the challenges of big data analysis?
It is certainly true that there are challenges, but it's important to keep perspective. This isn't the first time data growth has outstripped computing capacity. In 1965, when Margaret Dayhoff published the Atlas of Protein Sequence and Structure, a researcher could generate 100 amino acids' worth of protein sequence in a year — and they were working with punch-card technology.
In 1977, Fred Sanger, and Maxam and Gilbert, independently developed new sequencing technologies that allowed a person to generate about 2,000 finished sequences a year. In 1986, dye-terminator technology came along, and when automated sequencers became common in the early 90s you could generate a few thousand bases daily. When microarrays emerged in the mid 90s, this jumped to a few hundred thousand observations per chip per day.
You keep seeing these kinds of inflection points, and what we did then is what we need to do now: judiciously apply the technology we have and look to the next generation of technology for support. As Ernest Rutherford once said: 'Gentlemen, we have run out of money. It's time to start thinking.'
At the end of the day we're going to get an enormous amount of data with the potential to fundamentally change our understanding of biology, human health, and disease. Thinking of big data as a tsunami has such negative connotations; it's not really the way to think about it.
How does the Commons fit into this picture?
The Commons is a scalable three-part environment that will provide computing and storage for sharing digital biomedical research products. At the bottom you have some kind of computing platform required for storage and computation. In the middle you have search: a means to find things in the Commons. Lastly, you have the digital objects themselves, with a way of identifying and citing them.
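The three layers Komatsoulis describes can be pictured with a minimal sketch. Note that this is purely illustrative: the class, field, and function names below (DigitalObject, cite, search, the "nihc:" identifier scheme) are invented for this example and are not part of any NIH specification.

```python
from dataclasses import dataclass

@dataclass
class DigitalObject:
    """Top layer: a research product with a persistent, citable identifier."""
    object_id: str   # hypothetical persistent identifier, e.g. "nihc:0001"
    title: str
    creator: str
    location: str    # where the bytes live on the underlying compute platform

def cite(obj: DigitalObject) -> str:
    """Produce a simple citation string for a digital object."""
    return f"{obj.creator}. {obj.title}. Commons identifier: {obj.object_id}"

def search(index: list[DigitalObject], term: str) -> list[DigitalObject]:
    """Middle layer: find objects in the Commons by keyword."""
    return [o for o in index if term.lower() in o.title.lower()]
```

In this toy model, the compute platform is represented only by each object's storage location; the point is simply that identification, discovery, and storage are separable concerns.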
Individual investigators, existing databases, and programs such as BD2K will be the sources of the digital objects. The Commons will make it easier to use (and reuse) the data and software contained within these well-curated resources; it won't replace them.
How will the Commons be implemented?
We're not looking to build the Commons as a unitary system. We're trying to have it evolve through development by the information technology community.
To instigate this evolution, we're going to distribute dollar-denominated credits to researchers with interesting digital objects or with a desire to use the Commons for interesting biomedical research. Researchers can use these credits at the approved provider of their choice.
The Commons will also be a consortium of qualified providers — public or private clouds, academic or national labs, and high-performance computing centers — who are willing to meet requirements defined by the NIH.
Our goal is to incentivize investigators to put things into the Commons, both to make them available and to do additional research. We also want to incentivize them to choose the provider with the best value for the particular types of data and analyses that they want to do. And we want to incentivize providers to provide basic and potentially value-added services at the lowest possible price, so they can compete for the credits held by investigators.
Ultimately, we're looking to set up a marketplace that will drive down costs and enhance our ability to sustain these kinds of digital resources.
What do you hope to gain from this novel approach?
We think this has a lot of potential advantages. We expect it to be cost-effective — rather than distributing resources across hundreds of institutions where they may not be used at capacity, we'll pay only for the IT resources we use. And because we frequently have multiple copies of the same digital objects, we can reduce costly duplication.
We're hoping to democratize access to these datasets. Less wealthy institutions will be able to buy only the compute capacity they need, taking advantage of the advanced capabilities in the nation's high-performance computing centers.
What's more, investigators — the people who have the best notion of what constitutes these high-value data sets — will have an unprecedented degree of control over what kind of data is made available.
How does the Commons work for investigators who need control over their data?
The notion of a Commons doesn't necessarily mean that everything in there is immediately public. The requirements we've got for vendors support multiple levels of access, and I think we can craft appropriate policies that balance an investigator's need to have control over their data — particularly in the early stages, say, during a publication embargo — with the public and the research community's needs to have broad access to the data.