Mountains of data

Long, long ago, in a time when professional travel was still possible, Rachana Ananthakrishnan, executive director and head of products for Globus at the University of Chicago, visited the Indiana University (IU) campus in Bloomington to give the annual , sponsored by IU’s and .

She also stopped by the Science Node offices to share her insights on how cyberinfrastructure providers have been shaped by unique aspects of the research enterprise. Recent disruptive technologies and trends are creating new challenges and opportunities in data management for today’s research community.

Welcome, Rachana! I’ve been hearing a lot recently about data overload in the research community. Can you break down what that really means and why it’s a problem?

[Image: Fast, reliable access to large datasets is vital for researchers racing to cure diseases or predict destructive weather such as hurricanes.]

Researchers have lots of data they have to organize and process. Datasets are growing in size and number, and data is being generated everywhere. Instruments used in cutting-edge research are creating mountains of data. The speed at which researchers can access data is more important than ever as they race to find cures for diseases, predict the location of the next earthquake, or follow the path of the next hurricane.

Researchers want to spend more time on their research and less time on data management, which is not their end goal. Their mission is to get insights out of the data and derive value from it as quickly as possible. These terabytes and even petabytes of data are becoming more distributed, whether they sit in an institution’s data center or in the cloud. Researchers need to collaborate simply and securely, so they are turning to modern tools and frameworks.

In almost all cases, there are standard operations people do. They have to move data across systems for processing, visualization, and archiving. They have to share it with collaborators and do that securely. They have to attach descriptive metadata so the data can be discovered. And finally they have to automate these processes as much as possible for efficiency, for example by constructing data pipelines that researchers can run easily. Our team at Globus provides constructs for these with the mission of reducing time to science and discovery.
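The standard operations listed above can be sketched in a few lines of Python. This is a toy illustration only and not how Globus works: the "systems" are local directories, the function names (`move`, `describe`, `run_pipeline`) and the `project` field are hypothetical, and secure sharing is left out entirely.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def move(src: Path, dest_dir: Path) -> Path:
    """Copy a file to another 'system' (here just a directory) and verify it."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    if sha256(src) != sha256(dest):
        raise IOError(f"checksum mismatch for {src.name}")
    return dest


def describe(path: Path, **fields: str) -> Path:
    """Write a JSON metadata sidecar next to the file to aid discovery."""
    record = {
        "name": path.name,
        "bytes": path.stat().st_size,
        "sha256": sha256(path),
        **fields,
    }
    sidecar = path.with_name(path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


def run_pipeline(src: Path, processing: Path, archive: Path) -> dict:
    """Automate the steps: stage for processing, describe, then archive."""
    staged = move(src, processing)
    sidecar = describe(staged, project="example-project")  # hypothetical field
    archived = move(staged, archive)
    archived_sidecar = move(sidecar, archive)
    metadata = json.loads(archived_sidecar.read_text())
    return {**metadata, "archived_at": str(archived)}
```

Calling `run_pipeline(Path("scan.dat"), Path("processing"), Path("archive"))` would stage the file, write its metadata, and archive both in one step, which is the kind of repeatable pipeline the answer describes.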

Can you give us an example?

I’ll take an example presented at our user conference last year. An assistant professor in radiology at Harvard Medical School works with the SIGNAL project to study drug efficacy in rare diseases. Since we’re talking about people with rare diseases, the individual patients are widely distributed across the country.

[Image: Biomedical imaging data, like MRI scans of the brain, may one day improve the lives of people with rare diseases. But transferring this data from outpatient centers and processing it for researchers presents additional challenges. Courtesy MGH Martinos Center for Biomedical Imaging.]

MRI and PET scans are performed at clinical sites, often outpatient centers and some research sites. All of the resulting data needs to be pushed to the core processing facility at the Martinos Center for Biomedical Imaging at Massachusetts General Hospital for analysis and processing.

The tools for this must be easy for the technicians at an outpatient center to install and use. But they must also comply with Institutional Review Board rules and security policies, and navigate firewalls and other infrastructure protections set up at each site. And they must reliably move the data and ensure access only for authorized users. On top of that, managing the data throughout its life cycle requires some automation so that it doesn’t consume significant amounts of time.

How is this different than in other sectors?

Academic research is a collaborative enterprise, and the vast majority of projects cross multiple institutional boundaries. Typically, a grant has participants from multiple institutions, and you have to navigate the security policies of all of them for the project to succeed.

[Image: Rachana Ananthakrishnan, executive director and head of products for Globus.]

A researcher wants to be able to say, “Here is some specific data shared with you, but only for a period of time, or for use on this particular service.” In some cases they might have information that requires higher security, yet still need to grant collaborator access.

But even practitioners of open science whose data is publicly accessible still want closed access for some part of the data lifecycle, and ultimately they want usage metrics. So a model where there is a big wall separating internal and external resources that may work in other sectors isn’t sufficient here. Architectures and services that ensure researchers are able to work collaboratively while still adhering to the required security standards are key.
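The sharing rule quoted above ("shared with you, but only for a period of time, or for use on this particular service") can be written down as a small access check. The sketch below is an assumption-laden illustration, not a description of any real product: `ShareGrant`, `is_allowed`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ShareGrant:
    """A hypothetical record of one sharing decision."""
    path: str                      # folder or file being shared
    collaborator: str              # identity allowed to read it
    expires_at: datetime           # access ends at this moment
    service: Optional[str] = None  # if set, usable only from this service


def is_allowed(grant: ShareGrant, user: str, path: str, service: str,
               now: datetime) -> bool:
    """Return True only if every condition of the grant is met."""
    if user != grant.collaborator:
        return False
    # Allow the shared path itself or anything beneath it, but not siblings
    # that merely start with the same characters.
    prefix = grant.path.rstrip("/") + "/"
    if path != grant.path and not path.startswith(prefix):
        return False
    if now >= grant.expires_at:
        return False
    if grant.service is not None and service != grant.service:
        return False
    return True


# Example: share one folder for 30 days, for use on one service only.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
grant = ShareGrant(
    path="/project/scans",
    collaborator="alice@example.org",
    expires_at=start + timedelta(days=30),
    service="analysis-portal",
)
```

Each condition is checked separately and access is denied by default, so a request from the wrong person, after the expiry date, or through a different service fails even when everything else matches.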

Given the conflict between keeping data secure but also making it usable, how do you develop a product that meets both needs?

We invest quite a bit in understanding the user journey for researchers and administrators using our product. For the person who comes in and wants to use the system, we want to make the experience as frictionless as possible—simple to manage their own data and share it. But as you noted, there is always tension between the various security requirements and simplifying the user journey.

So we think about things like: Is it easy for users to find what they are looking for? How often do they have to log in? How can we help them navigate the credentials needed for the resources they access across multiple institutions?

[Image: Keeping data secure. Balancing security with the needs and preferences of users is always a concern when building tools and resources for the research community.]

There are also security architecture and best practices, and how they translate into practice for the user. For something as simple as moving data from, say, Indiana University to the University of Chicago, there are many steps that span multiple services: the user has to log in to each system separately, grant services permission to act on their behalf to list and transfer files, grant services permission to look up their group membership to determine whether they have access, and so on.

The “least privilege” security model isolates each of these capabilities as an independent service and provides only the minimum access to each, as needed. But in that case, the user has to be prompted at each of the various steps for what is, to them, a single action: moving data from one storage system to another.

They’ll be prompted: “Do you want to allow service A?” Later they get asked: “Service B is needed, do you want to allow that?” And you wonder about the cognitive load on the user. Does the user really understand—do they need to understand—everything that goes on behind the scenes? As an end user, how much do I want to know?
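The trade-off between least privilege and the number of prompts can be made concrete with a small sketch. Everything here is hypothetical: the scope names, `ConsentStore`, and both request functions are invented for illustration and do not describe any real service's consent flow.

```python
from dataclasses import dataclass, field

# Hypothetical permissions that a single transfer might need.
REQUIRED_SCOPES = [
    "source-storage:list",
    "source-storage:read",
    "destination-storage:write",
    "groups:view-membership",
]


@dataclass
class ConsentStore:
    """Remembers which scopes the user has already approved."""
    granted: set = field(default_factory=set)

    def missing(self, required):
        """Return the required scopes that have not been approved yet."""
        return [scope for scope in required if scope not in self.granted]


def request_one_at_a_time(store, required, approve):
    """Prompt separately for each missing scope; return the prompt count."""
    prompts = 0
    for scope in store.missing(required):
        prompts += 1
        if not approve([scope]):
            raise PermissionError(f"user declined {scope}")
        store.granted.add(scope)
    return prompts


def request_bundled(store, required, approve):
    """Prompt once for all missing scopes together; return the prompt count."""
    missing = store.missing(required)
    if not missing:
        return 0
    if not approve(missing):
        raise PermissionError("user declined the request")
    store.granted.update(missing)
    return 1
```

With a user who approves everything, the first approach interrupts four times for one transfer and the second interrupts once. The set of privileges granted is the same in both cases; what changes is how much the user is asked to read and understand in a single prompt, which is the balance the answer describes.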

It is hard to strike a balance among the security that is needed, the level of risk that is acceptable, and how much risk is actually mitigated by placing more responsibility on the user. These are some of the issues we deal with as we design our products.

It seems like you really have to try to put yourself in the user’s shoes. What makes it worth it?

When I hear the stories of the kinds of science we enable, I feel there’s a big societal impact. It’s really cutting edge. It’s exciting and humbling all at once.

[Image: Big societal impact. Ananthakrishnan’s work lets her play a role in cutting-edge projects that map the universe or try to find out what happens in the brains of jazz players when they get creative. Courtesy Gavin Whitner (CC BY 2.0, https://creativecommons.org/licenses/by/2.0/).]

It’s wonderful to meet our users and talk to them. I’ve had some of the best conversations, sitting across the table from someone telling me about their research on dark matter or how they’re mapping the universe, or even mapping the human brain! From research on creativity where they’re doing MRIs of jazz players as they improvise to find out what happens in your brain when you get creative, to work on understanding climate change and impacts, we get a chance to work with and meet a wide variety of engaged and passionate scientists.

To think that we actually play a small role in making that happen—that’s what has kept me in this space for so many years. It’s very exciting to be in this space.
