- Lack of reproducibility hinders scientific discovery
- Dataverse repository helps researchers share, analyze, and cite data across disciplines
- Greater transparency supports researchers working together to produce more reliable science
Reproducibility is an essential component of reliable science. But according to a survey conducted by Nature, more than 70 percent of researchers surveyed failed to reproduce another scientist’s work.
If researchers can’t replicate previous results, they may waste time following false leads or publish inaccurate or incomplete information.
One solution to this problem is the Dataverse Project, an open-source repository from Harvard's Institute for Quantitative Social Science. The project provides a space for researchers to share, analyze, preserve, cite, and explore data from a variety of fields so that research can be replicated more easily.
Researchers can upload their own data and also search data from other users. The software allows organizations to set up their own data repositories—or dataverses—which they can customize for their needs. It also provides a space for a growing community of developers and users to promote data sharing and data access.
It begins with necessity
The Dataverse Project began in 2006 when researchers at the Harvard-MIT Data Center were growing tired of how difficult it was to share their datasets.
“Back then they had to make a CD just to be able to take the data away with them and bring it back,” says Mercè Crosas, Dataverse’s co-principal investigator. “At some point, this turned into ‘let’s build a web application to do this.’ Then that grew into ‘let’s build incentives so that the entire research community can share datasets with us.’”
Now, 35 organizations around the world run a Dataverse installation, each of which can host multiple repositories. To date, more than 50,000 datasets have been downloaded more than 4 million times. Dataverse’s open-source code allows researchers, journals, and institutions to easily access data from a variety of fields, from biomedical research to astronomy.
Making data visible
Crosas concedes that full reproducibility may not always be possible, but Dataverse can at least provide an additional layer of transparency.
“There are definitely cases where it would be very difficult to reproduce the entire process,” Crosas says. “On the other hand, that should not be an excuse for those cases.”
Taking a big step towards transparency, Dataverse has integrated Code Ocean, a cloud-based computational reproducibility platform. This means that instead of setting up a separate environment to run the code, researchers can upload their data and their scientific code directly to their dataverse and run it right there using Code Ocean.
Other scientists can also run the code inside the dataverse, without installing anything on their personal computer. This makes it easier for outside researchers and publications to verify the reproducibility of the research.
Many funding agencies and journals have also started to require that research data be made publicly available.
“All of this drives home the importance of making data accessible, which has made Dataverse more widely known and more frequently used,” Crosas says.
Protecting sensitive data
Going forward, Crosas says Dataverse is growing more concerned about the security of sensitive data, not only in protecting it from hacks but also in educating users about how certain kinds of information should be handled.
“We need to make sure that we not only provide all of the security environment requirements needed for the data, but also that the data depositor—the researcher that has the data—and the user who wants to use that data understand what it means to share datasets that contain sensitive information,” Crosas says.
To that end, Dataverse has developed a color-coded chart that categorizes data into six levels of sensitivity, ranging from completely open to very sensitive.
Examples of highly sensitive data include health data protected by HIPAA (Health Insurance Portability and Accountability Act), student data protected by FERPA (Family Educational Rights and Privacy Act), social media data, and data protected by GDPR (General Data Protection Regulation). However, Crosas says these distinctions are growing increasingly blurred.
“As we work with more complex data, not only the disciplines within the medical community and social sciences, but also research that is interdisciplinary with combined datasets from different resources, it becomes more difficult to understand what is private and what is not.”
Although there is still much to do, Crosas is proud of Dataverse’s evolution.
“The growth of the community has converted the project from something that I had to make work day by day to something that, now, has become a whole group contributing to make it work,” says Crosas.
The Dataverse Project facilitates in-person interactions between researchers through frequent regional meetings and an annual Global Dataverse Community Consortium, where users from around the world come together and collaborate.
As researchers increasingly depend on one another for reliable data across disciplines, teamwork and collaboration are the way forward for reproducible, accurate science.