The data from CERN's Large Hadron Collider (LHC) is exotic, complicated, and big. At peak performance, about one billion proton collisions take place every second inside the CMS detector at the LHC. CMS has collected around 64 petabytes of analyzable data from these collisions so far.
Along with the many published papers, this data constitutes the scientific legacy of the CMS collaboration, and preserving the data for future generations is of paramount importance. "We want to be able to re-analyze our data, even decades from now," says Kati Lassila-Perini, head of the CMS Data Preservation and Open Access project at the Helsinki Institute of Physics. "We must make sure that we preserve not only the data but also the information on how to use them. To achieve this, we intend to make available through open access our data that are no longer under active analysis. This helps record the basic ingredients needed to guarantee that these data remain usable even when we are no longer working on them."
CMS is now taking its first steps in making up to half of its data accessible to the public, in accordance with its policy for data preservation, re-use and open access. The first release will be made available in the second half of 2014, and will comprise a portion of the data collected in 2010.
Taking LHC data into classrooms
LHC data is also openly available through the CERN Masterclasses. This program provides real experimental data from the ATLAS, LHCb, CMS and ALICE experiments for analysis.
In principle, open scientific data allows potentially anyone to sift through it and perform analyses of their own. In practice, doing so is very difficult: CMS scientists, working in groups and relying upon one another's expertise, need many months or even years to perform a single analysis, which must then be scrutinized by the whole collaboration before a scientific paper can be published. A first-time analysis typically takes about a year from the start of preparation to publication, not counting the six months it takes newcomers to learn how to use the analysis software.
CMS, therefore, decided to study a concrete use-case for its open data by launching a pilot project aimed at education. This project, partially funded by the Finnish Ministry of Education and Culture, will share CMS data with Finnish high schools and integrate it into their physics curriculum. This data will be part of a general platform for open data provided by Finland's IT Center for Science (CSC).
CMS data is classified into four levels in increasing order of complexity of information:
- Level 1 encompasses data included in CMS publications. In keeping with CERN's commitment to open-access publishing, all the data contained in these documents and any additional numerical data provided by CMS are, by definition, open.
- Level 2 data are small samples that are carefully selected for education programs. They are limited in scope: while students get a feel for how physics analyses work, they cannot do any in-depth studies.
- Level 3 is made of data that CMS scientists use for their analyses. It includes meaningful representations of the data, along with simulations, documentation needed to understand the data, and software tools for analysis. CMS is making this analyzable level-3 data available publicly, in a first for high-energy physics.
- Level 4 consists of the so-called 'raw' data -- all the original collision data, without any physics objects, such as electrons and particle jets, identified. CMS will not make this data public.
One aim of the pilot project is to enable people outside CMS to build educational tools on top of CMS data, which will let high-school students do some simple but real particle-physics analysis, a bit like CMS scientists. "We want to create a chain where the CSC as an external institute can read our data with simplified analysis tools and convert them into a format adapted for high-school-level applications," explains Lassila-Perini.
Such a chain is crucial for making the CMS data accessible to wider audiences. Within the collaboration, you need a lot of resources to perform analyses, including lots of digital storage and distributed computing facilities. "If someone wants to download and play with our data," cautions Lassila-Perini, "you can't tell them to first download the CMS virtual-machine running environment, ensure that it is working and so on. That's where we need data centers like the CSC to act as intermediate providers for applications that mimic our research environment on a small scale."
Finland is an ideal case to pilot a program that formally introduces particle physics into school curricula. Seventy-five percent of Finnish high schools have classes that have visited CERN as part of their courses and, thanks to CERN's teacher programs, many of their teachers are familiar with the basics of particle physics. An ongoing survey of the teachers will help CMS understand their perspectives on teaching data analysis in their classrooms and take on board ideas for potential applications.
Lassila-Perini has big dreams: "Imagine a central repository of particle-physics data to which schools can sign in and retrieve data," she says. "They collaborate with other high schools, develop code together and perform analyses, much like how we work. It is important to teach not just the science but also how science works: particle physics research isn't done in isolation but by people contributing to a common goal."
Ongoing success stories with open CMS data set the stage for the pilot project. For example, the CMS Physics Masterclasses exercise, developed by QuarkNet and conducted under the aegis of the International Particle Physics Outreach Group, as well as projects such as the I2U2 e-Labs, introduce particle physics to thousands of high-school students around the world each year by teaching them to perform very simple analyses with level-2 CMS data.
A second project with more academic goals is being undertaken by CMS members at RWTH Aachen University in Germany, where third-year undergraduate physics students use web-based tools to analyze level-2 CMS data as part of courses on particle and astroparticle physics. Among other things, they learn how to calculate the masses of particles produced at the LHC.
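The article does not say which method the Aachen students use, but a common way to estimate the mass of a short-lived particle is to compute the invariant mass of its decay products. The sketch below assumes each decay product is given as a four-vector (E, px, py, pz) in GeV with natural units (c = 1). The function name `invariant_mass` and the input layout are illustrative assumptions, not part of any CMS tool.

```python
import math


def invariant_mass(p1, p2):
    """Invariant mass of a two-particle system.

    Each argument is a hypothetical four-vector tuple (E, px, py, pz)
    in GeV, using natural units where c = 1.
    """
    # Sum the energies and the momentum components of the two particles.
    energy = p1[0] + p2[0]
    px = p1[1] + p2[1]
    py = p1[2] + p2[2]
    pz = p1[3] + p2[3]

    # M^2 = E^2 - |p|^2 for the combined system.
    mass_squared = energy * energy - (px * px + py * py + pz * pz)

    # Rounding can push a near-zero value slightly negative; clamp it.
    return math.sqrt(max(mass_squared, 0.0))


# Two back-to-back, effectively massless particles of 45.6 GeV each.
lepton_a = (45.6, 0.0, 0.0, 45.6)
lepton_b = (45.6, 0.0, 0.0, -45.6)
print(invariant_mass(lepton_a, lepton_b))
```

For the two illustrative four-vectors above, the momenta cancel and the result is approximately 91.2 GeV, the sum of the two energies. In a real exercise the four-vectors would come from reconstructed particles in the data sample, and the masses would be histogrammed over many events.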
An independent use-case for public CMS data comes from the field of statistics. A group of statisticians from the Swiss Federal Institute of Technology in Lausanne (EPFL) have found the level-2 data sample to be a perfect testing ground for different statistical models. This excites Lassila-Perini: "When you provide data this way, you are not defining the end users - open data are open data!" she exclaims.
There is no doubt that other fields of science will also benefit from the release of particle-physics data by CMS. The success of the pilot project will guide future open-data policies, and CMS is well placed to lead the field.
This article was originally published on the CMS website.