- Researchers at Barcelona Supercomputing Center are mapping the evolution of COVID-19 cases
- Machine learning and natural language processing tools can help doctors predict what kind of care a patient may need
- The team did all of this while working with often imperfect data
One of the worst parts of the current epidemic, other than the loss of life, has to be the uncertainty. Symptoms can take up to two weeks to present, so being exposed to a COVID-positive person can mean days of waiting. Even for those who remain uninfected, it’s still impossible to know when life will go back to normal.
With so much doubt about the future, it can be refreshing to remember that some of our smartest humans are working tirelessly to bring more certainty back into our lives. Such is the case with the scientists at the Barcelona Supercomputing Center (BSC), who are using machine learning and supercomputing to help healthcare professionals predict how a patient’s COVID-19 illness will evolve.
“The objective is to develop a model that allows doctors to predict the outcome for the patients,” says principal investigator Marta Villegas. “Not only the outcome, but ideally also if they will require respiratory assistance.”
With the help of neural networks and natural language processing (NLP) tools, this BSC group hopes not only to help individual patients but also to prepare hospitals for future influxes of COVID-related cases.
Predicting the unpredictable
Machine learning models work by learning patterns from previous cases and using those patterns to project future results. This project therefore incorporated data on 3,051 COVID-19 episodes, corresponding to 2,440 unique patients, provided by HM Hospitals in Madrid, which is working with the Hospital Clínic de Barcelona (HCB).
This data included variables like patient temperature, arterial pressure, laboratory test results, the treatments prescribed, diagnosis, and more.
“For each of these sets of variables, there is temporal information,” says Villegas. “So, we know the fever for each day. And then there is our daily prediction for each of the patients. That takes into account the information of that day plus the information of all previous days. The sooner we tell the doctors that a patient is in danger of dying or requires respiratory assistance the better.”
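The idea of a daily prediction that draws on that day plus all previous days can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the choice of summary features, and the sample temperatures are hypothetical and are not taken from the BSC model.

```python
from statistics import mean

def daily_feature_rows(temperatures):
    """Build one feature row per day from that day and every earlier day."""
    rows = []
    for day in range(len(temperatures)):
        history = temperatures[: day + 1]  # today plus all previous days
        rows.append({
            "day": day,
            "temp_today": history[-1],
            "temp_max_so_far": max(history),
            "temp_mean_so_far": mean(history),
        })
    return rows

# Hypothetical patient: one temperature reading per day, in degrees Celsius.
rows = daily_feature_rows([37.0, 38.5, 39.0, 38.0])
for row in rows:
    print(row)
```

Each row could then be fed to a predictive model, so the estimate for a given day never depends on information from later days.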
The model is meant to function as a tool that allows clinicians to better identify problems in individual patients, but it will also create a macro-level assessment of how a hospital will be affected by COVID-19 needs. For instance, the model could tell a hospital when a large number of patients will need respiratory assistance such as ventilators, which may be in short supply. The advance information could help the hospital prepare and allocate its inventory.
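One simple way to picture the step from individual predictions to a hospital-level view is to add up per-patient risk estimates. The sketch below assumes the model outputs a probability per patient, which the article does not state; the function names and numbers are hypothetical.

```python
def expected_ventilator_demand(probabilities):
    """Expected number of patients needing respiratory assistance."""
    return sum(probabilities)

def shortfall(probabilities, ventilators_available):
    """Expected demand beyond current inventory, never below zero."""
    return max(0.0, expected_ventilator_demand(probabilities) - ventilators_available)

# Hypothetical ward: predicted probability of needing a ventilator, per patient.
ward = [0.5, 0.25, 0.75, 0.5]
print(expected_ventilator_demand(ward))  # 2.0
print(shortfall(ward, ventilators_available=1))
```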
Of course, health data is private and there are many legal and ethical guidelines to consider when working with it. Villegas says that her group signed agreements with the hospitals that provided the data, and confidentiality has been a primary focus of the project.
Working with real data
In the midst of this pandemic, most of us are just doing our best with what we have, and Villegas and her team are no different. Much of the data Villegas received wasn’t ideal for training a machine learning model.
“You can’t use the data they supply the way it is,” says Villegas. “It reports some people with fevers of 3,000 degrees. Most of the workload had to do with cleaning, standardizing, and normalizing data.”
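As a rough illustration of what that cleaning and normalizing can involve, the sketch below treats physiologically implausible temperatures as missing and then rescales the remaining values to the range 0 to 1. The plausibility bounds (30 to 45 degrees Celsius) and the function names are assumptions made for this example, not the team's actual rules.

```python
def clean_temperatures(values, low=30.0, high=45.0):
    """Replace missing or implausible readings with None (bounds are assumed)."""
    return [v if v is not None and low <= v <= high else None for v in values]

def min_max_normalize(values):
    """Rescale the non-missing values to [0, 1], leaving None in place."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    return [
        None if v is None else ((v - lo) / span if span else 0.0)
        for v in values
    ]

raw = [36.5, 3000.0, 39.5, None]   # includes a "3,000 degree" entry
cleaned = clean_temperatures(raw)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```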
And while all of the health data for 3,051 COVID-19 episodes may seem like a lot, it’s a very small amount of information on which to build a neural network. Villegas says this made it all the more important to clean the data and ensure there were no empty variables or missing information.
The team also needed to find the best hyperparameter configuration to govern the training process for the neural network. This required 960 hours of compute time (20 GPUs running for 48 hours) on the MareNostrum supercomputer at BSC, comparing 14,000 different possible configurations.
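The figures are easy to check: 20 GPUs for 48 hours is 960 GPU-hours, which works out to roughly four minutes per configuration if the time were split evenly across all 14,000. The sketch below does that arithmetic and then shows a toy exhaustive search. The search space and the scoring function are hypothetical stand-ins; the article does not describe which hyperparameters or search strategy the team used.

```python
from itertools import product

# Arithmetic from the figures quoted in the article.
gpus, hours, configs = 20, 48, 14_000
gpu_hours = gpus * hours                        # 960
minutes_per_config = gpu_hours * 60 / configs   # about 4.1, assuming an even split

# Hypothetical search space, far smaller than the real one.
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_units": [32, 64, 128],
    "dropout": [0.0, 0.2, 0.5],
}

def validation_score(config):
    """Stand-in for training a network and scoring it on held-out data."""
    return (
        -abs(config["learning_rate"] - 1e-3)
        - abs(config["hidden_units"] - 64) / 1000
        - abs(config["dropout"] - 0.2)
    )

names = list(search_space)
candidates = [dict(zip(names, values)) for values in product(*search_space.values())]
best = max(candidates, key=validation_score)
print(gpu_hours, round(minutes_per_config, 1), len(candidates), best)
```

In a real search, each call to the scoring function would be a full training run, which is why the work is spread across many GPUs.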
Additionally, a lot of the data fed into the machine learning model came from clinical reports. These are written in natural language, the kind of language humans use to talk to each other. Machines have a hard time reading this kind of data, so Villegas and the team needed to apply NLP text mining techniques to extract the information they needed.
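To give a feel for what extracting structured values from free text means, here is a deliberately simple rule-based sketch using regular expressions. It is not the team's pipeline, which the article does not detail and which would likely rely on more capable NLP tools; the note text, patterns, and field names are invented for illustration.

```python
import re

# Assumed phrasings; real clinical reports vary far more than this.
TEMPERATURE = re.compile(
    r"temperature\s*(?:of\s*)?(\d{2}(?:\.\d)?)\s*(?:°\s*)?C\b", re.IGNORECASE
)
SATURATION = re.compile(r"saturation\s*(?:of\s*)?(\d{2,3})\s*%", re.IGNORECASE)

def extract_vitals(note):
    """Pull temperature and oxygen saturation out of a free-text note."""
    temp = TEMPERATURE.search(note)
    sat = SATURATION.search(note)
    return {
        "temperature_c": float(temp.group(1)) if temp else None,
        "oxygen_saturation_pct": int(sat.group(1)) if sat else None,
    }

# Invented example note.
note = "Patient reports dry cough. Temperature 38.6 C on admission, oxygen saturation 91%."
vitals = extract_vitals(note)
print(vitals)
```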
After all of the effort put into creating this model, Villegas is committed to continuing the work. Her team plans to make their results accessible to anyone who wishes to see them, thereby making future research possible. She also seems hopeful that her team’s efforts will pay off.
“The team had no holidays this summer,” says Villegas. “We were all working on this. We had no weekends. We work ten hours a day or more. The pandemic is so close to us and we are happy to be part of this.”