- Nineteenth-century prison cards form a pre-digital database
- Complex handwritten records test machine learning abilities
- XSEDE connects humanists with supercomputing infrastructure
Paris, 1885: A man is arrested for theft. At the police station, a clerk records his name, photographs his face and profile, measures his body in fourteen places, and notes any scars or tattoos.
Has he been arrested before? It’s difficult to say.
By this time, the population of Paris is nearly 2 million, and police can’t recognize every criminal personally. National ID cards don’t yet exist. Fingerprinting won’t be accepted for another twenty years.
So the prisoner’s body is broken apart — decomposed — into data and recorded on a specialized Bertillon card. Filed according to a complex system, Alphonse Bertillon’s method effectively creates a pre-digital database of recidivism.
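The "database" logic of Bertillon's cabinet can be sketched in a few lines of code. The sketch below is illustrative only: the field names, thresholds, and card numbers are invented, and Bertillon's real classification rules were far more elaborate. The core idea, though, is his: each measurement is binned into small, medium, or large, and the tuple of bins tells the clerk which drawer to search.

```python
# Illustrative sketch (not Bertillon's exact rules): each anthropometric
# measurement is binned, and the tuple of bins selects the filing drawer.
# A repeat offender's fresh measurements lead back to the same drawer.

# Hypothetical thresholds in centimeters for three of the fourteen measurements.
THRESHOLDS = {
    "head_length": (18.5, 19.5),
    "left_foot": (25.0, 26.5),
    "left_middle_finger": (11.0, 11.8),
}

def bin_measurement(name, value):
    """Classify one measurement as 'small', 'medium', or 'large'."""
    small_max, medium_max = THRESHOLDS[name]
    if value <= small_max:
        return "small"
    if value <= medium_max:
        return "medium"
    return "large"

def drawer_for(card):
    """The tuple of bins acts as the cabinet's 'primary key'."""
    return tuple(bin_measurement(n, card[n]) for n in sorted(THRESHOLDS))

# Measurements taken at two arrests, a year apart, of the same man.
cabinet = {}
arrest_1885 = {"head_length": 19.1, "left_foot": 26.0, "left_middle_finger": 11.4}
cabinet.setdefault(drawer_for(arrest_1885), []).append("card #1885-042")

arrest_1886 = {"head_length": 19.2, "left_foot": 25.9, "left_middle_finger": 11.5}
print(drawer_for(arrest_1886) == drawer_for(arrest_1885))  # → True: same drawer, possible match
```

Because body measurements barely change between arrests, both visits hash to the same drawer, and a small stack of cards, rather than millions, must be compared by photograph and distinguishing marks.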
The Bertillon method soon spreads beyond Paris to the United States, where completed cards still exist in archives and historical societies.
What can these cards of long-dead prisoners tell us today?
With the help of scientists from the Extreme Science and Engineering Discovery Environment (XSEDE), Langmead and art history colleague Josh Ellenbogen are analyzing over 12,000 nineteenth-century cards from the Ohio State Reformatory and Ohio State Penitentiary.
“Building this dataset is like planting a seed in the ground,” says Langmead. “The dataset will flower into many different subject areas and tell us more about the human experience.”
What computers don’t see
The collected cards account for the vast majority of Ohio's state prisoners from 1897 to 1911 and contain a wealth of sociological information, including a prisoner's race, nationality, crime committed, eye color, and even ear shape.
Langmead and her students visited the Ohio History Connection in Columbus and photographed both sides of 12,500 cards.
But transforming the resulting 25,000 images from raw historical artifacts into a usable dataset is a challenging task: The cards are old and warped, and the ink is blotchy.
And then there’s the nineteenth-century handwriting.
“All these steps that we usually assume are easy — once you add a little bit of noise, recognition becomes a very difficult problem,” says Paul Rodriguez, research analyst at the San Diego Supercomputer Center (SDSC).
Rodriguez, who is using the cards to enhance offline cursive-handwriting recognition on the Comet supercomputer, explains that much machine-learning image-recognition research works with data that has been pre-cleaned.
While the Bertillon cards contain text of predictable types in a constrained form, humans are virtually unconstrained in how they fill them out.
“In many ways, the negative results are more interesting than the positive results,” says Rodriguez. “This project let us explore what these so-called ‘deep’ computer machine-learning methods are capable of and how far they can go for us.”
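Rodriguez's point about noise can be made concrete with a toy experiment. Everything in the sketch below is invented for illustration — the glyphs, the naive template matcher, and the noise model are not the project's actual methods — but it shows the general phenomenon: a recognizer that is perfect on clean input degrades as pixels are randomly corrupted, the way blotchy ink and faded paper corrupt a scanned card.

```python
import numpy as np

# Illustrative only: a toy nearest-template matcher stands in for a real
# recognizer, to show how added noise (blotches, fading) erodes accuracy.
rng = np.random.default_rng(0)

# Two hypothetical 8x8 "glyph" templates (1 = ink, 0 = paper).
glyph_a = np.zeros((8, 8)); glyph_a[1:7, 2] = 1; glyph_a[1:7, 5] = 1; glyph_a[3, 2:6] = 1
glyph_b = np.zeros((8, 8)); glyph_b[1:7, 2] = 1; glyph_b[1, 2:6] = 1; glyph_b[6, 2:6] = 1

def classify(img):
    """Label an image with the template it overlaps most, pixel for pixel."""
    scores = {name: np.sum(img == t) for name, t in [("a", glyph_a), ("b", glyph_b)]}
    return max(scores, key=scores.get)

def add_noise(img, flip_prob):
    """Flip each pixel with probability flip_prob, mimicking blotches and fading."""
    mask = rng.random(img.shape) < flip_prob
    return np.where(mask, 1 - img, img)

# Clean input: trivially correct.
assert classify(glyph_a) == "a"

# Accuracy over many noisy copies drops as the corruption level rises.
for p in (0.0, 0.25, 0.5):
    correct = sum(classify(add_noise(glyph_a, p)) == "a" for _ in range(200))
    print(f"noise={p:.2f}  accuracy={correct / 200:.2f}")
```

Real systems face the same cliff at much lower noise levels, because cursive glyphs are far less distinct from one another than these two toy templates.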
To store and manage the variations in the front-side card images, Langmead looked to Sandeep Puthanveetil Satheesan, research programmer in the Innovative Software and Data Analysis Division at the National Center for Supercomputing Applications.
Puthanveetil Satheesan advised the team to use Clowder, a web-based, scalable, open-source research-data-management system. They then processed the images on the Bridges supercomputer at the Pittsburgh Supercomputing Center, drawing on extractor services from the Brown Dog project.
“The importance of identifying as many variations in the data as possible, and as early as possible in the lifespan of a project, can’t be emphasized enough — especially when the processing algorithms depend upon grouping the data based on such variations,” says Puthanveetil Satheesan.
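In code, that advice amounts to partitioning the images by variant before choosing a processing routine for each group. The sketch below is hypothetical — the variant names and the year-based detection rule are invented stand-ins for whatever a real detector (say, template matching on the card layout) would report — but it shows the shape of the workflow.

```python
# Hypothetical sketch: detect each card's variant up front, group the images
# by variant, and only then pick a processing routine per group.

def detect_variant(card_meta):
    """Toy stand-in for a real detector (e.g., template matching on the image)."""
    if card_meta["year"] < 1900:
        return "early-layout"
    return "late-layout"

def group_by_variant(cards):
    """Bucket card image files by detected layout variant."""
    groups = {}
    for card in cards:
        groups.setdefault(detect_variant(card), []).append(card["file"])
    return groups

cards = [
    {"file": "card_0001.jpg", "year": 1897},
    {"file": "card_0002.jpg", "year": 1905},
    {"file": "card_0003.jpg", "year": 1899},
]
print(group_by_variant(cards))
# → {'early-layout': ['card_0001.jpg', 'card_0003.jpg'], 'late-layout': ['card_0002.jpg']}
```

Discovering a new variant late forces every already-processed group to be revisited, which is why cataloging the variations early matters so much.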
Who needs supercomputing?
Langmead, Rodriguez, and Puthanveetil Satheesan were brought together by Alan Craig, digital humanities specialist for the XSEDE project.
“My role is to engage new and different kinds of projects, particularly from the arts and humanities,” says Craig. “I find the technical people to work with them and make sure the computing resources are available.”
Academics in all fields can make the mistake of believing that humanists don’t need technology to complete their research. But that’s not so, says Langmead.
“There are lots of reasons why humanists would need supercomputing infrastructure. In my case, I had really interesting data and colleagues who wanted to try out these systems on it. It’s a match made in the twenty-first-century academy.”
“We’re all scientists, we’re all looking for knowledge,” she says. “The computer is a tool — it can be used for whatever we need.
All humanists are already digital humanists in some small measure. Some of us also take it as our charge to become much more highly engaged with the ways that computing can be used to facilitate our research.”