iSGTW Feature - Many millions of manuscripts: data mining and digitized objects

As this Computation Institute conference room wall suggests, pen and "paper" is still a popular way to record information. But, as is also suggested, digitization can offer much in the way of improving the accessibility and utility of written materials.
Image courtesy of the Computation Institute.

Just three years old, the Computation Institute's Teraport has already consumed 2.5 million hours of computing time on more than 800,000 jobs.

James Evans, Assistant Professor in Sociology at the University of Chicago, is a Teraport regular, routinely occupying up to 30 processors at a time for his work on citation network analysis.

Multiplying results

Crunching through hundreds of CPU hours, Evans identifies patterns of interaction between universities and the biotechnology industry, using Teraport to compare the citations of every article in his database with those of every other article: more than 25 million citations in all.
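The all-pairs comparison described above can be sketched in a few lines. This is an illustrative example only, not Evans's actual method: the data and the use of Jaccard similarity over citation sets are assumptions for the sketch.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of cited references."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical toy data: article ID -> set of cited reference IDs
citations = {
    "A": {"r1", "r2", "r3"},
    "B": {"r2", "r3", "r4"},
    "C": {"r5"},
}

# Compare the citations of every article with those of every other article.
# At 25 million citations this pairwise pass is what demands cluster time.
similarities = {
    (x, y): jaccard(citations[x], citations[y])
    for x, y in combinations(sorted(citations), 2)
}
```

Because the number of pairs grows quadratically with the number of articles, a database of this size quickly outstrips a single workstation, which is where a uniform cluster like Teraport pays off.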

In work that requires even more computing power, Evans also analyzes the relationships between authors and organizations producing these documents, and the words within them, to identify the scientific subfields they address.
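One simple way to relate documents, their producers, and their vocabulary is to count which terms co-occur in the same documents; dense clusters of co-occurring terms hint at a subfield. This sketch is a hypothetical illustration of that idea, not the analysis Evans actually runs:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: (authors, words) for each document
docs = [
    ({"evans"}, {"network", "citation"}),
    ({"evans", "smith"}, {"citation", "biotech"}),
]

# Count how often each pair of words appears in the same document
pair_counts = Counter()
for _, words in docs:
    for w1, w2 in combinations(sorted(words), 2):
        pair_counts[(w1, w2)] += 1
```

Real subfield detection would cluster these co-occurrence counts (or fit a topic model) across the full corpus, which is the computationally heavy step.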

Distributed computing was possible at his doctoral institution, Evans said, "but the computers on the network were of different sizes, had slightly different software and sometimes different operating systems."

Teraport offers a uniform operating system and software that, combined with other features, has in some cases saved Evans months of computing time, he said.

The ARTFL Project teamed up with Alexander Street Press, drawing on Teraport's computing power, to create a database including playbills and posters of more than 1,200 works written by black playwrights between 1850 and 2000. This poster promotes the premiere of Ed Bullins' Duplex, performed in Harlem in 1970.
Image courtesy of ARTFL and the Hatch-Billops Collection.

Ten million digital books

The growing volume of digitized books and texts poses massive analytical challenges and opportunities, and the University of Chicago has joined a consortium of twelve universities working to digitize up to ten million books as part of the Google Book Search Project.

"In digital humanities we will be facing massive amounts of textual material in the next three or four years," said Mark Olsen, Assistant Director of the Project for American and French Research on the Treasury of the French Language.

"There are a number of teams, including the ARTFL Project, which are ramping up to adopt machine-learning technologies on how to handle a million books."

Olsen said that although the amount of computer power required by ARTFL's projects is probably tiny compared to projects from the sciences, it is nevertheless critical to have access to this power.

"Even small tests on our highest-power machines would take 15 or 20 hours to run. These kinds of runs are much faster on the Teraport," Olsen said. "It extends our capabilities quite a bit."

The Teraport cluster is part of the Open Science Grid and is a project of the Computation Institute, a joint entity of the University of Chicago and Argonne National Laboratory.

- Steve Koppes, University of Chicago