Douglas Duhaime says reading a book feels a bit like spelunking. Not so much in the dangling-at-the-end-of-a-rope-down-a-dark-cavern sense - more in the searching-vast-spaces-for-gems sense.
Duhaime, ProQuest text and data mining project manager, is one of a new breed of humanities scholars: a digital humanist. Using vast text repositories like the HathiTrust archive described in our 17 June feature, Duhaime looks across decades and centuries of publications, seeing patterns otherwise invisible to human perception.
"Every once in a while I come across some very great treasure," says Duhaime.
Earlier this year Duhaime won an Advanced Collaborative Support grant from the HathiTrust Research Center (HTRC) to use machine-learning techniques to research the writing of Oliver Goldsmith, a notorious plagiarist. Duhaime's larger project focuses on understanding the nature of creativity and the arc of literary influence.
By mining big text archives like the HathiTrust, Duhaime tracks textual reuse at scale and traces patterns in the transmission of ideas and language. Another term for this is plagiarism, and even in the 17th and 18th centuries - Duhaime's focus - there was a stigma attached to the unjustified reuse of existing ideas and expression.
"Legally speaking, we hadn't codified our notion of copyright until roughly 1710, with the Statute of Anne in England. It took a little while to spread to other countries, but even in England it took until the Donaldson v Beckett case to really enforce the rules created by the Act."
A strict definition of plagiarism is probably impossible to draw, Duhaime reasons. "Plagiarism is the co-occurrence of statistically improbable phrases or the statistically improbable co-occurrence of expressions. The more unique the expression and the smaller number of occurrences of that expression, the greater the probability of some kind of reuse of interest."
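To make that definition concrete, here is a minimal Python sketch of the general idea: split each text into overlapping word sequences, then keep only the sequences shared by a small number of documents. It is an illustration under simplifying assumptions, not Duhaime's actual method. The function names, the four-word window and the `max_docs` cut-off are all hypothetical, and counting documents is only a crude stand-in for statistical improbability.

```python
import re
from collections import defaultdict


def ngrams(text, n=4):
    """Return the set of overlapping n-word phrases in a text."""
    # Lowercase and keep only runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def rare_shared_phrases(corpus, n=4, max_docs=2):
    """Find phrases shared by at least two, but at most max_docs, documents.

    corpus maps a document name to its text. The fewer documents a phrase
    appears in, the less likely the overlap is down to common usage.
    (Hypothetical helper; the thresholds are illustrative assumptions.)
    """
    index = defaultdict(set)
    for name, text in corpus.items():
        for gram in ngrams(text, n):
            index[gram].add(name)
    return {
        gram: sorted(docs)
        for gram, docs in index.items()
        if 2 <= len(docs) <= max_docs
    }


corpus = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "She said the quick brown fox jumps high",
    "c": "Nothing in common here at all really",
}
shared = rare_shared_phrases(corpus)
for phrase, docs in sorted(shared.items()):
    print(phrase, docs)
```

On a real archive the same index would be far too large for one machine's memory, which is one reason work at this scale is run on a cluster.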
Duhaime has trained Notre Dame's DACCS supercomputer to search repositories across centuries of texts, essentially 'fingerprinting' ideas to trace unjustified reuse. DACCS is a Dell cluster with 54 nodes, each with 64 cores and 128 GB of memory. Since there are roughly 350,000 publications from the 18th century alone, the kind of text mining Duhaime does simply wouldn't be possible without 21st-century technology.
This sort of comparative analysis can help identify reused expressions and ideas not caught by commercial packages like Turnitin.com, which only track web-archived texts. Plagiarism reaches beyond academia, however. Software creators also want to know if subsets of their code are being sampled illegitimately.
Even if you're not interested in catching cheats or tracking reuse across time, Duhaime suspects you'll find merit in the research he's conducting. "Drilling down more deeply into this period and the foundation of copyright - a time when they were creating the rules around which creative expressions can be circulated - can tell us a lot about who we are and the history of creativity that's led us to where we are presently."