
Text mining reveals a new story of individualism

Douglas Duhaime, ProQuest text and data mining project manager. Courtesy Douglas Duhaime.

Douglas Duhaime says reading a book feels a bit like spelunking. Not so much in the sense of dangling at the end of a rope down a dark cavern - more in the sense of searching vast spaces for gems.

Duhaime, ProQuest text and data mining project manager, belongs to a new breed of humanities scholar: the digital humanist. Using vast text repositories like the HathiTrust archive described in our 17 June feature, Duhaime looks across decades and centuries of publications, seeing patterns otherwise invisible to human perception.

"Every once in a while I come across some very great treasure," says Duhaime.

Earlier this year Duhaime won an Advanced Collaborative Support grant from the HathiTrust Research Center (HTRC) to use machine-learning techniques to research the writing of Oliver Goldsmith, a notorious plagiarist. Duhaime's larger project focuses on understanding the nature of creativity and the arc of literary influence.

By mining big text archives like the HathiTrust, Duhaime tracks textual reuse at scale and traces patterns in the transmission of ideas and language. When that reuse is unjustified, another term for it is plagiarism, and even in the 17th and 18th centuries - Duhaime's focus - a stigma was attached to the unjustified reuse of existing ideas and expression.

"Legally speaking, we hadn't codified our notion of copyright until roughly 1710, with the Statute of Anne in England. It took a little while to spread to other countries, but even in England it took until the Donaldson v Beckett case to really enforce the rules created by the Act."

A strict definition of plagiarism is probably impossible to draw, Duhaime reasons. "Plagiarism is the co-occurrence of statistically improbable phrases or the statistically improbable co-occurrence of expressions. The more unique the expression and the smaller number of occurrences of that expression, the greater the probability of some kind of reuse of interest."
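Duhaime's working definition can be illustrated with a minimal sketch. It is not his actual method: the function names, the three-word phrase length, and the cutoff for what counts as rare are all assumptions chosen for illustration. The idea is to flag phrases that turn up in more than one text but in very few texts overall.

```python
import re
from collections import defaultdict


def ngrams(text, n=3):
    """Split text into lowercase words and return its n-word phrases."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def rare_shared_phrases(docs, n=3, max_docs=2):
    """Return phrases found in at least 2 and at most max_docs documents.

    docs maps a (hypothetical) document name to its text. The fewer
    documents share a phrase, the more likely the overlap is reuse
    rather than common usage.
    """
    seen_in = defaultdict(set)
    for name, text in docs.items():
        for gram in set(ngrams(text, n)):
            seen_in[gram].add(name)
    return {
        " ".join(gram): sorted(names)
        for gram, names in seen_in.items()
        if 2 <= len(names) <= max_docs
    }
```

Calling `rare_shared_phrases` on a dictionary of texts returns each shared phrase with the names of the texts that contain it. A real system would also need to weigh how unusual each phrase is across the whole archive, which this sketch reduces to a simple document count.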

Duhaime has trained Notre Dame's DACCS supercomputer to search repositories across centuries of texts, essentially 'fingerprinting' ideas to trace unjustified reuse. DACCS is a Dell cluster with 54 nodes, each with 64 cores and 128 GB of memory. Since there are roughly 350,000 publications from the 18th century alone, the kind of text mining Duhaime does simply wouldn't be possible without 21st-century technology.
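A quick back-of-the-envelope calculation suggests why a cluster is needed. The hardware figures and the publication count come from the article, with the memory figure read as gigabytes. The assumption that every publication is compared against every other one is ours, made only to show the scale.

```python
from math import comb

NODES = 54
CORES_PER_NODE = 64
MEMORY_GB_PER_NODE = 128      # assumes the article's figure means gigabytes
PUBLICATIONS = 350_000        # approximate 18th-century count from the article

total_cores = NODES * CORES_PER_NODE          # 3,456 cores
total_memory_gb = NODES * MEMORY_GB_PER_NODE  # 6,912 GB

# Assumption: a naive all-against-all comparison of publications.
pairs = comb(PUBLICATIONS, 2)                 # 61,249,825,000 pairs

print(f"{total_cores:,} cores, {total_memory_gb:,} GB of memory")
print(f"{pairs:,} publication pairs to compare")
```

Even under this naive assumption, the comparison runs to tens of billions of document pairs, far beyond what a reader could do by hand.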

This sort of comparative analysis can help identify reused expressions and ideas not caught by commercial packages that only track web-archived texts. Plagiarism reaches beyond academia, however. Software creators also want to know if portions of their code are being copied illegitimately.

Even if you're not interested in catching cheats or tracking reuse across time, Duhaime suspects you'll find merit in the research he's conducting. "Drilling down more deeply into this period and the foundation of copyright - a time when they were creating the rules around which creative expressions can be circulated - can tell us a lot about who we are and the history of creativity that's led us to where we are presently."

--Lance Farrell


Copyright © 2015 Science Node ™
