Culturomics: what can you learn from five million books?

Irregular verbs
A slide from the TED talk by Erez Lieberman Aiden and Jean-Baptiste Michel at Google. Image courtesy TED

In January 2011, programmers at Google published a paper in the journal Science, following the digitization of about five million books - or about 4% of all the books ever printed. The paper, called "Quantitative Analysis of Culture Using Millions of Digitized Books", shows some interesting phenomena in the English language between 1800 and 2000.

In case you missed the original paper, two of the authors, Erez Lieberman Aiden and Jean-Baptiste Michel, have given a humorous TED talk, released last week, so you can watch them explain it themselves. Their results reveal a lot about us.

One thing the authors looked at is the evolution of words, such as the frequency of the past participles of verbs. As shown in the image, it's clear that over the last 200 years, 'thrived' has become more popular than 'throve' as the past participle of 'to thrive'.

They also looked at the effect of censorship on the censored author's 'fame', measured by number of mentions. The technique is powerful enough that they can predict whether someone was censored based on the pattern of their mentions. And, perhaps more interestingly, they can also see the effect of propaganda, which shows up as the reverse pattern.

The frequency of usage for grid computing (blue) versus volunteer computing (red) between 1990 and 2000. Vertical segments represent 0.00000002% of all terms in books printed in each year. Image courtesy of Google Ngram Viewer.

The analysis all relies on something called the n-gram, which is a sequence of n consecutive words. For example, 'high performance computing' would be a 3-gram.
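To make the idea concrete, here is a minimal Python sketch of how n-grams could be pulled out of a piece of text. The function name `ngrams` and the simple lowercase-and-split-on-spaces tokenization are assumptions made for illustration; the paper's own text processing is likely more involved.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples.

    Tokenization here is a simplifying assumption: lowercase the text
    and split on whitespace, with no special handling of punctuation.
    """
    words = text.lower().split()
    # Slide a window of width n across the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("high performance computing", 3))
print(ngrams("high performance computing", 2))
```

The three-word phrase contains exactly one 3-gram (the phrase itself) and two overlapping 2-grams.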

Google recently made its Ngram Viewer publicly available, so you can search it yourself. For example, if you wanted to compare the frequency of usage of the terms grid computing (blue) and volunteer computing (red), you would end up with the graph shown here.
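The kind of curve the viewer draws can be approximated with a short Python sketch: for each year, count how often a phrase occurs and divide by the total number of n-grams of the same length printed that year. Everything here is hypothetical, including the function name `yearly_frequency`, the choice of denominator, and the three tiny made-up "books"; it is not the viewer's actual method or data.

```python
from collections import Counter


def yearly_frequency(phrase, books):
    """Return {year: share of that year's n-grams equal to phrase}.

    books is an iterable of (year, text) pairs. Dividing by the count of
    same-length n-grams per year is an assumption made for this sketch.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    hits, totals = Counter(), Counter()
    for year, text in books:
        words = text.lower().split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    # Skip years with no n-grams to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Made-up toy corpus, for illustration only.
toy_books = [
    (1995, "grid computing is new"),
    (1996, "volunteer computing and grid computing"),
    (1996, "grid computing grows"),
]

print(yearly_frequency("grid computing", toy_books))
print(yearly_frequency("volunteer computing", toy_books))
```

Normalizing by the yearly total matters because the number of books printed changes from year to year, so raw counts alone would mostly track the size of the corpus.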

The graph shows that there was some usage of the term 'grid computing' in the early 1990s, possibly when it was being used as a metaphor among experts. Not until 1996 did it become more popular.

The term 'volunteer computing' came into existence with force in 1995, which corresponds with the conception of volunteer computing projects - the first project, which searches for prime numbers, was released in January 1996.

Unfortunately, the Ngram Viewer only features books printed before 2001. So, we will have to wait for the next iteration to discover exactly what happens to the upward trend seen at the end of the 1990s.

