US Library of Congress faces big data challenge for Twitter archive

"As society turns to social media as a primary method of
communication and creative expression, social media is
supplementing, and in some cases supplanting, letters,
journals, serial publications and other sources routinely
collected by research libraries," says the US Library of
Congress's director of communications, Gayle Osterberg.
Image courtesy Clarissa Peterson, Flickr.

Twitter's not just about finding out what Lady Gaga had for breakfast or sharing amusing photos of cats, you know. It's also about getting into arguments with strangers and watching sports stars make fools of themselves.

More importantly (and more seriously) though, it's an utterly huge repository of social data that researchers are just itching to get their hands on. From mapping political trends to analyzing social interactions and examining the way memes spread across the globe, the 500-million-user-strong microblogging site is a potential treasure trove of information.

To this end, the US Library of Congress signed a deal in April 2010 to create a complete archive of the Twitterverse. And, last week, the Library published a report on the archive. According to the report, the Library has an archive of around 170 billion tweets, stretching back to Twitter's launch in 2006. Each tweet has over 50 accompanying metadata fields and the archive now totals over 130 terabytes in size. It's expanding rapidly too, with new tweets now being added on an hourly basis with the help of a social media aggregation company. This means that the archive is growing at a staggering rate of over half a billion new tweets each day, up from around 'just' 140 million tweets a day in early 2011.
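For a rough sense of scale, the figures quoted above can be turned into a back-of-the-envelope estimate. The short Python sketch below is illustrative only and is not taken from the Library's report: it assumes a terabyte is 10^12 bytes, treats the rounded numbers from the article as exact, and assumes new tweets take up about as much space as the existing ones on average. The function and variable names are made up for this example.

```python
# Rounded figures quoted in the article; results are rough estimates only.
ARCHIVE_TWEETS = 170e9   # around 170 billion tweets
ARCHIVE_BYTES = 130e12   # over 130 terabytes, assuming 1 TB = 10**12 bytes
TWEETS_PER_DAY = 0.5e9   # over half a billion new tweets each day


def average_bytes_per_tweet(total_bytes: float, total_tweets: float) -> float:
    """Average stored size of one tweet, metadata included."""
    return total_bytes / total_tweets


def daily_growth_bytes(tweets_per_day: float, bytes_per_tweet: float) -> float:
    """Implied daily growth, assuming new tweets match the average size."""
    return tweets_per_day * bytes_per_tweet


avg = average_bytes_per_tweet(ARCHIVE_BYTES, ARCHIVE_TWEETS)
growth = daily_growth_bytes(TWEETS_PER_DAY, avg)

print(f"Average size per tweet: {avg:.0f} bytes")
print(f"Implied growth: {growth / 1e9:.0f} GB per day")
```

Under these assumptions, each tweet and its metadata works out to roughly 765 bytes, and the archive would be growing by something in the region of 380 gigabytes a day.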

However, storing all of this data is not the real problem for the Library of Congress. Rather, making the archive easily searchable is proving tricky. A single search of just one eighth of the entire archive can currently take up to 24 hours. Yet even once this searchability issue has been cracked, not everyone will be able to poke around the Twitter archive. Under the terms of its agreement, the Library will make the archive available only to "bona fide" researchers. So far, over 400 applications have been made to access the archive, including projects looking at vaccination rates, citizen journalism, and the stock market. As yet, though, there have apparently been no applications to answer that most perplexing of Twitter mysteries: namely, how on Earth does Justin Bieber have 32 million followers?

Find out more about the big data challenges the Twitter archive poses by reading the Library of Congress's report in full.

- Andrew Purcell
