
Mining the news for data

Speed read
  • Political scientists mine news stories for information about global events
  • TERRIER dataset extracts event data from 300 million news articles
  • Easy access to text-derived event data can provide early warnings of civil conflict

Journalists provide an invaluable service, sharing information about global events to which many of us would not otherwise have access. They send missives directly from event sites, recording what’s happening during protests, summits, speeches, and violent actions.

<strong>News you can use.</strong> The TERRIER dataset extracts event data from around 300 million news articles and puts it into a form researchers can use. Courtesy Paul Farmer. <a href='https://creativecommons.org/licenses/by-sa/2.0/'>(CC BY-SA 2.0)</a>

For political scientists, these articles offer a rich mine of data. Jill Irvine, Presidential Professor of International and Area Studies at the University of Oklahoma (OU); Christan Grant, OU assistant professor of computer science; Andrew Halterman, a political science PhD candidate at MIT; and the rest of their team want to make this data accessible.

Enter the Temporally Extended, Regular, Reproducible International Event Records (TERRIER) dataset, which extracts event data from roughly 300 million news articles and puts it into a form researchers can use.

In determining the kinds of actions that qualify as an event, the team uses a framework established within political science, in which an event consists of an actor, an action, and a target. The process of coding events consists of three steps.

<strong>Coded actions.</strong> Software searches news articles and finds words that appear to belong to actions or actors. Those terms are then assigned to known actor categories. For example, during the span of his presidency, the term ‘George W. Bush’ would be assigned the category of ‘USA government.’ Courtesy US Air Force.

First, software searches every sentence in the corpus of text and parses the grammar. A second piece of software finds “candidate spans” (i.e., words that look like they belong to an actor or action), then checks a custom dictionary to see if they match with a known actor.

If a match occurs, the software assigns each actor or action to a predefined category, such as “military” actors and “protest” actions. For instance, the term “George W. Bush,” during the timespan of his presidency, would result in the actor category of “USA government.” Importantly, the dictionaries are coded to account for different ways people might speak of actions such as protests through words like “demonstrated,” “chanted,” or “carried placards.”
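The dictionary lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not the TERRIER team's software: the dictionary contents, the `ActorEntry` type, and the function names `code_actor` and `code_action` are all hypothetical, chosen only to show how a time-bounded actor entry and a set of action synonyms might map to categories.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActorEntry:
    """One dictionary entry: a category that applies only within a date range."""
    category: str
    start: date
    end: date


# Hypothetical actor dictionary (illustration only). Keys are lowercased
# candidate spans; each span can carry several date-bounded categories.
ACTOR_DICTIONARY = {
    "george w. bush": [
        ActorEntry("USA government", date(2001, 1, 20), date(2009, 1, 20)),
    ],
}

# Hypothetical action dictionary: different phrasings map to one category.
ACTION_DICTIONARY = {
    "demonstrated": "protest",
    "chanted": "protest",
    "carried placards": "protest",
}


def code_actor(span: str, published: date) -> Optional[str]:
    """Return the actor category valid on the article's date, or None."""
    for entry in ACTOR_DICTIONARY.get(span.lower(), []):
        if entry.start <= published <= entry.end:
            return entry.category
    return None


def code_action(span: str) -> Optional[str]:
    """Return the action category for a candidate span, or None."""
    return ACTION_DICTIONARY.get(span.lower())
```

In this sketch, `code_actor("George W. Bush", date(2005, 6, 1))` returns the category “USA government,” while the same span in an article dated after the presidency returns no match. A real coder would also need the parsed grammar to decide which actor is the source of an action and which is the target.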

The project not only produces data for researchers, but also aims to improve the tools available to other researchers to generate their own datasets. To accomplish this, Grant built several open-source tools to speed the natural language processing (NLP) of the large corpus of documents.

NLP is the first and most time-intensive step of the event coding process, but without the grammatical information it provides, the event coder can't process the sentences. Once that step is complete, the events still need to be extracted and geolocated, both of which are also time-intensive tasks. The scale of the news corpus, hundreds of millions of documents, complicated the effort.

According to Grant, the initial projected timeline for extraction and geolocation on a single machine was many years; thus, it became clear that the team needed more resources. So they called on Jetstream, a cloud-based, on-demand computing and data analysis resource, which gave the team the large-scale computation and storage capabilities it needed in order to expedite the process.

<strong>Conflict warning.</strong> The data tool developed by the TERRIER team can be used to gain a better sense of the causes and dynamics related to conflict and levels of violence and may even someday be helpful in providing early warning of civil conflict. Courtesy Diariocritico de Venezuela. <a href='https://creativecommons.org/licenses/by-sa/2.0/'>(CC BY-SA 2.0)</a>

Grant built a distributed container system, and Jetstream provided the storage and structure to launch the pipeline and process the many documents. Grant speaks highly of the Jetstream team, calling them “responsive, helpful, and accommodating to researchers,” and a “lifesaver” for the project.

The TERRIER team’s work will serve future researchers in at least two ways. The complete event coding pipeline software is available to other researchers in NLP and political science, and the dataset is available to political scientists.

Thus far, the data tool has been used to gain a better sense of the causes and dynamics related to conflict and levels of violence. Now that the process works, the team is excited to see what it will look like in application.

The team is currently working on a paper investigating the relationship between the locations of the 2011 protests in Syria and where the government exerted violence once the civil war began.

Researchers can potentially use protests reported in news media as a way of tracking prewar anti-regime mobilization and thereby measuring how mobilization affects later violence. More broadly, the data will contribute to the growing set of applied models using text-derived event data to provide early warning of civil conflict.


Copyright © 2023 Science Node ™
