
Textual analysis in parallel


A great deal of knowledge is embedded in written prose. Although social science researchers sometimes extract information from text manually, the process is far too time-consuming to be practical on any meaningful scale.

That's where tools like 'latent semantic analysis' come into the picture. LSA is a method that analyzes the relationships between a set of documents and the words contained in those documents. It can be used to analyze data, compare documents, find similar documents across languages, summarize or grade essays, and serve as a component of decision-making applications.
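As a rough illustration of the idea (not of ParaText itself), the sketch below uses scikit-learn to build a document-term matrix, project it into a small concept space with a truncated SVD, and compare documents by cosine similarity in that space. The corpus, component count, and library choice are all assumptions made for demonstration.

```python
# Minimal LSA sketch with scikit-learn (illustrative only, not ParaText):
# rows of the matrix are documents, columns are terms; a truncated SVD
# projects documents into a low-dimensional "concept" space, where cosine
# similarity measures how related two documents are.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The cluster schedules parallel jobs across compute nodes.",
    "Parallel computing distributes work over many processors.",
    "The recipe calls for flour, butter, and a pinch of salt.",
]

# Document-term matrix (weighted here with tf-idf rather than raw counts).
tfidf = TfidfVectorizer(stop_words="english")
doc_term = tfidf.fit_transform(documents)

# Truncated SVD maps each document into a small number of latent concepts.
svd = TruncatedSVD(n_components=2, random_state=0)
concept_space = svd.fit_transform(doc_term)

# The two computing-related sentences should land close together in concept
# space, while the cooking sentence sits apart from both.
print(cosine_similarity(concept_space))
```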

Like many intelligent algorithms, LSA requires a training phase. Training consists of mapping thousands of words (terms) contained in thousands of written passages into a concept space, also known as a term-concept matrix. Performed on a typical compute cluster, a 500-million-word training phase takes less than a day to complete.
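To make the term-concept matrix concrete, here is a toy sketch of the factorization that training performs, using standard NumPy and made-up counts rather than ParaText: a term-by-passage count matrix is decomposed with a rank-k SVD, and the scaled left singular vectors give each term its coordinates in concept space.

```python
# Toy illustration of what "training" produces: from a term-by-passage count
# matrix A, a rank-k SVD A ~ U_k * S_k * V_k^T yields a term-concept matrix
# (rows of U_k scaled by S_k) placing every term in a k-dimensional space.
import numpy as np

# Rows = terms, columns = passages; entries = how often each term appears.
terms = ["parallel", "cluster", "processor", "flour", "butter"]
A = np.array([
    [3, 2, 0],   # "parallel"
    [1, 2, 0],   # "cluster"
    [2, 1, 0],   # "processor"
    [0, 0, 2],   # "flour"
    [0, 0, 1],   # "butter"
], dtype=float)

k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
term_concept = U[:, :k] * S[:k]       # each row: a term's concept coordinates
doc_concept = Vt[:k, :].T * S[:k]     # each row: a passage's concept coordinates

for term, coords in zip(terms, term_concept):
    print(f"{term:>10}: {coords.round(2)}")
```

The computing terms end up near one another in concept space and far from the cooking terms, which is the structure LSA exploits when comparing or retrieving documents.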

"The processors themselves could be a single workstation, a commodity cluster of linux machines, or a large supercomputer," said Tim Shead, a software architect at Sandia National Laboratories.

Shead is working with colleague Daniel Dunlavy on a set of software components called ParaText.

"The goal was to see if we could implement a parallel distributed version of latent semantic analysis," Shead explained.

"Although other packages exist that can do latent semantic analysis in parallel distributed memory architectures, we hoped to minimize communication between compute nodes and apply some of the lessons we've learned from working on large-scale visualizations," Dunlavy said.

Another goal, cited on the project's website, is to create a scalable solution that can accommodate the growing quantities of data researchers regularly handle.

ParaText distributes a different subset of documents to each processor, which in turn analyzes that subset. Because of the effort to minimize communication and keep ParaText scalable, the result is a tool that can run in a variety of environments, including on a grid or cloud. It can be embedded in any application through its native C++ API, Python, or Java; run as a standalone MPI executable from the command line; or deployed as a web service using a RESTful API.
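The following sketch shows the general data-parallel pattern described above, using mpi4py rather than ParaText's actual API; the document list, partitioning scheme, and final gather are illustrative assumptions. Each rank works only on its own subset of documents and communicates just once to merge the partial results, which mirrors the goal of keeping inter-node communication to a minimum.

```python
# Generic data-parallel document processing with MPI (not ParaText's API):
# every rank gets a different subset of the documents, does its counting
# locally, and a single gather merges the partial results on rank 0.
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Hypothetical corpus; in practice each rank would read its own shard.
documents = [
    "parallel text analysis on a cluster",
    "latent semantic analysis of large corpora",
    "minimizing communication between compute nodes",
    "scalable tools for growing data volumes",
]

# Round-robin partition: rank r handles documents r, r+size, r+2*size, ...
my_documents = documents[rank::size]

# Purely local work: count terms in this rank's subset of documents.
local_counts = Counter()
for doc in my_documents:
    local_counts.update(doc.split())

# One communication step: gather the partial counts and merge them on rank 0.
all_counts = comm.gather(local_counts, root=0)
if rank == 0:
    merged = Counter()
    for counts in all_counts:
        merged.update(counts)
    print(merged.most_common(5))
```

Run under MPI, for example with "mpiexec -n 4 python example.py", each rank counts independently and only the small per-rank summaries travel over the network.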

ParaText was also designed to function as part of a larger package called the Titan Informatics Toolkit.

"In the process of running our experiments we created a set of Titan components that are useful for a variety of different analysis tasks," Shead said, adding, "because Titan is open source we benefited from a number of the features that Titan provides, and we were able to contribute new features in turn."
