A great deal of knowledge is embedded in written prose. Although social science researchers sometimes extract information from text manually, the process is far too time-consuming to be practical on any meaningful scale.
That's where tools like latent semantic analysis (LSA) come into the picture. LSA is a method that analyzes the relationships between a set of documents and the words those documents contain. It can be used to analyze data, compare documents, find similar documents across languages, summarize or grade essays, and support decision-making applications.
Like many intelligent algorithms, LSA requires a training phase. Training consists of mapping thousands of words (terms) contained in thousands of written passages into a concept space, represented by a term-concept matrix. Performed on a typical compute cluster, a 500-million-word training phase takes less than a day to complete.
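To make the idea concrete, here is a minimal sketch of classic LSA on a toy corpus: build a term-document count matrix, then use a truncated singular value decomposition to project each document into a low-dimensional concept space. The function names and the toy documents are illustrative assumptions; this is the textbook technique, not ParaText's implementation.

```python
import numpy as np

def lsa_concept_space(docs, k=2):
    """Map each document into a k-dimensional concept space via truncated SVD."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix: rows are terms, columns are documents.
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1
    # Truncated SVD: keep only the k largest singular values and vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Each row of the result is one document's coordinates in concept space.
    return (np.diag(s[:k]) @ Vt[:k]).T

def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["cat sat mat", "cat sat", "dog barked", "dog ran"]
X = lsa_concept_space(docs, k=2)
# The two cat documents land close together in concept space,
# while cat and dog documents end up nearly orthogonal.
```

Documents that share vocabulary end up near each other in the concept space, which is what makes LSA useful for comparing and clustering texts.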
"The processors themselves could be a single workstation, a commodity cluster of linux machines, or a large supercomputer," said Tim Shead, a software architect at Sandia National Laboratories.
Shead is working with colleague Daniel Dunlavy on a set of software components called ParaText.
"The goal was to see if we could implement a parallel distributed version of latent semantic analysis," Shead explained.
"Although other packages exist that can do latent semantic analysis in parallel distributed memory architectures, we hoped to minimize communication between compute nodes and apply some of the lessons we've learned from working on large-scale visualizations," Dunlavy said.
Another goal cited by their website is to create a scalable solution that can accommodate the growing quantities of data researchers regularly handle.
ParaText distributes a different subset of documents to each processor, which in turn analyzes that subset. Because of their efforts to minimize communication and make ParaText scalable, the result is a tool that can run in a variety of environments, including on a grid or in a cloud. It can be embedded in any application through a native C++ API, Python, or Java; run from the command line as a standalone MPI executable; or deployed as a web service using a RESTful API.
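The distribution strategy described above can be sketched in a few lines: split the corpus into disjoint subsets, let each worker analyze its own subset locally, and combine the results in a single reduction step. The function names and round-robin scheme here are assumptions for illustration only, not ParaText's API; they simply show why per-subset analysis keeps inter-node communication to a minimum.

```python
from collections import Counter

def partition(docs, n_workers):
    # Round-robin split: each worker receives a disjoint subset of documents.
    return [docs[i::n_workers] for i in range(n_workers)]

def local_term_counts(subset):
    # Each worker tallies term frequencies for its own subset,
    # with no communication to other workers.
    counts = Counter()
    for doc in subset:
        counts.update(doc.split())
    return counts

def merge(partials):
    # A single reduction step combines the per-worker results.
    total = Counter()
    for c in partials:
        total.update(c)
    return total
```

Because each subset is processed independently, the only communication is the final merge, which is the kind of pattern that lets a tool scale from a workstation to a cluster, grid, or cloud.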
ParaText was also designed to function as part of a larger package called the Titan Informatics Toolkit.
"In the process of running our experiments we created a set of Titan components that are useful for a variety of different analysis tasks," Shead said, adding, "because Titan is open source we benefited from a number of the features that Titan provides, and we were able to contribute new features in turn."