Our ability to distinguish between multiple word meanings is rooted in a lifetime of experience. We can quickly differentiate between the 'charge' of a battery and criminal 'charges'. Using context, an intrinsic understanding of syntax and logic, and a sense of the speaker's intention, we discern what another person is telling us.
For computers, this process is not so simple. For more than 50 years, linguists and computer scientists have tried to get computers to understand human language using semantics software. This work, driven initially by attempts to translate Russian scientific texts during the Cold War, has met with mixed success.
"In the past, people have tried to hand code all of this knowledge," explains Katrin Erk, a professor of linguistics (specializing in lexical semantics) at The University of Texas at Austin, US. "I think it's fair to say that this hasn't been successful. There are just too many little things that humans know."
Watching annotators struggle to make sense of conflicting definitions led Erk to try something new. Instead of hard-coding human logic or deciphering dictionaries, her approach mines vast bodies of text (which are a reflection of human knowledge) and uses the implicit connections between words to create a weighted map of relationships.
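As a rough illustration of what mining text for implicit connections can look like, the sketch below counts how often pairs of words appear near each other in a tiny corpus. The function name `cooccurrence_map`, the three-token window, and the example sentences are assumptions made for illustration; they are not taken from Erk's system, which works on collections of hundreds of millions of words.

```python
from collections import Counter, defaultdict


def cooccurrence_map(sentences, window=3):
    """Count how often each word appears within `window` tokens of another.

    Returns a nested mapping: weights[word][neighbor] -> count.
    """
    weights = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    weights[word][tokens[j]] += 1
    return weights


# Toy corpus (invented for this example).
corpus = [
    "the battery holds a charge",
    "police filed a criminal charge",
    "the charge drained the battery",
]

weights = cooccurrence_map(corpus)
print(weights["charge"].most_common(5))
```

Each row of the resulting map can be read as a crude weighted profile of a word: the more often two words share a neighborhood, the heavier the link between them.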
"An intuition for me was that you could visualize the different meanings of a word as points in space," she said. "You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations ('the newspaper published charges...'). The meaning of a word in a particular context is a point in this space. Then we don't have to say how many senses a word has. Instead we say: 'This use of the word is close to this usage in another sentence, but far away from the third use.'"
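The idea of usages as points that lie close together or far apart can be made concrete with a small sketch. The three-dimensional vectors below are invented for illustration (a real model would derive them from text and use thousands of dimensions), and cosine similarity is one common closeness measure, assumed here rather than confirmed as the one Erk uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical usage vectors; the dimensions loosely stand for
# (electricity, law, publishing). The numbers are made up.
usages = {
    "battery charge": [0.9, 0.1, 0.0],
    "criminal charges": [0.0, 0.9, 0.2],
    "published charges": [0.0, 0.7, 0.6],
}

close = cosine_similarity(usages["criminal charges"], usages["published charges"])
far = cosine_similarity(usages["criminal charges"], usages["battery charge"])
print(f"criminal vs. published: {close:.2f}")
print(f"criminal vs. battery:   {far:.2f}")
```

No list of senses appears anywhere in the sketch; the only question asked is how near one usage lies to another.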
By considering words in a relational, non-fixed way, Erk's research draws on emerging ideas in psychology about how the mind deals with language and concepts in general. Rather than having rigid definitions, concepts have unclear boundaries, and the meaning, value, and limits of an idea can vary considerably according to the context or conditions.
Accurately modeling the intuitive ability to distinguish word meaning requires a lot of text and a lot of analytical horsepower. "The lower end for this kind of research is a text collection of 100 million words," says Erk. "If you can give me a few billion words, I'd be much happier."
The Longhorn visualization cluster at the Texas Advanced Computing Center (TACC) enables Erk and her collaborators to expand the scope of their research. Computational models that take weeks to run on a desktop computer can run in hours on Longhorn. "With Hadoop on Longhorn, we could do language processing much faster. That enabled us to use larger amounts of data and develop better models."
"We use a gigantic 10,000-dimensional space with different points for each word to predict paraphrases," Erk explains. "If I give you a sentence such as, 'This is a bright child,' the model can tell you automatically what are good paraphrases ('an intelligent child') and what are bad paraphrases ('a glaring child'). This is quite useful in language technology."
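A toy version of this paraphrase ranking might look like the sketch below. It uses two invented dimensions instead of 10,000, made-up vectors, and a simple composition rule (multiplying the word vector by the context vector element by element) that is assumed for illustration; it is not presented as the model Erk's group actually uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_paraphrases(word_vec, context_vec, candidates):
    """Rank candidate paraphrases by closeness to the word as used in context.

    The in-context vector is the element-wise product of the word vector and
    the context vector (an assumed, deliberately simple composition rule).
    """
    in_context = [w * c for w, c in zip(word_vec, context_vec)]
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(in_context, candidates[name]),
        reverse=True,
    )


# Hypothetical vectors; the dimensions loosely stand for (intellect, light).
bright = [0.7, 0.7]   # ambiguous between the two readings
child = [0.9, 0.1]    # context that favors the "intellect" reading
candidates = {
    "intelligent": [0.95, 0.05],
    "glaring": [0.05, 0.95],
}

ranking = rank_paraphrases(bright, child, candidates)
print(ranking)
```

On its own, "bright" sits equally close to both candidates; it is the context word that pulls the usage toward "intelligent" and away from "glaring".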
Erk's paraphrasing research may be critical to automatic information extraction. Say, for instance, you want to extract a list of diseases and their causes, symptoms, and cures from millions of pages of online medical information: researchers use slightly different formulations when talking about diseases, so having good paraphrases will be key.
Erk and Ray Mooney, a computer science professor also at The University of Texas at Austin, recently received a grant from the Defense Advanced Research Projects Agency (DARPA) to combine Erk's dimensional space representation of word meanings with a method of determining the structure of sentences based on Markov logic networks.
In a paper presented at the Second Joint Conference on Lexical and Computational Semantics ("Montague Meets Markov: Deep Semantics with Probabilistic Logical Form"), Erk, Mooney, and colleagues reported results on a set of challenge problems. In one challenge, the system running on Longhorn was given a sentence and had to infer whether another sentence was true based on the first. Using an ensemble of different sentence parsers, word meaning models, and Markov logic implementations, Mooney and Erk's system predicted the correct answer with 85% accuracy, near the top results in this challenge.
"We want to get to a point where we don't have to learn a computer language to communicate with a computer. We'll just tell it what to do in natural language," Mooney says. "We're still a long way from having a computer that can understand language as well as a human being does, but we've made definite progress toward that goal."