AI revitalizes rare language

Speed read
  • Endangered Seneca language has fewer than 50 fluent speakers
  • Deep-learning speech-recognition application will collect data and transcribe from remaining speakers
  • Model could help preserve and revitalize other rare or vanishing languages

A new research project at Rochester Institute of Technology (RIT) will help ensure the endangered language of the Seneca Indian Nation will be preserved.

Using deep learning, a form of artificial intelligence, RIT researchers are building an automatic speech recognition application to document and transcribe the traditional language of the Seneca people. The work is also intended to be a technological resource to preserve other rare or vanishing languages.

<strong>Speech savers.</strong> Robbie Jimerson, a doctoral student in RIT's computer science program and resident of the Cattaraugus Indian Reservation, is using technology to help future generations study and speak the endangered language of the Seneca Indian Nation. Courtesy RIT.

"The motivation for this is personal. The first step in the preservation and revitalization of our language is documentation of it," said Robert Jimerson (Seneca), a computing and information sciences doctoral student at RIT and member of the research team.

He brought together tribal elders and close friends, all speakers of Seneca, to help produce audio and textual documentation of this Native American language spoken fluently by fewer than 50 individuals.

Like all languages, Seneca has different dialects. It also presents unique challenges because of its complex system for building new words, in which a whole sentence can be expressed in a single word.

Jimerson is able to bridge both the technology and the language.

"Under the hood, it is data. With many Native languages, you don't have that volume of data," he said, explaining that some languages, while spoken, may not have as many formal linguistic tools, such as dictionaries, grammatical materials or extensive classes for non-native speakers, as languages like Spanish or Chinese do.

<strong>Preservation team.</strong> Ray Ptucha (l), computer engineering assistant professor, Robbie Jimerson (m), computer science doctoral student, both from RIT, and Emily Prud’hommeaux (r), assistant professor of computer science at Boston College, are leading the NSF project to use artificial intelligence technology to preserve the Seneca language. Courtesy RIT.

"One of the most expensive and time-consuming processes of documenting language is collecting and transcribing it. We are looking at taking deep networks and maybe changing the architecture, making some synthetic data to create more data, but how do you make this work in deep learning? How do you augment data you already have?"
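The article does not say which augmentation methods the team uses. As a rough illustration only, two techniques common in speech recognition are changing the playback speed of a recording and adding background noise, so that one original clip yields several training examples. The function names, speed factors and noise level below are assumptions made for this sketch, not details from the project.

```python
import math
import random

def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds up (shorter output)."""
    n_out = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out

def add_noise(samples, rng, scale=0.01):
    """Add Gaussian noise to every sample (scale is an illustrative value)."""
    return [s + rng.gauss(0.0, scale) for s in samples]

# A synthetic 0.1-second, 440 Hz tone stands in for a real recording.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(1600)]

rng = random.Random(0)
fast = change_speed(tone, 1.1)   # slightly faster copy
slow = change_speed(tone, 0.9)   # slightly slower copy
noisy = add_noise(tone, rng)     # same length, perturbed values

augmented = [tone, fast, slow, noisy]  # four examples from one clip
```

Each variant keeps the same spoken content, so the original transcript can be reused as its label.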

That process of obtaining data is being coordinated by a wide-ranging team that includes Jimerson; the project principal investigator Emily Prud'hommeaux, assistant professor of computer science at Boston College and research faculty in RIT's College of Liberal Arts; Ray Ptucha, assistant professor of computer engineering in RIT's Kate Gleason College of Engineering and an expert in deep learning systems and technologies; and Karin Michelson, professor of linguistics at the State University of New York at Buffalo.

The research team was awarded $181,682 in funding over four years from the National Science Foundation for "Collaborative Research: Deep learning speech recognition for documenting Seneca and other acutely under-resourced languages."

"This is an exciting project because it brings together people from so many disciplines and backgrounds, from engineering and computer science to linguistics and language pedagogy," said Prud'hommeaux. "In addition to enabling us to develop cutting edge technology, this project supports undergraduate and graduate students and engages members of an indigenous community that few people know is right here in western New York."

<strong>Bilingual stop signs</strong> on the Allegany Indian Reservation in Jimerson Town, NY. The Seneca language presents unique challenges because of its complex system for building new words. Courtesy JMyrleFuller. <a href='https://creativecommons.org/licenses/by-sa/4.0/deed.en'>(CC BY-SA 4.0)</a>

The researchers started the project in late June, bringing together community members and linguists for data collection: acquiring and translating existing and newly made recordings of Seneca conversations, then converting the data into text output using deep learning models.

"What you are really trying to do is find that line between the new data you can get and the changing of the architecture of a network," Jimerson explained.

Since the summer, the team has gathered just over 50 hours of recorded material, with people working full time on the translations. That work includes breaking the language down into individual phonetic symbols and using this information to begin training the models.

"We use a process called transfer learning which starts with a model trained with readily available English speech to get the basic, initial training for the system, then we'll re-train the neural networks and fine tune it toward the Seneca language. We're getting very good results," said Ptucha, who is an expert in deep learning systems and technologies.
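The idea behind transfer learning can be shown with a deliberately tiny model. The sketch below is not the team's system: it fits a straight line rather than a neural network, and the datasets, step counts and learning rate are invented for illustration. It pretrains on plentiful "source" data, then fine-tunes on a handful of "target" examples, and compares that with training on the target examples from scratch for the same number of steps.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(w, b, data, steps, lr=0.05):
    """Plain gradient descent on mean squared error."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = random.Random(0)

# Plentiful source data (stands in for English speech): y = 2x + 1.
source = [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(200))]

# Scarce target data (stands in for Seneca speech): a related task, y = 2.2x + 0.8.
target = [(x, 2.2 * x + 0.8) for x in (-0.8, -0.3, 0.1, 0.5, 0.9)]

# Step 1: pretrain on the source task.
pre_w, pre_b = train(0.0, 0.0, source, steps=500)

# Step 2: fine-tune on the target task, starting from the pretrained weights.
ft_w, ft_b = train(pre_w, pre_b, target, steps=10)

# Baseline: the same ten steps on the target task, starting from zero.
sc_w, sc_b = train(0.0, 0.0, target, steps=10)

finetuned_loss = mse(ft_w, ft_b, target)
scratch_loss = mse(sc_w, sc_b, target)
print(finetuned_loss < scratch_loss)
```

Because the source and target tasks are related, the pretrained weights start close to a good answer, so a few updates on scarce data go much further than they would from a blank start.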

Deep learning technology consists of multiple layers of artificial neurons, organized in an increasingly abstract hierarchy. These architectures have produced state-of-the-art results on many types of pattern recognition problems, including image and speech recognition applications.
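To make "multiple layers of artificial neurons" concrete, here is a minimal sketch of how data flows through such a stack. The layer sizes, the random weights and the choice of ReLU as the activation are assumptions for illustration; a real speech recognizer is far larger and its weights are learned from data.

```python
import random

def relu(values):
    """Activation: negative values become zero, positive values pass through."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def make_layer(rng, n_in, n_out):
    """Create a layer with random weights and zero biases (untrained)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layers, features):
    """Pass features through every layer; the last layer has no activation."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)

rng = random.Random(0)
sizes = [13, 8, 8, 4]  # hypothetical: 13 audio features in, 4 output scores
layers = [make_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

frame = [rng.uniform(-1, 1) for _ in range(13)]  # stand-in for one audio frame
scores = forward(layers, frame)                  # one score per output class
```

Each layer works only on the previous layer's output, which is how later layers come to represent more abstract patterns than earlier ones.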

"No one has really tried this before, training an automated speech recognition model on something as resource-constrained as Seneca. Robbie is the expert in transcribing Seneca and training the others on how to do this. He's a pretty rare guy," said Ptucha.

Read the original article on RIT's site.

Copyright © 2018 Science Node ™
