Machine learning helps translate lost languages

Researchers at MIT have created a new system that uses machine learning to help linguists decipher languages that have been lost to time. Research suggests that most languages that have ever existed are no longer spoken, and dozens of these dead languages are considered undeciphered. Linguists don't know enough about their grammar, vocabulary, and syntax to understand the texts left behind.

Linguists face many challenges. Many of these lost languages have no well-researched relative language to compare them against, and some also lack dividers like whitespace and punctuation. The MIT Computer Science and Artificial Intelligence Laboratory recently made a breakthrough in deciphering lost languages.

Researchers created a new system that has been able to automatically decipher a lost language without requiring advance knowledge of its relation to other languages. The system can also determine relationships between languages, and it was recently used to suggest that Iberian is not related to Basque, a connection that some linguists have proposed. The ultimate goal of the scientists on the project is to decipher, using only a few thousand words, languages that have baffled linguists.

Project leader Regina Barzilay says the system relies on several principles based on insights from historical linguistics. Those principles hold that languages generally evolve only in predictable ways. A language rarely adds or deletes an entire sound, but certain sound substitutions are likely to occur. For instance, a word with a "P" in a parent language can change into a "B" in a descendant language, but it's unlikely to change to a "K" because the two sounds are pronounced so differently.
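The "P" to "B" versus "P" to "K" intuition can be illustrated with a toy cost function. This is only a sketch: the feature table, the numeric scale, and the name substitution_cost are assumptions made for illustration, not the features or code used by the MIT system.

```python
# Hypothetical articulatory features for a few consonants (illustrative only):
# (place of articulation on a rough front-to-back scale, voiced?)
FEATURES = {
    "p": (0, False),  # bilabial, voiceless
    "b": (0, True),   # bilabial, voiced
    "t": (1, False),  # alveolar, voiceless
    "d": (1, True),   # alveolar, voiced
    "k": (2, False),  # velar, voiceless
    "g": (2, True),   # velar, voiced
}

VOICING_WEIGHT = 0.5  # assumed: a voicing change is cheaper than moving the place


def substitution_cost(a, b):
    """Return a rough cost for sound `a` turning into sound `b`."""
    place_a, voiced_a = FEATURES[a]
    place_b, voiced_b = FEATURES[b]
    place_gap = abs(place_a - place_b)
    voicing_gap = VOICING_WEIGHT if voiced_a != voiced_b else 0.0
    return place_gap + voicing_gap


if __name__ == "__main__":
    print("p -> b:", substitution_cost("p", "b"))
    print("p -> k:", substitution_cost("p", "k"))
```

Under these assumed features, "P" and "B" differ only in voicing, while "P" and "K" are made at opposite ends of the mouth, so the second substitution costs more.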

Using those linguistic constraints, the researchers at MIT developed a decipherment algorithm able to handle the vast space of possible transformations. The algorithm learns to embed language sounds in a multidimensional space where pronunciation differences are reflected in the distance between corresponding vectors. The model aims to segment words in an ancient language and map them to counterparts in a related language.
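The idea of placing sounds in a space where distance reflects pronunciation difference, then mapping words to their closest counterparts, can be sketched as follows. Everything here is an assumption for illustration: the vectors are hand-picked rather than learned, the words are made up, word segmentation is skipped entirely, and a plain weighted edit distance stands in for the researchers' actual model.

```python
import math

# Hand-picked 2-D vectors standing in for learned sound embeddings (assumption).
# Similar-sounding consonants sit close together; vowels sit far from consonants.
EMBEDDING = {
    "p": (0.0, 0.0), "b": (0.0, 0.5),
    "t": (1.0, 0.0), "d": (1.0, 0.5),
    "k": (2.0, 0.0), "g": (2.0, 0.5),
    "a": (4.0, 2.0), "i": (5.0, 2.0), "u": (6.0, 2.0),
}

# Adding or deleting a whole sound is assumed to be rare, so it is expensive.
INDEL_COST = 3.0


def sound_distance(a, b):
    """Euclidean distance between the vectors of two sounds."""
    return math.dist(EMBEDDING[a], EMBEDDING[b])


def word_distance(source, target):
    """Weighted edit distance: substitutions cost the embedding distance."""
    rows, cols = len(source) + 1, len(target) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * INDEL_COST
    for j in range(1, cols):
        cost[0][j] = j * INDEL_COST
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i - 1][j] + INDEL_COST,  # delete a sound
                cost[i][j - 1] + INDEL_COST,  # insert a sound
                cost[i - 1][j - 1] + sound_distance(source[i - 1], target[j - 1]),
            )
    return cost[-1][-1]


def closest_counterpart(word, vocabulary):
    """Pick the known-language word with the smallest distance to `word`."""
    return min(vocabulary, key=lambda candidate: word_distance(word, candidate))


if __name__ == "__main__":
    # Made-up words: "pata" from the lost language, candidates from a known one.
    print(closest_counterpart("pata", ["kaga", "bada"]))
```

With these toy vectors, the made-up word "pata" maps to "bada" rather than "kaga", because the P/B and T/D pairs lie closer together than P/K and T/G.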