Google DeepMind AI achieves near-human-level speech capabilities

DeepMind, the Google artificial intelligence division behind the champion-defeating AlphaGo bot, has revealed that it has created some of the most realistic, human-like speech ever generated by a machine. Called WaveNet, the new AI is a deep neural network that learns from recordings of real human speech and then generates speech of its own as raw audio waveforms.

Testing among English and Mandarin Chinese listeners has found that WaveNet is already better than existing text-to-speech systems, but still just short of being as convincing as a real human's speech.

Current text-to-speech programs work in one of two ways. The first produces a human-sounding voice from recordings of actual speech that have been broken up into tiny pieces and rearranged, a bit like a ransom letter. The other relies on a computer-generated voice that has been programmed with rules on grammar and sounds, meaning it doesn't need pre-recorded material, but in turn comes out sounding very robotic.
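
To make the "ransom letter" approach concrete, here is a minimal Python sketch of stitching stored clips together. It is an illustration only, not how any shipping system is built: the unit names, the sine-tone stand-ins for recorded clips, the 16 kHz sample rate, and the crossfade length are all assumptions.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def tone(freq_hz, duration_s):
    """Stand-in for a recorded speech unit: a short sine burst."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory; a real system stores clips cut from recordings.
UNITS = {
    "HH": tone(200.0, 0.05),
    "AH": tone(300.0, 0.05),
    "L": tone(250.0, 0.05),
    "OW": tone(350.0, 0.05),
}


def concatenate(unit_names, fade=80):
    """Join stored units in order, crossfading `fade` samples at each seam."""
    out = []
    for name in unit_names:
        clip = UNITS[name]
        if not out:
            out.extend(clip)
            continue
        # Blend the tail of the output with the head of the next clip.
        for i in range(fade):
            w = i / fade
            out[-fade + i] = (1 - w) * out[-fade + i] + w * clip[i]
        out.extend(clip[fade:])
    return out


audio = concatenate(["HH", "AH", "L", "OW"])
```

Because the output can only ever contain pieces that were recorded in advance, changing the voice or its tone means recording a new inventory, which is the limitation WaveNet is meant to sidestep.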

WaveNet, on the other hand, still uses real voice input, but it learns and mimics this speech rather than cutting up recordings. "A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity," the project's researchers wrote.
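
A toy Python sketch can show what "conditioning on the speaker identity" might look like when audio is generated one sample at a time. This is not DeepMind's code or architecture: the weights are random and untrained, the output is noise rather than speech, and the names and sizes (`LEVELS`, `RECEPTIVE_FIELD`, `SPEAKERS`, `next_sample_probs`, `generate`) are assumptions made for illustration.

```python
import math
import random

LEVELS = 256         # assumed number of quantized amplitude levels
RECEPTIVE_FIELD = 4  # toy context length; a real model looks back much further
SPEAKERS = 2         # hypothetical number of speakers

rng = random.Random(0)
# Untrained random weights, for illustration only.
W_HIST = [[rng.gauss(0, 1) for _ in range(RECEPTIVE_FIELD)] for _ in range(LEVELS)]
W_SPK = [[rng.gauss(0, 1) for _ in range(SPEAKERS)] for _ in range(LEVELS)]


def next_sample_probs(history, speaker_id):
    """Distribution over the next sample given past samples and a speaker."""
    ctx = [h / (LEVELS - 1) for h in history[-RECEPTIVE_FIELD:]]
    logits = [
        sum(w * x for w, x in zip(W_HIST[k], ctx)) + W_SPK[k][speaker_id]
        for k in range(LEVELS)
    ]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(n_samples, speaker_id, seed=0):
    """Draw one sample at a time, feeding each back in as context."""
    sampler = random.Random(seed)
    history = [LEVELS // 2] * RECEPTIVE_FIELD  # start from mid-scale "silence"
    for _ in range(n_samples):
        probs = next_sample_probs(history, speaker_id)
        history.append(sampler.choices(range(LEVELS), weights=probs)[0])
    return history[RECEPTIVE_FIELD:]


voice_a = generate(100, speaker_id=0)
voice_b = generate(100, speaker_id=1)
```

The point of the sketch is that the same weights serve every speaker; only the `speaker_id` input changes, which shifts the probabilities for each new sample.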

Just as impressive is the fact that it can reproduce sounds like mouth movements and breaths, and can mimic inflections, emotions, and accents. And if that's not enough, the AI works just as well with piano music: the researchers fed it a number of classical pieces, enabling it to create its own compositions.

WaveNet is still a long way off from powering Google's apps and voice assistant, but you can listen to a number of samples posted with DeepMind's announcement.

SOURCE: DeepMind, Google