Google deep learning audio-visual model can pinpoint one voice in many

Shane McGlaun - Apr 13, 2018

Google says that people are very good at separating out voices and focusing on the one we want to hear, a phenomenon known as the cocktail party effect. Machines, not so much. Google wants to make computers better at hearing the voice we want to hear and has developed a new deep learning audio-visual model for isolating a single speech signal from a mixture of sounds.

That separation can be from other voices or from background sounds. Google says that, thanks to this new model, it can computationally produce videos in which the speech of a specific person is enhanced while other sounds are suppressed. The method works on ordinary videos with a single audio track; all the user has to do is select the face of the person in the video they want to hear.

Google says the work has potential in a variety of applications, including video conferencing, improved hearing aids, and other situations with multiple people speaking. What makes the technique unique is that it combines the auditory and visual signals of an input video to separate the speech. The method takes the movements of a person's mouth and correlates them with the sounds produced as that person speaks, which allows the model to figure out which parts of the audio belong to that person.

This allows clean speech tracks to be separated out of the video. Google generated training examples using a collection of 100,000 high-quality lectures and talks from YouTube. From those, it was able to extract segments of clean speech, meaning no mixed music, audience sounds, or other speakers, with a single speaker visible in the frames.

That data allowed Google to train a multi-stream convolutional neural network-based model to split a synthetic cocktail-party mixture into separate audio streams, one for each speaker in the video. The network learns separate encodings for each speaker's visual and auditory signals, and that joint representation lets it output a time-frequency mask for each speaker. Each mask is then multiplied by the noisy input spectrogram and converted back to a time-domain waveform to obtain the desired isolated, clean speech signal.
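To make that last step concrete, here is a toy sketch of mask-based source separation: a time-frequency mask is multiplied element-wise by the mixture's spectrogram, and the result is converted back to a waveform with an inverse short-time Fourier transform. This is not Google's model; the hand-built band-pass mask below merely stands in for the mask the network would predict, using standard `scipy.signal` transforms.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000  # sample rate in Hz
t = np.arange(fs) / fs

# Toy "mixture": a 440 Hz tone (our target "speaker") plus broadband noise.
rng = np.random.default_rng(0)
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * rng.standard_normal(fs)

# Forward STFT of the noisy mixture.
f, frames, spec = stft(mixture, fs=fs, nperseg=512)

# Hypothetical mask: in the real system the network predicts one mask per
# speaker; here we simply keep frequency bins within 100 Hz of the tone.
mask = np.zeros(spec.shape)
mask[np.abs(f - 440) < 100, :] = 1.0

# Multiply the mask by the mixture spectrogram, then invert back to a
# time-domain waveform — the step the article describes.
masked_spec = mask * spec
_, separated = istft(masked_spec, fs=fs, nperseg=512)
```

In the real model the mask values are learned and generally soft (between 0 and 1), but the mechanics of applying them are the same as in this sketch.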

SOURCE: Google
