Researchers turn mono audio into 2.5D sound with machine learning

Brittany A. Roston - Dec 28, 2018, 4:43 pm CST
Researchers at the University of Texas at Austin and Facebook AI Research have used machine learning to transform monaural audio into binaural audio. The method uses an accompanying video to infer the configuration of objects in the scene, producing what they call “2.5D visual sound,” a more immersive listening experience. The technique offers a way to turn common mono audio into spatial audio suited to applications like VR headsets.

Humans can perceive the distance and direction of sounds in 3D space thanks to having two ears separated by the width of the head. Cues such as how loud a sound is in each ear and which ear it reaches first let listeners work out where a sound source sits.
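The "which ear it reaches first" cue is the interaural time difference. A minimal sketch of it, using a simple path-length model (the head width and speed-of-sound constants are illustrative assumptions, not values from the research):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 °C
HEAD_WIDTH = 0.18       # assumed ear-to-ear distance in metres

def interaural_time_difference(azimuth_deg: float) -> float:
    """Rough interaural time difference for a distant source at the
    given azimuth (0° = straight ahead, 90° = directly to one side).
    The far ear's signal is delayed by the extra distance the sound
    must travel across the head."""
    extra_path = HEAD_WIDTH * math.sin(math.radians(azimuth_deg))
    return extra_path / SPEED_OF_SOUND  # seconds
```

For a source directly to one side, this model gives a delay of roughly half a millisecond, which is about the largest time difference a human head produces.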

This so-called 3D audio experience can be replicated by recording with a binaural setup, which uses two microphones spaced roughly an ear's width apart. The resulting two-channel recording, when played back over headphones, provides realistic, immersive audio that lets the listener place sound sources in 3D space.

The majority of audio is monaural, though, meaning it was recorded with a single microphone from a single location. Mono audio is adequate for many purposes, but it doesn't capture the cues that let humans perceive the distance and direction of sound sources, so the result is less realistic and immersive.

Transforming mono audio into binaural audio after the fact has been more or less impossible, but researchers Ruohan Gao and Kristen Grauman have found a method to get close, using deep learning to produce what they call “2.5D” audio. The method relies on an accompanying video, which is processed for visual cues; those cues are combined with the audio to adjust the two channels, simulating the positions of the sound-producing objects in 3D space.
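To make the general idea concrete, here is a deliberately crude sketch of splitting a mono signal into two channels by delaying and attenuating the "far" ear for an assumed source position. This is an illustrative toy, not the authors' method: their system learns these spatial cues from video with a deep network rather than taking a position and hand-tuned constants as input.

```python
import math

def pan_mono_to_stereo(samples, azimuth_deg, sample_rate=44_100):
    """Toy mono-to-stereo panning: delay and attenuate the far ear
    based on an assumed source azimuth (0° = ahead, +90° = hard right).
    Head width (0.18 m), speed of sound (343 m/s), and the level-difference
    factor (0.5) are illustrative assumptions."""
    az = math.radians(azimuth_deg)
    # Interaural time difference from a simple path-length model,
    # converted to a whole number of samples.
    itd_samples = round(abs(0.18 * math.sin(az)) / 343.0 * sample_rate)
    # Crude interaural level difference: the far ear hears it quieter.
    near_gain, far_gain = 1.0, 1.0 - 0.5 * abs(math.sin(az))
    far = [0.0] * itd_samples + [s * far_gain for s in samples]
    near = [s * near_gain for s in samples] + [0.0] * itd_samples
    if azimuth_deg >= 0:   # source on the right: the left ear is the far ear
        return far, near   # (left, right)
    return near, far
```

Feeding a constant signal through at 90° to the right delays the left channel by about 23 samples at 44.1 kHz and halves its level, which is the kind of two-channel difference the learned model has to produce from visual cues alone.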

The method has some limitations, particularly that it can't account for sound sources that aren't visible in the video. An example of the 2.5D audio output is provided in the video above, but you'll need a pair of headphones to perceive the effect.
