YouTube uses AI to caption video sound effects

Last month, YouTube announced machine-generated automatic video captions, and now it has announced a new type of automatic captioning that builds on that: the captioning of sound effects within a video. The idea is to give people who are deaf or hard of hearing a way to perceive the non-spoken parts of a video by writing out the sound effects that occur between the spoken words.

For the average person, sound effects in a video are expected and rarely consciously acknowledged. They're the spices that shape a meal, each perceived as part of the whole product. For someone who is deaf or hard of hearing, though, a video may be lacking if the only sounds conveyed are spoken words. In some cases, sound effects are an important part of the scene's action and are not readily discernible from the visuals alone.

YouTube is addressing that by introducing what it says is its first-ever sound effect captioning system. The system is automatic and relies on a neural network trained on thousands of hours of video. It is described as still being in its early stages, and it can currently label sounds such as [LAUGHTER], [MUSIC], and [APPLAUSE].

You can see an example of the company's automatic captioning technology in the video above. Other sounds will be added in the future, possibly including other common ones such as a knock on a door or a ringing bell. YouTube will continue working to improve its system over time.

SOURCE: YouTube Blog