AI video generators like OpenAI’s Sora, Luma AI’s Dream Machine, and Runway Gen-3 Alpha have stolen headlines lately, but a new Google DeepMind tool could fix the one weakness they all share: a lack of accompanying audio.
A new Google DeepMind announcement has unveiled a video-to-audio (or ‘V2A’) tool that uses a combination of pixels and text prompts to automatically generate soundtracks and soundscapes for AI-generated videos. In short, it’s another big step towards creating fully automated movie scenes.
As you’ll see in the videos below, this V2A technology can be combined with AI video generators (including Google’s Veo) to create an atmospheric score, topical sound effects, or even dialogue that Google DeepMind says “fits the characters and tone of a video”.
Creators aren’t stuck with one audio option either: DeepMind’s new V2A tool can apparently generate an “unlimited number of soundtracks for any video input”, meaning you can steer it towards your desired result with a few simple text prompts.
Google says its tool stands out from competing technology thanks to its ability to generate audio from pixels alone – an accompanying text prompt is apparently optional. But DeepMind is also very aware of the high potential for abuse and deepfakes, which is why this V2A tool is reserved as a research project for the time being.
DeepMind says that “before we consider opening access to it to the general public, our V2A technology will undergo rigorous safety assessments and testing.” It will certainly have to be rigorous, because the ten short video examples show that the technology has explosive potential, both for good and for bad.
The potential for amateur filmmaking and animation is enormous, as evidenced by the ‘horror’ clip below and one for a cartoon baby dinosaur. A Blade Runner-like scene (below), showing cars skidding through a city to an electronic music soundtrack, also shows how this could drastically reduce budgets for science fiction films.
Concerned creators will at least take some solace in the obvious dialogue limitations seen in the “Claymation family” video. But if the past year has taught us anything, it’s that DeepMind’s V2A technology will only improve dramatically from here.
Where we’re going, we don’t need voice actors
The combination of AI-generated videos with AI-created soundtracks and sound effects is a game changer on many levels – adding a new dimension to an already white-hot arms race.
OpenAI has already said it plans to add audio to its Sora video generator, which is due to launch later this year. But DeepMind’s new V2A tool shows that the technology is already at an advanced stage and can create audio from video alone, instead of needing endless prompts.
DeepMind’s tool works using a diffusion model that combines information from the video’s pixels and the user’s text prompts, iteratively refining random noise into compressed audio, which is then decoded into an audio waveform. It was apparently trained on a combination of video, audio, and AI-generated annotations.
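DeepMind hasn’t published code or API details, but the pipeline it describes (pixels plus an optional text prompt conditioning a diffusion process over compressed audio, followed by waveform decoding) can be sketched conceptually. The toy Python sketch below illustrates only the shape of that pipeline; every function here (encode_conditioning, denoise_step, decode_audio) is a hypothetical placeholder standing in for a trained network, not DeepMind’s actual implementation.

```python
import numpy as np

NUM_STEPS = 50  # number of reverse-diffusion refinement steps (arbitrary toy value)

# All three components below are hypothetical placeholders for the learned
# networks DeepMind describes (a visual/text encoder, a diffusion denoiser,
# and an audio decoder). They are NOT DeepMind's real API.

def encode_conditioning(video_frames: np.ndarray, prompt: str) -> np.ndarray:
    """Fuse pixel information and an (optional) text prompt into one conditioning signal."""
    pixel_features = video_frames.mean(axis=(0, 1, 2))                # toy visual summary
    text_features = np.full_like(pixel_features, float(len(prompt)))  # toy text summary
    return pixel_features + 0.1 * text_features

def denoise_step(latents: np.ndarray, conditioning: np.ndarray, t: int) -> np.ndarray:
    """One reverse-diffusion step: estimate and subtract a little noise,
    steered by the video/text conditioning."""
    predicted_noise = 0.1 * latents - 0.01 * conditioning.mean()
    return latents - predicted_noise * (t / NUM_STEPS)

def decode_audio(latents: np.ndarray) -> np.ndarray:
    """Stand-in for a neural audio codec mapping compressed latents to a waveform."""
    return np.tanh(latents)

rng = np.random.default_rng(0)
video = rng.random((24, 64, 64, 3))         # 24 RGB frames at 64x64 (toy input)
prompt = "cinematic horror, tense strings"  # optional text guidance

conditioning = encode_conditioning(video, prompt)
latents = rng.standard_normal(16_000)       # start from pure noise in latent space

for t in range(NUM_STEPS, 0, -1):           # iteratively refine noise into audio
    latents = denoise_step(latents, conditioning, t)

waveform = decode_audio(latents)            # "one second" of toy audio at 16 kHz
print(waveform.shape, float(waveform.min()), float(waveform.max()))
```

In a real system each placeholder would be a trained model, but the structure is the one DeepMind outlines: the conditioning signal lets the same noise-to-audio process produce different soundtracks as the pixels or prompt change.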
Exactly what content this V2A tool was trained on isn’t clear, but Google clearly has a potentially huge advantage as the owner of the world’s largest video-sharing platform, YouTube. Neither YouTube nor its terms of service are completely clear on how videos can be used to train AI, but YouTube CEO Neal Mohan recently told Bloomberg that some creators have contracts that allow their content to be used to train AI models.
Clearly, the technology still has limitations with dialogue and is a long way from producing a Hollywood-ready finished article. But it’s already a potentially powerful tool for storyboarding and for amateur filmmakers, and strong competition from the likes of OpenAI means it’ll only improve rapidly from here.