Now that ChatGPT and Midjourney are pretty much mainstream, the next big AI race is text-to-video generators – and Nvidia has just shown off some impressive demos of the tech that could soon take your GIFs to a new level.
A new research paper and micro-site from Nvidia’s Toronto AI Lab, called “High-Resolution Video Synthesis with Latent Diffusion Models”, gives us a taste of the incredible video creation tools that are about to join the ever-growing list of the best AI art generators.
Latent Diffusion Models (or LDMs) are a type of AI that can generate videos without needing massive computing power. Nvidia says its tech does this by building on the work of text-to-image generators, in this case Stable Diffusion, and adding a “temporal dimension to the latent space diffusion model”.
In other words, its generative AI can make still images move in a realistic way and upscale them using super-resolution techniques. This means it can produce short, 4.7-second-long videos with a resolution of 1280×2048, or longer ones at the lower resolution of 512×1024 for driving videos.
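To get a rough feel for why working in a "latent space" saves computing power, here is a minimal back-of-the-envelope sketch. It assumes the encoder setup commonly used with Stable Diffusion (8x spatial downsampling and 4 latent channels), which may not match Nvidia's model exactly, and the 16-frame clip length and the `latent_shape` and `count_values` helpers are purely illustrative.

```python
def latent_shape(num_frames, height, width, downsample=8, latent_channels=4):
    """Shape of a video latent as (frames, channels, height, width).

    The defaults (8x downsampling, 4 channels) are assumptions borrowed
    from Stable Diffusion's image encoder, not figures from Nvidia.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (num_frames, latent_channels, height // downsample, width // downsample)


def count_values(shape):
    """Total number of values in a tensor with the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


num_frames = 16  # hypothetical clip length, chosen only for illustration
pixel_shape = (num_frames, 3, 512, 1024)  # RGB frames at 512x1024
video_latent = latent_shape(num_frames, 512, 1024)

# How many times fewer values the diffusion model handles in latent space
ratio = count_values(pixel_shape) / count_values(video_latent)
print(video_latent, ratio)
```

The first entry in each shape is the "temporal dimension": an image model handles one frame at a time, while a video model has to keep a whole stack of frames consistent. Under these assumptions, the model works on far fewer values per clip than it would on raw pixels.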
Our immediate thought on seeing the early demos (like the ones above and below) is how much this could boost our GIF game. Okay, there are bigger ramifications, like the democratization of video creation and the prospect of automated film adaptations, but at this stage text-to-GIF seems to be the most exciting use case.
Simple prompts like ‘a storm trooper vacuuming on the beach’ and a ‘teddy bear is playing the electric guitar, high definition, 4K’ produce some pretty usable results, even if there are naturally artifacts and morphing with some of the creations.
Right now, that makes text-to-video tech like Nvidia’s new demos most suitable for thumbnails and GIFs. But, given the rapid improvements seen in Nvidia’s AI generation for longer scenes, we probably won’t have to wait long for text-to-video clips in stock libraries and beyond.
Analysis: The next frontier for generative AI
Nvidia isn’t the first company to show off an AI text-to-video generator. We recently saw Google Phenaki make its debut, revealing its potential for 20-second clips based on longer prompts. Its demos also include a clip that’s over two minutes long, albeit a ropier one.
The startup Runway, which helped create the text-to-image generator Stable Diffusion, also revealed its Gen-2 AI video model last month. Alongside responding to prompts like ‘the late afternoon sun peeking through the window of a New York City loft’ (the result of which is above), it lets you provide a still image to base the generated video on and lets you request styles to be applied to its videos, too.
The latter was also a theme of the recent demos for Adobe Firefly, which showed how much easier AI is going to make video editing. In programs like Adobe Premiere Rush, you’ll soon be able to type in the time of day or season you want to see in your video and Adobe’s AI will do the rest.
The recent demos from Nvidia, Google, and Runway show that full text-to-video generation is in a slightly more nebulous state, often creating weird, dreamy or warped results. But, for now, that’ll do nicely for our GIF game – and rapid improvements that’ll make the tech suitable for longer videos are surely just around the corner.