So what if OpenAI’s Sora didn’t create the stunning balloon head video without help – I still think it’s incredible
Sora fans just learned a hard lesson: filmmakers will be filmmakers and will do whatever it takes to make their creations as convincing and dazzling as possible. But if that makes them think less of OpenAI’s generative AI video platform, they’re wrong.
When OpenAI handed an early version of its generative video AI platform to some creatives, one team – Shy Kids – created an unforgettable video of a man with a yellow balloon for a head. Many hailed Air Head as a strange and powerful breakthrough, but a behind-the-scenes video has given it a completely different twist. It turns out that as good as Sora is at generating video from text prompts, there were a lot of things the platform couldn’t do, or didn’t produce the way the filmmakers wanted.
The video’s post-production editor, Patrick Cederberg, detailed in an interview with FxGuide the long list of changes the team made to Sora’s output to create the stunning effects we see in the final 1-minute-22-second Air Head video.
For example, Sora didn’t understand typical cinematic shots like pans, tracking shots, and zooms, so the team sometimes had to create a pan-and-tilt shot from an existing, more static clip (a rough sketch of the trick follows).
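For the curious, here is a minimal sketch of the general post-production trick – not Shy Kids’ actual pipeline – for faking a camera pan: take a shot that is larger than the delivery frame, then slide a crop window across it over time. It uses the open-source OpenCV library, and the file names are hypothetical.

```python
# Minimal sketch: faking a camera pan (and a slight tilt) in post by
# sliding a crop window across an oversized, static source clip.
# File names are placeholders, not from the Shy Kids workflow.
import cv2

src = cv2.VideoCapture("static_wide_shot.mp4")
fps = src.get(cv2.CAP_PROP_FPS)
w = int(src.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(src.get(cv2.CAP_PROP_FRAME_HEIGHT))
n = int(src.get(cv2.CAP_PROP_FRAME_COUNT))

# Deliver a frame half the size of the source, leaving room to "move".
out_w, out_h = w // 2, h // 2
out = cv2.VideoWriter("fake_pan.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (out_w, out_h))

for i in range(n):
    ok, frame = src.read()
    if not ok:
        break
    # Slide the window left to right over the clip's duration (the pan)
    # and drift it downward slightly as well (the tilt).
    x = int((w - out_w) * i / max(n - 1, 1))
    y = int((h - out_h) * 0.5 * i / max(n - 1, 1))
    out.write(frame[y:y + out_h, x:x + out_w])

src.release()
out.release()
```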
Furthermore, while Sora can produce long videos from long text prompts, there is no guarantee that the subjects in each prompt will remain consistent from one output clip to the next. It took a lot of work and experimentation to get videos that connected different shots into a semi-coherent whole.
As Cederberg notes in the Air Head behind-the-scenes video, “What you see in the end took work and human hands to make it look semi-consistent.”
The balloon head sounds particularly challenging, because Sora understands the idea of a balloon but does not base its results on, say, a specific video or photo of one. In Sora’s raw output, each balloon had a string attached to it; Cederberg’s team had to paint that out of every frame (the sketch below shows the general idea). Even more frustrating, Sora often wanted to put an imprint, outline, or drawing of a face on the balloon. And while the final video shows a yellow balloon in every shot, the Sora output often had different balloon colors that Shy Kids would adjust afterwards.
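Painting an element out of every frame is a classic cleanup job. As a rough illustration – not Shy Kids’ actual workflow – here is a minimal OpenCV sketch that inpaints a masked region in each frame. In production, an artist or a tracker would supply an accurate per-frame mask of the string, rather than the fixed placeholder rectangle assumed here.

```python
# Minimal sketch: "painting out" an unwanted element (such as a balloon
# string) from every frame of a clip using OpenCV inpainting.
# File names and the mask are hypothetical placeholders.
import cv2
import numpy as np

src = cv2.VideoCapture("balloon_shot.mp4")
fps = src.get(cv2.CAP_PROP_FPS)
w = int(src.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(src.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("balloon_no_string.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

while True:
    ok, frame = src.read()
    if not ok:
        break
    # Placeholder mask: a thin vertical strip where the string would hang.
    # A real pipeline would track the string and build this per frame.
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[h // 2:, w // 2 - 4:w // 2 + 4] = 255
    # Fill the masked pixels from their surroundings.
    out.write(cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA))

src.release()
out.release()
```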
Shy Kids told FxGuide that all the video they used was Sora output, but if they had used the video untouched, the film would have lacked the continuity and coherence of the final, wistful product.
This is good news
Does this news turn the charming Shy Kids video into Sora’s Milkshake Duck? Not necessarily.
If you look at some of the unretouched videos and images in the behind-the-scenes video, they are still remarkable. And while post-production was necessary, Shy Kids never shot a single frame of actual film to produce the initial images and video.
Even as AI innovation advances and we see huge generational leaps every three months, AI at virtually every level is far from perfect. ChatGPT’s answers are usually accurate, but they can still lack context and misrepresent basic facts. With text-to-image generation, the results are even more varied because, unlike AI-generated text responses, which can draw on fact-based sources and usually predict the correct next word, generative image models base their output on a learned representation of an idea or concept. This is especially true of diffusion models, which use their training data to figure out what something should look like, meaning the output can vary wildly from image to image.
“It’s not as simple as a magic trick: type something in and you get exactly what you hope for,” says Shy Kids producer Sydney Leeder in the behind-the-scenes video.
These models have a general idea of what a balloon or a person looks like, but ask one to imagine a man on a bicycle six times and you will get six different results. They may all look good, but it’s unlikely that the man or the bike will be the same in every image (the sketch below shows why). Video generation exacerbates the problem: the chance of maintaining scene and image consistency across thousands of frames, and from clip to clip, is extremely low.
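To make that concrete, here is a minimal sketch using Hugging Face’s open-source diffusers library, with Stable Diffusion v1.5 standing in for Sora’s model (which isn’t publicly available). It samples the same prompt six times with different random seeds; every run produces a plausible man on a bicycle, but never the same one twice.

```python
# Minimal sketch: sampling the same prompt repeatedly from a diffusion
# model. Uses the open-source diffusers library with Stable Diffusion
# v1.5 as a stand-in; Sora's API is not public.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a man riding a bicycle down a city street, photorealistic"

# Six different seeds -> six plausible but inconsistent results: the man
# and the bike will differ from image to image, which is exactly the
# continuity problem a filmmaker has to fight.
for seed in range(6):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"bicycle_{seed}.png")
```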
With that in mind, Shy Kids’ achievement is even more remarkable. Air Head manages to retain both the otherworldliness of an AI video and its cinematic essence.
This is how AI should work
Automation does not mean completely eliminating human intervention. That is as true of video as it is of the factory floor, where the introduction of robots has not led to human-free production. I vividly remember Elon Musk’s efforts to automate as much of the Tesla Model 3’s production as possible. It was a near disaster, and production went more smoothly when he brought the humans back.
A creative process such as filmmaking will always require a human touch. Shy Kids needed an idea before they could start feeding prompts to Sora. And when Sora didn’t understand their intentions, they had to adjust the output by hand. Like most creative endeavors, it became a partnership, one in which the talented Sora AI provided a great shortcut but still didn’t see the project through to completion.
Instead of bursting Air Head’s bubble, these revelations remind us that the marriage between traditional media and AI still requires the guiding hand of a human, and that’s unlikely to change – at least for now.