Have you ever wondered what the Mona Lisa would look like rapping? Microsoft unveils VASA-1, an AI tool that can make images talk – with eerily realistic results

The line between what is real and what is not is becoming ever thinner thanks to a new AI tool from Microsoft.

The technology, called VASA-1, transforms a still image of a person’s face into an animated clip of that person talking or singing.

Lip movements are “excellently synchronized” with audio, making it seem as if the subject has come to life, the tech giant claims.

In one example, Leonardo da Vinci’s 16th century masterpiece “The Mona Lisa” begins rapping crudely in an American accent.

However, Microsoft admits that the tool “could be misused to impersonate a human” and is not releasing it to the public.

Microsoft’s new VASA-1 tool can generate clips of people talking from a still image and an audio clip of speech – but the tech giant isn’t releasing it anytime soon

VASA-1 requires a static image of a face – whether it is a photo of a real person or a work of art or drawing of a fictional person.

This is then ‘meticulously’ combined with speech audio ‘from anyone’ to make the face come to life.

The AI is trained on a library of facial expressions, so the still image can even be animated in real time – that is, while the audio is being spoken.

In a blog post, Microsoft researchers describe VASA as a “framework for generating lifelike talking faces of virtual characters.”

“It paves the way for real-time interactions with lifelike avatars that mimic human conversational behavior,” they say.

‘Our method is capable not only of producing precise lip-audio synchronization, but also of capturing a large spectrum of emotions, expressive facial nuances and natural head movements that contribute to the perception of realism and liveliness.’

In terms of use cases, the team thinks VASA-1 could enable digital AI avatars to “engage with us in ways that are as natural and intuitive as interactions with real people.”

But experts have shared concerns about the technology, which, if released, could be used to make people appear to say things they never said.

VASA-1 requires a static image of a face – whether it is a photo of a real person or a work of art or drawing of an imaginary person. It then ‘meticulously’ combines this with speech audio ‘from anyone’ to make the face come to life

Microsoft's team said VASA-1 'is not intended to create content used to mislead or defraud'

Another potential risk is fraud, as people could be fooled online by a fake video of someone they trust.

Jake Moore, a security specialist at ESET, said “seeing is definitely not believing anymore.”

‘As this technology improves, it is a race against time to ensure everyone is fully aware of what is possible, and that they must think twice before accepting correspondence as authentic,’ he told MailOnline.

Anticipating concerns the public might have, Microsoft experts said VASA-1 “is not intended to create content that is used to mislead or defraud.”

“However, like other related content generation techniques, it can still be potentially abused to impersonate humans,” they add.

‘We oppose any behavior that creates misleading or harmful content of real people, and we are interested in applying our technique to advance forgery detection.

‘Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows there is still a gap before they achieve the authenticity of real videos.’

Microsoft admits that existing techniques are still far from ‘achieving the authenticity of natural talking faces’, but the possibilities of AI are growing rapidly.

Regardless of the face in the image, the tool can form realistic facial expressions that match the sounds of the spoken words

According to researchers at the Australian National University, fake faces made by AI appear more realistic than human faces.

These experts warned that AI images of people tend to have a ‘hyper-realism’, with faces that are more in proportion – and people read this as a sign of being human.

Another study by experts at Lancaster University found that fake AI faces appear more trustworthy, which has implications for online privacy.

Meanwhile, OpenAI, maker of the famous ChatGPT bot, introduced its “scary” text-to-video tool Sora in February, which can create ultra-realistic AI video clips based solely on short, descriptive text prompts.

This frame of an AI-generated video of Tokyo, taken by OpenAI's Sora, shocked experts with its 'terrifying' realism

In response to the prompt ‘a cat that wakes its sleeping owner and demands breakfast’, Sora returned this clip

A dedicated page on OpenAI’s website features a rich gallery of AI-generated videos, from a man walking on a treadmill to reflections in the windows of a moving train and a cat waking its owner.

However, experts warned it could wipe out entire industries, such as film production, and lead to a rise in deep fake videos in the run-up to the US presidential election.

‘The idea that an AI can create a hyper-realistic video of, say, a politician doing something untoward should ring alarm bells as we enter the biggest election year in human history,’ said Dr Andrew Rogoyski of the University of Surrey.

A research paper describing Microsoft’s new tool has been published as a pre-print.

Four of these faces were produced entirely by AI… can you tell which one is real? New research shows that almost 40% of people got it wrong

Recognizing the difference between a real photo and an AI-generated image is becoming increasingly difficult as deepfake technology becomes more realistic.

Researchers from the University of Waterloo in Canada wanted to determine whether humans can distinguish AI images from real ones.

They asked 260 participants to label ten images collected by a Google search and ten images generated by Stable Diffusion or DALL-E – two AI programs used to create deepfake images – as real or fake.

The researchers noted that they expected 85 percent of participants to be able to identify the images accurately, but only 61 percent of people guessed correctly.
