Researchers find a way to make photos and muted videos ‘speak’ – here’s what it could mean for your privacy
Capturing audio from a still image may sound like something out of a science fiction novel, but a team of researchers has figured out a way to do it with the help of AI.
Using a machine learning tool called Side Eye, a team led by Kevin Fu, a Northeastern University professor of electrical and computer engineering and computer science, can recover a surprising amount of sound information from images.
By applying Side Eye to a still image, they can determine the gender of a speaker in the room where the photo was taken, as well as the words that person spoke, according to TechXplore. They can also apply the tool to muted videos.
An AI-powered privacy nightmare?
“Imagine someone makes a TikTok video and mutes it and dubs music,” Fu told the publication. “Have you ever been curious about what they’re really saying? Was it ‘Watermelon watermelon’ or ‘Here’s my password?’ Was someone talking behind them? You can actually pick up what’s being spoken off camera.”
The machine learning-powered Side Eye exploits the image stabilization technology found in almost all smartphone cameras.
Cameras built into smartphones have springs to suspend the lens in liquid, meaning photos won’t be blurred or out of focus due to someone’s unsteady grip. Sensors and an electromagnet work together to push the lens in the opposite direction of the vibrations being applied to stabilize the image.
When someone speaks near the camera lens while a photo is being taken, the sound creates small vibrations in the springs and subtly bends the light. Extracting sound frequencies from these vibrations would normally be nearly impossible, but it becomes feasible thanks to the rolling shutter method of photography that most cameras use.
“Basically the way cameras work today to reduce costs is that they don’t scan all the pixels of an image at the same time, but row by row,” Fu added. “(That happens) hundreds of thousands of times in a single photo. What this basically means is that you can amplify more than a thousand times how much frequency information you can get, essentially the granularity of the audio.”
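To get a feel for the scale Fu describes, here is a rough back-of-the-envelope sketch in Python. The frame rate and row count are illustrative assumptions, not figures from the research, and the calculation is idealized: it treats every row readout as one evenly spaced sample and ignores the pause between frames.

```python
def effective_sample_rate(frames_per_second: float, rows_per_frame: int) -> float:
    """Row readouts per second, treating each sensor row as one sample."""
    return frames_per_second * rows_per_frame


def nyquist_limit(sample_rate_hz: float) -> float:
    """Highest frequency that a given sample rate can represent."""
    return sample_rate_hz / 2


# Assumed values for illustration only (a typical 1080p video at 30 fps).
FPS = 30
ROWS = 1080

row_rate = effective_sample_rate(FPS, ROWS)
gain = row_rate / FPS  # how many times more samples than one per frame

print(f"One sample per frame: up to {nyquist_limit(FPS):.0f} Hz")
print(f"One sample per row:   up to {nyquist_limit(row_rate):.0f} Hz")
print(f"Samples gained by reading row by row: {gain:.0f}x")
```

Under these assumptions, sampling once per frame could only capture vibrations below about 15 Hz, far too low for speech, while sampling once per row pushes the theoretical ceiling into the range of human hearing. That is the sense in which row-by-row scanning multiplies the available frequency information by more than a thousand.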
Side Eye is still at a very basic stage and needs far more training data before it can be refined, but a more advanced form of the system could become a cybersecurity nightmare if it fell into the wrong hands.
But there are also positive implications for the technology, especially if a much more advanced form of Side Eye is used as a kind of digital evidence for those investigating crime.