Researchers from MIT, Microsoft and Adobe have developed an algorithm to extract audio information from video footage.
They recovered intelligible speech from a video of a potato chip bag filmed from 15 feet away, through soundproof glass. The team also obtained useful audio signals from videos of aluminum foil, the surface of a glass of water, and the leaves of a potted plant.
The researchers, led by Abe Davis, a graduate student at MIT, will present their findings at the Siggraph computer graphics conference to be held next week in Vancouver, Canada.
About the Research
Objects vibrate when hit by sound, creating a subtle visual signal that is usually invisible to the naked eye.
The researchers sought to reconstruct the sounds causing such vibrations through the use of high-speed video cameras, lots of image filters, and an algorithm they developed.
In some of their experiments, they used a high-speed camera that captured 2,000 to 6,000 frames per second.
In other experiments, the team used an ordinary digital camera. Standard video cameras capture up to 60 frames per second, as do the cameras in some smartphones.
The researchers’ method could help identify the gender of a speaker in a room, the number of speakers present, and possibly the speakers’ identities.
How the Technology Works
The motions the researchers captured measured about one-tenth of a micrometer. That corresponds to about five thousandths of a pixel in a close-up image.
A human hair is about 90 micrometers wide.
When an object in an image moves, even slightly, pixel colors shift between successive frames of high-speed video.
That color shift contains information about the size and duration of the vibration.
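To illustrate the principle in simplified form (this is not the researchers' actual pipeline), a minimal sketch: track the frame-to-frame intensity change of a video region and treat that 1-D signal as audio. Here the "video" is synthetic, with brightness modulated by a 440 Hz tone standing in for the sub-pixel motions a real camera would record.

```python
import numpy as np

def recover_signal(frames):
    """Average intensity of each frame, then subtract the overall mean
    to get a zero-centered 1-D signal that tracks the vibration."""
    means = np.array([f.mean() for f in frames])
    return means - means.mean()

def dominant_frequency(signal, fps):
    """Return the strongest frequency component (Hz) via FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[spectrum[1:].argmax() + 1]  # skip the DC bin

# Synthetic 2,000 fps "video": an 8x8 gray patch whose brightness
# oscillates with a 440 Hz tone.
fps, n_frames, tone = 2000, 2000, 440.0
t = np.arange(n_frames) / fps
frames = [np.full((8, 8), 128.0) + 5.0 * np.sin(2 * np.pi * tone * ti)
          for ti in t]

signal = recover_signal(frames)
print(dominant_frequency(signal, fps))  # ~440.0 Hz
```

A real recording would of course carry the vibration in sub-pixel motion rather than whole-frame brightness, which is why the researchers needed the filter-based motion analysis described below rather than a simple intensity average.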
MIT researchers in 2012 developed an algorithm that amplifies variations imperceptible to the naked eye in successive frames of video.
That amplification enables the analysis of video passed through image filters, which measure fluctuations such as changing color values at boundaries, at different orientations (horizontal, vertical and diagonal), and at different scales of measurement.
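The amplification idea can be sketched, in a heavily simplified form, as temporal bandpass filtering: isolate each pixel's variation within a chosen frequency band, scale it up, and add it back to the video. The band limits, amplification factor, and FFT-based filter below are illustrative choices, not the 2012 algorithm itself.

```python
import numpy as np

def amplify_variations(frames, alpha, fps, lo, hi):
    """Simplified Eulerian-style magnification: bandpass each pixel's
    time series between lo and hi Hz, scale by alpha, add back."""
    stack = np.stack(frames).astype(float)           # shape (T, H, W)
    spec = np.fft.rfft(stack, axis=0)                # per-pixel FFT in time
    freqs = np.fft.rfftfreq(stack.shape[0], d=1.0 / fps)
    keep = (freqs >= lo) & (freqs <= hi)
    spec[~keep] = 0                                  # zero out-of-band bins
    bandpassed = np.fft.irfft(spec, n=stack.shape[0], axis=0)
    return stack + alpha * bandpassed

# A tiny 5 Hz flicker (amplitude 0.1) becomes plainly visible
# after 50x amplification of the 4-6 Hz band.
fps, n_frames = 100, 100
t = np.arange(n_frames) / fps
frames = [np.full((4, 4), 100.0) + 0.1 * np.sin(2 * np.pi * 5.0 * ti)
          for ti in t]
out = amplify_variations(frames, alpha=50, fps=fps, lo=4.0, hi=6.0)
```

In the published work the filtering is applied not to raw pixel values but to the outputs of oriented, multi-scale filters, which is what makes sub-pixel motions recoverable.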
The team led by Davis developed an algorithm for use with high-speed videos that combines the filters’ output to infer the motions of an object as a whole when it’s struck by sound waves. Different edges of such an object may move in different directions.
A variation of that algorithm was developed for analyzing conventional video.
“It’s interesting technology, but with very limited application — at least at the moment,” Jim McGregor, principal analyst at Tirias Research, told TechNewsWorld. “The only application that comes to mind is surveillance.”
The researchers have suggested the technology might be used by law enforcement agencies or in forensics.
Still, there’s a possibility the algorithms might have applications in the medical or manufacturing fields, “where you can use the information to determine certain patterns or characteristics other than speech,” McGregor speculated.
Possible Unwelcome Scenarios
Given the widespread use of high-tech for surveillance by the United States National Security Agency, the CIA and law enforcement, it might come as no surprise if this technology were employed for surveillance, with or without legal authority.
Accounts of the CIA hacking into the Senate Intelligence Committee’s computers have been making the rounds. Also, civil liberties groups have expressed concern over reports that the U.S. Department of Justice has worked with local law enforcement agencies nationwide to clamp down on public records requests concerning the controversial Stingray cellphone monitoring technology.
“I think there’s genuine cause for concern, but I’m not sure that formal regulations will do much to curtail improper use,” Charles King, principal analyst at Pund-IT, told TechNewsWorld.
“For every potentially abusive case, you could probably describe a completely legitimate application,” he argued. “How does one craft laws that guard against abuse while leaving the door open for use by law enforcement?”