Google Research engineers have developed a deep learning system that can separate voices from audio-visual data recorded in crowded environments.
The system mimics the “cocktail party” effect, the human brain’s ability to isolate and focus on one or more particular voices in a crowd.
The system needs both audio and video inputs
The system is designed to work with both audio and video data at the same time. Google says it trained the system on over 100,000 high-quality videos of lectures and talks hosted on YouTube.
Each talk featured a single speaker, with minimal background noise. The researchers trained the AI to recognize sounds based on lip and mouth movement.
Researchers then moved to the next step of the training program by mixing different talks together to create synthetic cocktail parties, along with non-speech background data, to make it harder for the AI to distinguish voices.
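The mixing step described above can be illustrated with a minimal sketch. This is not Google's code: the function name `mix_synthetic_cocktail_party`, the `noise_gain` value, and the sine-tone stand-ins for speech are all assumptions made for illustration. The idea is simply that clean single-speaker tracks are summed with background noise, while the clean tracks are kept as the answers the model should recover.

```python
import numpy as np

def mix_synthetic_cocktail_party(clean_tracks, noise, noise_gain=0.3):
    """Sum clean single-speaker tracks with background noise.

    Returns the noisy mixture plus the trimmed clean tracks, which
    would serve as the separation targets during training.
    noise_gain is a hypothetical value chosen for this example.
    """
    # Trim everything to the shortest signal so the arrays line up.
    length = min(min(len(track) for track in clean_tracks), len(noise))
    targets = [np.asarray(track[:length], dtype=np.float64) for track in clean_tracks]
    background = noise_gain * np.asarray(noise[:length], dtype=np.float64)
    mixture = np.sum(targets, axis=0) + background
    return mixture, targets

# Stand-in signals: two sine tones play the role of two speakers' audio.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = np.sin(2 * np.pi * 330 * t)
rng = np.random.default_rng(0)
noise = rng.normal(size=sample_rate)

mixture, targets = mix_synthetic_cocktail_party([speaker_a, speaker_b], noise)
```

Because the clean tracks are known in advance, a synthetic mix like this gives the model a noisy input and the exact voices it is supposed to pull back out.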
The result was a system that could be used to isolate voices in environments with multiple humans talking. The only condition is that the speaker’s face must be visible on screen, so the AI can match one of the multiple voice tracks to a certain face and prioritize it over the rest.
Google expects to deploy this tech in its products
This Google-developed system has yielded spectacular results, and the company expects to use it for various types of products in the future.
“We envision a wide range of applications for this technology,” Google said. “We are currently exploring opportunities for incorporating it into various Google products.”
For example, this tech could be used to enhance the speech recognition prowess of home assistant/smart speaker tech (like Google Home), show real-time text captions inside Google Glass for deaf users, improve YouTube’s text captioning system for videos with loud crowd noise, or show real-time captions inside video conferencing software.
Furthermore, this tech has applicability far beyond Google products. The system can also be deployed with CCTV systems to help authorities isolate a single person’s voice inside noisy audio tracks recorded by video surveillance cameras.
The researchers have detailed their findings on the project’s website and in a research paper titled “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.” The research team has also released videos showing off the new system’s capabilities in isolating voices.