Imagine you're playing a board game, say Codenames. While one team guesses which tiles belong to them, players on the other team strike up small side conversations. Amid all the chatter around you, you suddenly wonder: is there a way to separate those overlapping voices so you can clearly hear each conversation?
Now, if you're a linguistics enthusiast, another thought might strike you: what if you could use technology to analyze speech sounds, specifically vowel formants (the resonant frequency peaks that make vowels sound different from one another)? Well, you're not alone.
Researchers J. Stanley, L. Johnson, and E. Brown had the same idea. They were curious about how well current audio separation tools could isolate voices from each other, especially when social factors come into play.
To test this out, they brought in two speakers, referred to as “Olivia” and “Tyler,” and had each of them record 300 sentences (600 total) in a soundproof booth. These clean recordings were used to extract vowel formants—let’s call this set Extraction 1.
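If you're curious what "extracting vowel formants" looks like in practice, here's a minimal sketch using praat-parselmouth, a Python wrapper around Praat. The paper doesn't specify its exact toolchain, so treat this as illustrative; the file name and measurement time are placeholders, not details from the study:

```python
# Minimal sketch of formant extraction with praat-parselmouth
# (pip install praat-parselmouth). The file name and the vowel
# midpoint time are placeholders, not values from the study.
import parselmouth

sound = parselmouth.Sound("olivia_sentence_001.wav")  # hypothetical file

# Burg-method formant tracking, Praat's standard approach
formants = sound.to_formant_burg(time_step=0.01, max_number_of_formants=5)

vowel_midpoint = 0.42  # seconds; in practice this comes from forced alignment
f1 = formants.get_value_at_time(1, vowel_midpoint)  # first formant (Hz)
f2 = formants.get_value_at_time(2, vowel_midpoint)  # second formant (Hz)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```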
Then things got interesting: they overlapped Olivia's and Tyler's audio to simulate natural conversation and ran the mixtures through three different audio separation models, Libri2mix, Whamr16K, and WSJ02mix, to try to untangle the voices. From the separated tracks, they extracted a second set of vowel formants (Extraction 2) at the same timestamps as Extraction 1. They then compared Extraction 2 to the gold-standard measurements from Extraction 1.
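Those three names happen to match SpeechBrain's pretrained SepFormer checkpoints, which are named after the datasets they were trained on. Assuming that's the toolkit in play (an assumption on my part, not something the summary confirms), the mix-then-separate step might look roughly like this, with placeholder file names:

```python
# Hedged sketch of mixing two recordings and separating them with a
# pretrained SpeechBrain SepFormer model (pip install speechbrain torchaudio).
# The model names in the study match these checkpoints, but this is an
# assumed workflow, not the authors' documented pipeline.
import torchaudio
from speechbrain.inference.separation import SepformerSeparation
# (older SpeechBrain versions: from speechbrain.pretrained import SepformerSeparation)

# Overlap the two clean booth recordings to simulate simultaneous speech
olivia, sr = torchaudio.load("olivia_sentence_001.wav")  # hypothetical files
tyler, _ = torchaudio.load("tyler_sentence_001.wav")
n = min(olivia.shape[1], tyler.shape[1])
mixture = olivia[:, :n] + tyler[:, :n]
mixture = mixture / mixture.abs().max()  # normalize to avoid clipping
torchaudio.save("mixture.wav", mixture, sr)

# Load one of the three checkpoints and untangle the voices
model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-whamr16k",  # or -libri2mix / -wsj02mix
    savedir="pretrained_models/sepformer-whamr16k",
)
est_sources = model.separate_file(path="mixture.wav")  # (batch, time, n_speakers)

torchaudio.save("speaker1_hat.wav", est_sources[:, :, 0].detach().cpu(), 16000)
torchaudio.save("speaker2_hat.wav", est_sources[:, :, 1].detach().cpu(), 16000)
```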
How did the models do?
- Libri2mix came out on top. It produced clean audio with low distortion and very little confusion between the speakers.
- Whamr16K came in second. It wasn’t perfect but still gave fairly clean results.
- WSJ02mix struggled the most. Its output was distorted, and it had trouble keeping Olivia's and Tyler's voices separate. (A sketch of how such distortion and speaker confusion are typically scored follows this list.)
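How do you put numbers on "distortion" and "confusion between the speakers"? Source separation work commonly reports BSS-eval metrics such as SDR (signal-to-distortion ratio) and SIR (signal-to-interference ratio). Here's a generic sketch using the mir_eval library; it illustrates the standard metrics, not necessarily the authors' exact evaluation:

```python
# Generic sketch of scoring separation quality with BSS-eval metrics
# (pip install mir_eval). Dummy data stands in for real audio; this is
# an illustration, not the authors' evaluation procedure.
import numpy as np
import mir_eval.separation

# reference: the original clean recordings; estimated: the model outputs.
# Both arrays have shape (n_sources, n_samples).
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))
estimated = reference + 0.1 * rng.standard_normal((2, 16000))

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
# SDR: overall distortion; SIR: how much of the *other* speaker leaks in
# (low SIR corresponds to the "confusion between speakers" described above).
print(f"SDR = {sdr}, SIR = {sir}, permutation = {perm}")
```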
One major finding was that Olivia's separated audio showed significant differences in her front vowels, suggesting the models aren't just separating voices but subtly altering them as well.
So, what does this all mean?
1. Audio separation tools could help linguists analyze speech even when voices overlap, opening up new possibilities for sociophonetic research (studying how social factors affect speech).
2. Because the separated voices were subtly altered in ways that resist explanation, researchers should rely on averages of vowel data rather than on individual speech moments, since individual token differences were inconsistent and unpredictable. (A sketch of that averaging approach follows this list.)
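In practice, that recommendation amounts to comparing per-vowel means instead of per-token values. Here's a minimal pandas sketch with hypothetical column and file names:

```python
# Minimal sketch of the "compare averages, not tokens" recommendation
# (pip install pandas). The file and column names are hypothetical.
import pandas as pd

# One row per vowel token, with F1/F2 from the clean (Extraction 1)
# and separated (Extraction 2) audio measured at the same timestamps.
df = pd.read_csv("formants.csv")  # hypothetical file

# Per-token differences: noisy and, per the study, unpredictable
df["f1_diff"] = df["f1_separated"] - df["f1_clean"]
df["f2_diff"] = df["f2_separated"] - df["f2_clean"]

# Per-vowel means: the more stable comparison the authors suggest
summary = df.groupby(["speaker", "vowel"])[
    ["f1_clean", "f1_separated", "f2_clean", "f2_separated"]
].mean()
print(summary)
```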
Final thought:
Next time you're watching a movie with overlapping dialogue, think like a linguist. With the right tools—like Libri2mix—you might just be able to pull apart the voices and uncover hidden layers of meaning, both in the acoustics and in the plot.
Find the full article here.