Automatic speech recognition rivals humans in noisy environments


Automatic speech recognition (ASR) has made incredible advances in the past few years, especially for widely spoken languages such as English. Prior to 2020, it was typically assumed that human abilities for speech recognition far exceeded those of automatic systems, yet some current systems have started to match human performance.

The goal in developing ASR systems has always been to lower the error rate, regardless of how people perform in the same environment. After all, not even people will recognize speech with 100% accuracy in a noisy environment.
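
In ASR research, that error rate is almost always the word error rate (WER): the number of word substitutions, deletions, and insertions in a transcript divided by the number of words actually spoken. Here is a minimal sketch of that computation; the function name and example sentences are illustrative and not taken from the study.

```python
# Minimal word error rate (WER) sketch: the standard ASR evaluation metric.
# Names and example sentences are illustrative, not from the paper.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of eight gives a WER of 0.125 (12.5%).
print(word_error_rate("the cat sat quietly on the old mat",
                      "the cat sat quietly on the cold mat"))
```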

In a new study, UZH computational linguistics specialist Eleanor Chodroff and Chloe Patman, a fellow researcher from the University of Cambridge, compared two popular ASR systems, Meta’s wav2vec 2.0 and OpenAI’s Whisper, against native British English listeners. They tested how well the systems recognized speech presented in speech-shaped noise (a static-like noise) or pub noise, and produced with or without a cotton face mask.
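
The study’s own evaluation code is not shown here, but a comparison along these lines can be sketched with the Hugging Face transformers pipeline. The model identifiers below are public checkpoints corresponding to the systems named above; the audio filename is a placeholder, and this is an illustrative sketch rather than the authors’ setup.

```python
# Sketch: transcribe the same noisy recording with both systems via the
# Hugging Face `transformers` pipeline. The audio file name is a placeholder.
from transformers import pipeline

AUDIO = "speech_in_pub_noise.wav"  # placeholder: any mono 16 kHz recording

# wav2vec 2.0, trained on the 960-hour LibriSpeech English corpus
wav2vec = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-large-960h")

# Whisper large-v3, the system that outperformed listeners in most conditions
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v3")

print("wav2vec 2.0:", wav2vec(AUDIO)["text"])
print("Whisper    :", whisper(AUDIO)["text"])
```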

The study is published in the journal JASA Express Letters.

Latest OpenAI system better—with one exception

The researchers found that humans still maintained the edge against both ASR systems. However, OpenAI’s most recent large ASR system, Whisper large-v3, significantly outperformed human listeners in all tested conditions except naturalistic pub noise, where it was merely on par with humans. Whisper large-v3 has thus demonstrated its ability to process the acoustic properties of speech and successfully map them to the intended message (i.e., the sentence).

“This was impressive as the tested sentences were presented out of context, and it was difficult to predict any one word from the preceding words,” Chodroff says.

Vast training data

A closer look at the ASR systems and how they’ve been trained shows that humans are nevertheless doing something remarkable. Both tested systems involve deep learning, but the most competitive system, Whisper, requires an incredible amount of training data.

Meta’s wav2vec 2.0 was trained on 960 hours (or 40 days) of English audio data, while the default Whisper system was trained on over 75 years of speech data. The system that actually outperformed human ability was trained on over 500 years of nonstop speech.
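
As a back-of-the-envelope check of those figures: the hour counts assumed below for the two Whisper models are the commonly cited ones (roughly 680,000 hours for the default model and about 5 million hours for large-v3), stated here as assumptions rather than figures from the paper.

```python
# Back-of-the-envelope conversion of training-data hours to days and years.
# The Whisper hour counts are assumptions (commonly cited figures), not
# numbers taken from the study itself.
HOURS_PER_YEAR = 24 * 365

for name, hours in [("wav2vec 2.0", 960),
                    ("Whisper (default)", 680_000),
                    ("Whisper large-v3", 5_000_000)]:
    print(f"{name}: {hours:,} h "
          f"≈ {hours / 24:,.0f} days "
          f"≈ {hours / HOURS_PER_YEAR:,.1f} years")
```

Run as written, this reproduces the figures quoted above: 960 hours is 40 days, 680,000 hours is roughly 78 years, and 5 million hours is roughly 570 years of nonstop speech.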

“Humans are capable of matching this performance in just a handful of years,” says Chodroff. “Considerable challenges also remain for automatic speech recognition in almost all other languages.”

Different types of errors

The paper also reveals that humans and ASR systems make different types of errors. English listeners almost always produced grammatical sentences, but were more likely to write only sentence fragments rather than attempt a word-for-word transcription of the entire spoken sentence.

In contrast, wav2vec 2.0 frequently produced gibberish in the most difficult conditions. Whisper also tended to produce full grammatical sentences, but was more likely to “fill in the gaps” with completely wrong information.

More information:
Chloe Patman et al., Speech recognition in adverse conditions by humans and machines, JASA Express Letters (2024). DOI: 10.1121/10.0032473

Provided by
University of Zurich
