StethoSpeech: Turning non-audible murmurs into clear, intelligible speech

Researchers at IIIT Hyderabad have developed technology that converts murmurs into understandable speech using a wireless stethoscope.

Published Oct 23, 2024 | 7:00 AM · Updated Oct 23, 2024 | 7:00 AM

Turning Non-Audible Murmurs into Clear Voices

Imagine if you could talk without making any sound, and a special device helped others hear you! That’s what researchers from the International Institute of Information Technology (IIIT) Hyderabad have created.

A team of researchers from IIIT Hyderabad has developed a Silent Speech Interface (SSI) capable of converting non-audible murmurs into vocalised speech.

This innovative technology, known as “StethoSpeech,” uses a wireless stethoscope to translate behind-the-ear vibrations into intelligible speech, even in challenging environments.

The research was led by Neil Shah, a PhD student and researcher at the Centre for Visual Information Technology (CVIT) at IIITH. With Neha Sahipjohn and Vishal Tambrahalli as team members, the research was guided by Dr. Ramanathan Subramanian and Prof. Vineet Gandhi.

Their findings were published in a paper titled “StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin.”


How it works

Silent Speech Interfaces (SSI) allow communication without producing audible sound.

Traditional SSI methods, like lip reading, are limited in scope and real-time functionality. But IIITH’s innovative approach uses a simple off-the-shelf stethoscope attached behind the ear to detect Non-Audible Murmurs (NAM) — subtle vibrations that occur during whispered speech.

These vibrations are then transmitted via Bluetooth to a mobile device, where they are converted into clear, vocalised speech through a specialised model.
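The flow described above has three stages: capture the NAM vibrations at the stethoscope, stream them over Bluetooth to a mobile device, and run a vibration-to-speech model there. A minimal sketch of that flow, where all function names and the placeholder model are illustrative assumptions rather than the team's actual implementation:

```python
# Hypothetical sketch of the StethoSpeech pipeline described above.
# The stages and names are illustrative; the real system uses a trained
# neural model and an actual Bluetooth link.

def capture_nam_vibrations(duration_s, sample_rate=16000):
    """Stand-in for the stethoscope sensor: returns a raw vibration signal."""
    # Real hardware would stream live sensor samples; here we return silence.
    return [0.0] * int(duration_s * sample_rate)

def transmit_over_bluetooth(signal):
    """Stand-in for the wireless link to the mobile device."""
    # In practice the signal would be chunked into packets and reassembled.
    return list(signal)

def vibrations_to_speech(signal):
    """Stand-in for the trained NAM-to-speech model running on the phone."""
    # The real model maps vibration features to a vocalised waveform;
    # here we simply emit a placeholder waveform of the same length.
    return [0.0 for _ in signal]

raw = capture_nam_vibrations(duration_s=2)
received = transmit_over_bluetooth(raw)
speech = vibrations_to_speech(received)
print(len(speech))  # same number of samples as were captured
```

The point of the sketch is only the division of labour: sensing stays on the stethoscope, and all heavy computation happens on the paired mobile device.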

The aim of the innovation was to improve social interactions for people with voice disorders. To that end, the team used an off-the-shelf stethoscope attached to the skin behind the ear to convert the vibrations it picks up into intelligible speech.

“Such vibrations are referred to as Non-Audible Murmurs (NAM)”, Prof. Gandhi told South First.

The IIITH team curated a dataset of NAM vibrations, which they have labelled the Stethotext corpus, collected under noisy conditions such as an everyday office environment as well as high-noise scenarios of the kind experienced at a concert.

These vibrations were paired with their corresponding text. “We asked people to read out some text – all while murmuring. So we know the text and we captured the vibrations. In this way, we trained our model to convert the vibrations into speech,” says Prof. Gandhi.
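The setup Prof. Gandhi describes amounts to supervised learning on paired examples: each murmured recording is stored alongside the text the speaker was asked to read. A toy illustration of how such a corpus might be structured (the field names and values here are hypothetical, not from the Stethotext corpus itself):

```python
# Toy illustration of a paired corpus: each entry holds the known prompt
# text and the vibration signal captured while it was murmured.
corpus = [
    {"text": "hello world", "vibration": [0.01, -0.02, 0.03]},
    {"text": "good morning", "vibration": [0.00, 0.02, -0.01]},
]

# A model would be trained on these pairs to map each vibration sequence
# to speech whose spoken content matches the known text.
for example in corpus:
    assert example["text"] and example["vibration"]
print(len(corpus))  # number of paired training examples
```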

What sets it apart?

What sets IIITH’s system apart is its minimalistic design and effectiveness in a “zero-shot” setting, meaning the device can accurately convert murmurs into speech even for users whose data was not previously used to train the model.

The speech conversion takes less than 0.3 seconds for a 10-second vibration segment, enabling real-time communication, even while the user is moving.

What’s also unique about this solution is that users can choose the kind of voice their output speech is rendered in. For instance, they can select an accent, say English spoken with a pronounced South Indian accent, or a gender, a male or female voice, and the speech is produced accordingly.

“We’ve also demonstrated through this research that we can build person-specific models,” remarks Prof. Gandhi.

This means that with just four hours of murmuring data recorded from a person, a specialised model that converts NAM into speech can be built for that individual.

“Other researchers have also converted whispers into speech but we were able to get great accuracy in our output,” says Neil.


Significant breakthrough?

One other use case that the team has demonstrated is communication in high-noise environments like a rock concert where even normal speech is unintelligible.

The researchers also note that it could help decipher the discreet communication typically used by security personnel, such as the Secret Service.

“Our work is a game changer in the sense that all previous studies have assumed that clean speech is available corresponding to the vibrations one is recording. But if someone is disabled or is speech impaired, we won’t have their corresponding speech. That is the fundamental difference in our case – we don’t assume that clean speech of a speech-impaired person is available in order to train our models,” explained Gandhi.

While previous works were rather experimental in nature, the output in those cases was “nowhere close to the kind of performance our models are demonstrating in terms of clean speech,” emphasises Prof. Gandhi.

While the team has not conducted any experiments on medical patients yet, they are actively looking for collaborations with hospitals to record data from patients. “At this point, it’s super exciting to think that we can give a voice to someone who has lost their own,” muses Prof. Gandhi.

(Edited by Neena)
