Audio-Visual Speech Recognition

We are working on designing and implementing structured hybrid models for audiovisual speech processing.

Audiovisual speech recognition

The project is funded by the DFG (German Research Foundation) and is both a great scientific challenge and an opportunity for our group to develop exciting new technologies and applications for humans based on models of human signal processing.

At present, automatic speech recognition works mostly acoustically. Microphones convert pressure fluctuations into electrical signals, which computer programs translate into a transcript of the spoken language. In environments with low ambient noise, this works very well. In extremely loud environments, this is no longer the case. Machines are not alone in struggling here: people also find listening in noise very exhausting after a short time, and much of the information can get lost. People then often unconsciously direct their gaze to the speaker's face. In particular, the mouth movements, the position of the lower jaw, the lips, and movements of the neck can provide valuable clues for deciphering what is being said. This is a successful strategy in human listening in noise, and we are adapting it to automatic speech recognition in this research project.
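As a rough illustration of how a recognizer could lean more on the video stream as acoustic noise increases, the sketch below combines per-class log-probabilities from an audio model and a video model with a single reliability weight. It is a minimal example of log-linear stream weighting under stated assumptions, not necessarily the method used in the publications listed here; the function names, the weight `audio_weight`, and all scores are hypothetical.

```python
import math


def log_softmax(scores):
    """Normalize raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def fuse_streams(audio_logp, video_logp, audio_weight):
    """Log-linear fusion of two streams.

    audio_weight is a hypothetical reliability weight in [0, 1]:
    1.0 trusts only the audio stream, 0.0 trusts only the video stream.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_logp) != len(video_logp):
        raise ValueError("both streams must score the same classes")
    combined = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_logp, video_logp)
    ]
    # Renormalize so the fused scores are again log-probabilities.
    return log_softmax(combined)


def best_class(logp):
    """Index of the most probable class."""
    return max(range(len(logp)), key=logp.__getitem__)


# Made-up scores for three candidate words. The noisy audio stream is
# almost uninformative, while the video stream clearly favors class 2.
audio = log_softmax([0.2, 0.1, 0.0])
video = log_softmax([0.0, 0.5, 2.0])

fused_noisy = fuse_streams(audio, video, audio_weight=0.2)  # trust video more
fused_audio_only = fuse_streams(audio, video, audio_weight=1.0)
```

In a real system the weight would not be fixed by hand. It would typically be estimated from the data, for example from a measure of the acoustic signal quality, so that the balance between the streams follows the noise conditions.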

Recent publications:

W. Yu, S. Zeiler, D. Kolossa: “Fusing Information Streams in End-to-End Audio-Visual Speech Recognition,” accepted for publication at ICASSP 2021.

W. Yu, S. Zeiler, D. Kolossa: “Multimodal integration for large-vocabulary audio-visual speech recognition,” Proc. European Signal Processing Conference (EUSIPCO) 2020.

In addition to speech recognition, our methods also improve numerous classical audio signal processing applications. These include noise reduction and source separation. In our research, we want to understand which strategies living organisms use to combine separate information sources, e.g. audio and video, in an optimal way. Our goal is to apply our research directly in practical applications, e.g. to improve quality of life. An example of a medical application of this technology is the project AVATAR.


The therapy of childhood phonetic-phonological articulation disorders is often lengthy and requires intensive and sustained practice in the home environment by the affected children and parents. It is often a challenge for speech therapists to put together motivating exercises that are suitable for children. Therefore, this research project will first address this target group, with the clearly recognizable potential to transfer the results to other relevant groups such as patients with a migration background or elderly people.

The aim of the project, which is headed by Prof. Dr. Jörg Thiem, Department of Information Technology at the Dortmund University of Applied Sciences, is to develop a motivating learning environment in the form of a technical assistance system (“speech therapy assistant”) to support therapy for childhood articulation disorders. The system enables motivating, computer-based therapy exercises in a home environment using a mobile device (e.g. app on tablet/PC), which complements the regular sessions with the therapist and promotes independence and the transfer of the therapy into everyday life.

Please see our recent publications for more details:

Denoising and source separation

In addition to speech recognition, our methods can also be used to improve classic audio signal processing applications. These include, among others, denoising, speech enhancement and source separation. Two examples are given below. In the first example, you can hear the difference in noise suppression between an established audio-only method and our new video-enhanced method. The second example takes a source separation problem that is difficult to solve with classical audio-only methods and shows the advantage that additional video information brings to signal processing.

Why it works

The image data not only help in localizing the speaker but also contain complementary information about the place of articulation and the segmentation of the spoken words. Until a few years ago, however, comprehensive use of video information was hardly possible due to a lack of sensor technology and of the necessary compute power. With the increasing availability of multimodal and especially audio-visual speech data (whether in Internet telephony, in current smartphones, in voice- and gesture-controlled computer games, or in the many new multimedia data on the Internet), the use of video data in traditional audio signal processing and audio classification tasks is becoming viable and widespread.

One early example, where we have shown notable gains in speech intelligibility, can be found in our 2016 Interspeech paper: