Multimodal data fusion is an important aspect of many object localization and tracking frameworks that rely on sensory observations from different sources. A prominent example is audiovisual speaker localization, where incorporating visual information has been shown to improve overall performance, especially in adverse acoustic conditions. Recently, dynamic stream weights have been introduced into this field as an efficient data fusion technique. Originally proposed in the context of audiovisual automatic speech recognition, dynamic stream weights allow for effective sensory-level data fusion on a per-frame basis, provided that reliability measures for the individual sensory streams are available. One of our recent studies proposes a learning framework for dynamic stream weights based on natural evolution strategies, which does not require the explicit computation of oracle information. An experimental evaluation on recorded audiovisual sequences shows that the proposed approach outperforms conventional methods based on supervised training in terms of localization performance.
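To make the idea of per-frame fusion more concrete, the following minimal sketch combines an acoustic and a visual observation of a speaker's azimuth by weighting their log-likelihoods with a stream weight between 0 and 1. It is an illustration under stated assumptions, not the implementation used in our papers: the Gaussian observation models, the standard deviations, the one-degree candidate grid, and the names `gaussian_log_likelihood` and `fuse_frame` are all hypothetical, and the stream weights are given by hand rather than predicted from reliability measures.

```python
import math


def gaussian_log_likelihood(observation, candidate, std):
    """Log-density of a 1-D Gaussian centred on the candidate position."""
    var = std ** 2
    return (-0.5 * math.log(2.0 * math.pi * var)
            - (observation - candidate) ** 2 / (2.0 * var))


def fuse_frame(audio_obs, video_obs, stream_weight, candidates,
               audio_std=15.0, video_std=5.0):
    """Return the candidate azimuth with the highest weighted log-likelihood.

    stream_weight lies in [0, 1]: 1.0 trusts audio only, 0.0 trusts video only.
    The observation models and default standard deviations are assumptions
    made for this sketch.
    """
    if not 0.0 <= stream_weight <= 1.0:
        raise ValueError("stream_weight must lie in [0, 1]")

    def fused(candidate):
        log_audio = gaussian_log_likelihood(audio_obs, candidate, audio_std)
        log_video = gaussian_log_likelihood(video_obs, candidate, video_std)
        # Weighted sum of log-likelihoods, i.e. a weighted product of likelihoods.
        return stream_weight * log_audio + (1.0 - stream_weight) * log_video

    return max(candidates, key=fused)


# Candidate azimuths in degrees (hypothetical one-degree grid).
candidates = list(range(-90, 91))

# One (audio observation, video observation, stream weight) triple per frame.
# A low weight mimics an unreliable acoustic frame, a high weight an
# unreliable visual frame (for example, an occluded face).
frames = [(10.0, 12.0, 0.5), (40.0, 13.0, 0.1), (11.0, -60.0, 0.9)]
estimates = [fuse_frame(a, v, w, candidates) for a, v, w in frames]
print(estimates)
```

Because the weight is chosen separately for every frame, the fused estimate can follow whichever modality is currently more reliable. In a full system the weights would be predicted from reliability measures instead of being fixed by hand.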
Our recent work at ICASSP 2021 additionally introduces the notion of spatial stream weighting, which yields a significant further improvement.
For further information, see our recent papers:
J. Wissing, B. Boenninghoff, D. Kolossa, T. Ochiai, M. Delcroix, K. Kinoshita, T. Nakatani, S. Araki, C. Schymura: “Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain,” ICASSP 2021.
C. Schymura, D. Kolossa: “Audiovisual Speaker Tracking using Nonlinear Dynamical Systems with Dynamic Stream Weights,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 28, pp. 1065-1078, March 2020.