BeAts takes as input the raw speech waveform, annotated with speech act class labels, together with the transcribed Bengali-to-English translation text. Each modality is positionally embedded and fed to a transformer architecture for sequence modeling. The respective outputs are passed to a multimodal fusion block comprising two separate schemes: (i) an optimal transport kernel (OTK) based attention, and (ii) a multimodal fusion transformer. The output of this fusion block is passed through fully connected layers for the classification task.
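As a rough illustration of the fusion-and-classify stage described above, the sketch below implements only the multimodal fusion transformer scheme followed by fully connected layers; the OTK-based attention branch is omitted, and the layer sizes, pooling choice, and number of speech act classes are assumptions rather than the authors' settings.

# Illustrative sketch of the fusion block + classifier (not the released BeAts code).
import torch
import torch.nn as nn

class MultimodalFusionClassifier(nn.Module):
    def __init__(self, audio_dim=768, text_dim=512, d_model=256, n_classes=4):
        super().__init__()
        # Project each modality's transformer outputs to a shared width.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Multimodal fusion transformer: self-attention over the
        # concatenated audio and text token sequences.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Fully connected head for speech act classification.
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T_a, audio_dim), text_feats: (B, T_t, text_dim),
        # i.e. the sequence outputs of the audio and text transformers.
        fused_in = torch.cat([self.audio_proj(audio_feats),
                              self.text_proj(text_feats)], dim=1)
        fused = self.fusion(fused_in)      # (B, T_a + T_t, d_model)
        pooled = fused.mean(dim=1)         # temporal mean pooling (assumed)
        return self.classifier(pooled)     # speech act logits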
Spoken languages often utilise intonation, rhythm, intensity, and structure to communicate intention, which can be interpreted differently depending on the rhythm of their utterance. These speech acts provide the foundation of communication and are unique in expression to the language. Recent advancements in attention-based models, demonstrating their ability to learn powerful representations from multilingual datasets, have performed well in speech tasks and are ideal for modeling specific tasks in low-resource languages. Here, we develop a novel multimodal approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, using multimodal attention fusion to predict speech acts in our prepared Bengali speech corpus. We also show that our model BeAts (Bengali Speech Acts Recognition using Multimodal Attention Fusion) significantly outperforms both the unimodal baseline using only speech data and a simpler bimodal fusion using both speech and text data.
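For concreteness, the following sketch shows one way the two pretrained backbones named above could be used as feature extractors via Hugging Face Transformers; the checkpoint names (facebook/wav2vec2-base, Helsinki-NLP/opus-mt-bn-en), the dummy inputs, and the encoder-only use of MarianMT are assumptions and may differ from the exact BeAts setup.

# Assumed feature-extraction setup with pretrained wav2vec 2.0 and MarianMT encoders.
import numpy as np
import torch
from transformers import (Wav2Vec2Model, Wav2Vec2FeatureExtractor,
                          MarianMTModel, MarianTokenizer)

# Audio backbone: wav2vec 2.0 (checkpoint name is an assumption).
audio_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Text backbone: MarianMT Bengali-to-English (checkpoint name is an assumption).
text_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-bn-en")
text_tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-bn-en")

waveform = np.random.randn(16000).astype("float32")  # 1 s of dummy 16 kHz audio
audio_in = audio_fe(waveform, sampling_rate=16000, return_tensors="pt")
text_in = text_tok(["a Bengali-to-English translated utterance"],
                   return_tensors="pt")

with torch.no_grad():
    # Sequence features from each modality, typically (1, T_a, 768) and (1, T_t, 512).
    audio_feats = audio_model(**audio_in).last_hidden_state
    text_feats = text_model.get_encoder()(**text_in).last_hidden_state

These two feature sequences would then be passed to a fusion block such as the MultimodalFusionClassifier sketched earlier.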
@inproceedings{deb2023beats,
title={BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion},
author={Deb, Ahana and Nag, Sayan and Mahapatra, Ayan and Chattopadhyay, Soumitri and Marik, Aritra and Gayen, Pijush Kanti and Sanyal, Shankha and Banerjee, Archi and Karmakar, Samir},
booktitle={Interspeech},
year={2023}
}