Gated Cross-Modal Fusion Mechanism for Audio-Video-based Emotion Recognition
Received: 28 October 2024 | Revised: 22 November 2024 | Accepted: 27 November 2024 | Online: 3 April 2025
Corresponding author: Himanshu Kumar
Abstract
Artificial emotion recognition from video and audio modalities has attracted considerable attention due to its potential applications in security, surveillance, mental health monitoring, and human-computer interaction. This study focuses on optimal cross-modal fusion techniques to enhance the precision and robustness of multimodal audio-video emotion recognition. Specifically, it introduces a gated cross-modal fusion mechanism, termed Compact Bilinear Gated Pooling (CBGP). The novelty of this work is that CBGP fusion is applied to the emotion recognition task for the first time to integrate, and reduce the dimensionality of, features extracted from the audio and video modalities by 1D-CNN and 3D-CNN deep neural architectures, respectively. The approach was evaluated on three benchmark datasets, CMU-MOSEI, RAVDESS, and IEMOCAP, each containing multimodal data covering a range of emotions, including happiness, sadness, fear, anger, neutrality, and disgust. Experimental results show that CBGP consistently outperformed state-of-the-art fusion techniques such as early fusion, late fusion, and hybrid fusion. Its dynamic gating mechanism selectively emphasizes relevant feature interactions, extracting more discriminative features and yielding higher accuracy and F1 scores. These findings suggest that integrating gating mechanisms into the fusion process is vital for improving emotion recognition. Future work will extend these results to real-time applications, explore multitask learning frameworks, and enhance the interpretability of multimodal emotion recognition systems.
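The abstract does not give the exact CBGP formulation, but the named ingredients are standard: compact bilinear pooling approximates the outer product of two modality embeddings via Count Sketch projections multiplied in the FFT domain, and a sigmoid gate then modulates the fused vector. The NumPy sketch below illustrates this combination under stated assumptions; the function name, dimensions, and randomly drawn gate weights are illustrative placeholders, not the authors' trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch(x, h, s, d):
    """Project feature vector x into R^d via Count Sketch (hash h, signs s)."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add signed entries into hashed buckets
    return y

def gated_compact_bilinear_pooling(audio, video, d=64):
    """Hypothetical sketch of a CBGP-style fusion: compact bilinear pooling
    of two modality vectors, modulated by a sigmoid gate. Hash tables and
    gate weights are drawn randomly here; in a real model the gate
    parameters would be learned during training."""
    n_a, n_v = audio.shape[0], video.shape[0]
    # Fixed Count Sketch tables, one per modality
    h_a, s_a = rng.integers(0, d, n_a), rng.choice([-1.0, 1.0], n_a)
    h_v, s_v = rng.integers(0, d, n_v), rng.choice([-1.0, 1.0], n_v)
    # FFT-domain product of the two sketches approximates the outer product
    fused = np.fft.irfft(np.fft.rfft(count_sketch(audio, h_a, s_a, d)) *
                         np.fft.rfft(count_sketch(video, h_v, s_v, d)), n=d)
    # Gate: sigmoid over a linear map of the concatenated modality inputs,
    # selectively emphasizing relevant feature interactions
    W = rng.standard_normal((d, n_a + n_v)) * 0.01
    gate = 1.0 / (1.0 + np.exp(-W @ np.concatenate([audio, video])))
    return gate * fused

audio_feat = rng.standard_normal(32)   # stand-in for a 1D-CNN audio embedding
video_feat = rng.standard_normal(48)   # stand-in for a 3D-CNN video embedding
z = gated_compact_bilinear_pooling(audio_feat, video_feat)
print(z.shape)  # (64,)
```

The gated output `z` would then feed a classifier head over the emotion labels; the gate is what distinguishes this from plain compact bilinear pooling, since it can suppress uninformative cross-modal interactions per sample.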
Keywords:
attention mechanism, bilinear pooling, emotion recognition, feature extraction, fusion strategies
License
Copyright (c) 2025 Himanshu Kumar, Martin Aruldoss

This work is licensed under a Creative Commons Attribution 4.0 International License.