Integrating Facial Emotion Recognition, Speech to Text Transcription, and Natural Language Processing for Customer Satisfaction Analysis from Video Reviews

Authors

  • Sudhindra B. Deshpande Department of AIML, Anuvartik Mirji Bharatesh Institute of Technology, Belagavi, Karnataka, India
  • Goh Kah Ong Michael Center for Image and Vision Computing, COE for Artificial Intelligence, Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka, Malaysia
  • Uttam U. Deshpande Department of Electronics & Communication, KLS, Gogte Institute of Technology, Belagavi, Karnataka, India
  • K. S. Mathad Department of Information Science, KLS, Gogte Institute of Technology, Belagavi, Karnataka, India
  • N. V. Karekar Department of Information Science, KLS, Gogte Institute of Technology, Belagavi, Karnataka, India
  • Kiran K. Tangod Department of CSE (Artificial Intelligence & Machine Learning), Kasegaon Education Society's Rajarambapu Institute of Technology, affiliated to Shivaji University, Sakharale, India
Volume: 16 | Issue: 2 | Pages: 34615-34622 | April 2026 | https://doi.org/10.48084/etasr.15095

Abstract

Customer satisfaction is a decisive factor in the success of products and services, yet conventional text-based reviews often fail to capture the full spectrum of user emotions needed to assess it. Video reviews, by contrast, offer a richer medium for evaluating customer satisfaction. To leverage this, the present study proposes a multimodal machine learning framework for video-based customer feedback analysis that integrates facial emotion recognition, speech-to-text transcription, and Natural Language Processing (NLP). A dataset of 1,000 video reviews was processed through a multistage pipeline comprising frame extraction, face detection, emotion classification, audio transcription, sentiment analysis, and late fusion of the modalities. Experimental results highlight the limitations of unimodal models: visual-only sentiment prediction achieved 62.3% accuracy (precision = 0.61, recall = 0.63, F1-score = 0.62, Area Under Curve (AUC) = 0.65), while audio-only sentiment prediction reached 59.5% accuracy (precision = 0.58, recall = 0.59, F1-score = 0.59, AUC = 0.61). The text-based model provided a stronger baseline at 72.1% accuracy (precision = 0.70, recall = 0.72, F1-score = 0.71, AUC = 0.75). In contrast, the multimodal fusion framework substantially outperformed all unimodal approaches, achieving 79.9% accuracy, precision = 0.80, recall = 0.81, F1-score = 0.80, and the highest AUC of 0.86. Additionally, aspect-level analysis revealed that camera quality (+0.16) was the most positively perceived feature, while app performance (-0.33) and delivery (-0.09) emerged as the primary concerns. Temporal analysis showed satisfaction scores fluctuating between 52.1 and 63.4 (on a 0-100 scale) over 20 weeks, underscoring the value of continuous monitoring. These findings demonstrate that multimodal video feedback analysis yields more comprehensive, reliable, and fair results than single-channel methods.
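The pipeline ends with a late fusion of the three modality outputs. A minimal sketch of how such score-level fusion might look is given below; the class set, the `late_fusion` helper, and the modality weights are illustrative assumptions (loosely reflecting that the text model was the strongest unimodal baseline), not the paper's reported parameters.

```python
# Minimal sketch of score-level ("late") fusion. Each unimodal model is
# assumed to emit class probabilities over (negative, neutral, positive);
# the fused label is the class with the highest weighted-average probability.
from typing import Dict, List, Optional

CLASSES = ("negative", "neutral", "positive")

def late_fusion(probs: Dict[str, List[float]],
                weights: Optional[Dict[str, float]] = None) -> str:
    """Return the class with the highest weighted-average probability."""
    if weights is None:
        # Illustrative weights only; text is weighted highest.
        weights = {"visual": 0.30, "audio": 0.25, "text": 0.45}
    # Renormalise over the modalities actually present, so a missing
    # channel (e.g. no face detected in any frame) degrades gracefully.
    total = sum(weights[m] for m in probs)
    fused = [sum(weights[m] * probs[m][i] for m in probs) / total
             for i in range(len(CLASSES))]
    return CLASSES[max(range(len(CLASSES)), key=fused.__getitem__)]

# A confident positive text signal outweighs a mildly negative visual one.
print(late_fusion({
    "visual": [0.50, 0.30, 0.20],
    "audio":  [0.30, 0.40, 0.30],
    "text":   [0.05, 0.15, 0.80],
}))  # positive
```

Weighted averaging of class probabilities is only one fusion strategy; a learned fusion layer or majority voting over hard labels are common alternatives.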

Keywords:

customer satisfaction, video feedback, emotion recognition, sentiment analysis, facial emotions, product feedback




How to Cite

S. B. Deshpande, G. K. O. Michael, U. U. Deshpande, K. S. Mathad, N. V. Karekar, and K. K. Tangod, “Integrating Facial Emotion Recognition, Speech to Text Transcription, and Natural Language Processing for Customer Satisfaction Analysis from Video Reviews”, Eng. Technol. Appl. Sci. Res., vol. 16, no. 2, pp. 34615–34622, Apr. 2026.

Metrics

Abstract Views: 80
PDF Downloads: 51
