Integrating Facial Emotion Recognition, Speech to Text Transcription, and Natural Language Processing for Customer Satisfaction Analysis from Video Reviews
Received: 25 September 2025 | Revised: 27 October 2025, 5 November 2025, 9 December 2025, and 11 December 2025 | Accepted: 13 December 2025 | Online: 31 March 2026
Corresponding author: Goh Kah Ong Michael
Abstract
Customer satisfaction is a decisive factor in the success of products and services, yet conventional text-based reviews often fail to capture the full spectrum of user emotions needed to assess it. Video product or service reviews, by contrast, offer a richer medium for evaluating customer satisfaction. To leverage this, the present study proposes a multimodal machine learning framework for video-based customer feedback analysis that integrates facial emotion recognition, speech-to-text transcription, and Natural Language Processing (NLP). A dataset of 1,000 video reviews was processed through a multistage pipeline comprising frame extraction, face detection, emotion classification, audio transcription, sentiment analysis, and late fusion of the modalities. Experimental results highlight the limitations of unimodal models: visual-only sentiment prediction achieved 62.3% accuracy (precision = 0.61, recall = 0.63, F1-score = 0.62, Area Under the Curve (AUC) = 0.65), while audio-only sentiment prediction reached 59.5% accuracy (precision = 0.58, recall = 0.59, F1-score = 0.59, AUC = 0.61). The text-based model provided a stronger baseline at 72.1% accuracy (precision = 0.70, recall = 0.72, F1-score = 0.71, AUC = 0.75). In contrast, the multimodal fusion framework substantially outperformed all unimodal approaches, achieving 79.9% accuracy, precision = 0.80, recall = 0.81, F1-score = 0.80, and the highest AUC of 0.86. Additionally, aspect-level analysis revealed that camera quality (+0.16) was the most positively perceived feature, while app performance (-0.33) and delivery (-0.09) emerged as the primary concerns. Temporal analysis showed satisfaction scores fluctuating between 52.1 and 63.4 (on a 0-100 scale) over 20 weeks, underscoring the value of continuous monitoring. These findings demonstrate that multimodal video feedback analysis yields more comprehensive, reliable, and fair assessments than single-channel methods.
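The pipeline described above (frame extraction, face detection, per-modality sentiment prediction, and late fusion) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes OpenCV for frame sampling and Haar-cascade face detection, stubs out the emotion, transcription, and text-sentiment models as hypothetical placeholders, and uses illustrative fusion weights rather than values reported in the paper.

```python
# Minimal late-fusion sketch (illustrative only; not the authors' exact pipeline).
# Assumes OpenCV for frame extraction and face detection; the three modality
# models below are hypothetical stubs returning placeholder probabilities.
import cv2
import numpy as np

CLASSES = ["negative", "neutral", "positive"]

def extract_frames(video_path: str, every_n: int = 30) -> list:
    """Sample one frame every `every_n` frames from the video."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

def detect_faces(frame) -> list:
    """Crop face regions using OpenCV's bundled Haar cascade."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [frame[y:y + h, x:x + w] for (x, y, w, h) in boxes]

def visual_sentiment(faces) -> np.ndarray:
    """Hypothetical stand-in for a facial-emotion classifier whose
    per-face emotion outputs are mapped to sentiment probabilities."""
    return np.array([0.2, 0.3, 0.5])  # placeholder probabilities

def audio_sentiment(video_path: str) -> np.ndarray:
    """Hypothetical stand-in for an acoustic sentiment model."""
    return np.array([0.3, 0.4, 0.3])  # placeholder probabilities

def text_sentiment(video_path: str) -> np.ndarray:
    """Hypothetical stand-in for speech-to-text + NLP sentiment."""
    return np.array([0.1, 0.2, 0.7])  # placeholder probabilities

def late_fusion(video_path: str, weights=(0.25, 0.15, 0.60)) -> str:
    """Weighted average of per-modality probabilities (weights illustrative)."""
    frames = extract_frames(video_path)
    faces = [f for fr in frames for f in detect_faces(fr)]
    probs = (weights[0] * visual_sentiment(faces)
             + weights[1] * audio_sentiment(video_path)
             + weights[2] * text_sentiment(video_path))
    return CLASSES[int(np.argmax(probs))]

if __name__ == "__main__":
    print(late_fusion("review.mp4"))  # hypothetical input file
```

In a real system, the three stubs would be replaced by trained models (for example, a CNN emotion classifier, a speech-to-text engine, and a transformer-based text sentiment model), and the fusion weights would be tuned on validation data; the paper's stronger text baseline suggests the text modality would carry the largest weight.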
Keywords:
customer satisfaction, video feedback, emotion recognition, sentiment analysis, facial emotions, product feedback
License
Copyright (c) 2026 Sudhindra B. Deshpande, Goh Kah Ong Michael, Uttam U. Deshpande, K. S. Mathad, N. V. Karekar, Kiran K. Tangod

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
