A Robust and Integrated Speech Recognition Tool for Dysarthria Patients Using Lip Movement Recognition

May Altulyan

doi:10.48084/etasr.17799

Authors

May Altulyan Department of Computer Engineering, College of Computer Science and Engineering, Prince Sattam Bin Abdulaziz University, Al Kharj, Saudi Arabia

Volume: 16 | Issue: 3 | Pages: 35660-35669 | June 2026 | https://doi.org/10.48084/etasr.17799

Received: 27 January 2026 | Revised: 7 March 2026, 4 April 2026, 11 April 2026, and 18 April 2026 | Accepted: 21 April 2026 | Online: 6 June 2026

Corresponding author: May Altulyan

Abstract

Recent advances in human-computer interaction have driven marked innovation in computer-aided tools for people with disabilities and survivors of neurological diseases. Dysarthria is a neurological disorder that affects the muscles involved in speech, impairing articulation and clarity. Brain tumors, cerebral palsy, Parkinson's disease, and head injuries can also affect the movement of the tongue, leading to unclear speech. Dysarthric speech is often unintelligible and poses an arduous challenge for voice recognition systems based on speech signal processing. This study presents a model based on lip movement recognition and transfer learning to develop a robust speech recognition tool for patients with dysarthria. Lip recognition is based on a 3D Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (BiLSTM) network for lip movement detection and speech recognition. The proposed speech recognition model was trained on the GRID audiovisual sentence corpus, and dysarthric speech can then be recognized using transfer learning. The lip recognition model converts the visual speech data (lip movements) to text, and transformers then analyze the resulting text for word prediction and grammar correction. The novelty of the proposed framework is that it not only recognizes speech but also refines the recognized text with a sequence-to-sequence T5 transformer model, improving speech recognition. The lip movement recognition model achieved an accuracy of 98.29% and a precision of 99.58%. The transformer grammar correction model reached an accuracy of 78%, which is attributed to limited training. The proposed integrated model is a novel approach that uses lip movements rather than acoustic speech data for speech recognition, demonstrating high performance.
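The abstract does not state how the per-frame outputs of the 3D CNN and BiLSTM are turned into text. Sentence-level lip readers of this kind (for example LipNet, which pairs spatiotemporal convolutions with bidirectional recurrent layers) are commonly trained with Connectionist Temporal Classification (CTC), and the minimal sketch below assumes that setup. It shows greedy (best-path) CTC decoding in plain Python: take the most likely symbol at each video frame, collapse consecutive repeats, and drop blanks. The alphabet, the blank index, and the `one_hot` helper are hypothetical choices for illustration, not the author's configuration.

```python
from typing import List, Optional, Sequence

# Assumed output alphabet (illustrative only): index 0 is the CTC blank,
# followed by space and the lowercase letters a-z.
BLANK = 0
ALPHABET = ["<blank>", " "] + [chr(c) for c in range(ord("a"), ord("z") + 1)]


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]]) -> str:
    """Greedy CTC decoding of per-frame symbol probabilities.

    frame_probs[t][k] is the probability of symbol k at video frame t,
    e.g. a softmax over a BiLSTM head placed on top of 3D-CNN features.
    """
    # Step 1: choose the most likely symbol at every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Steps 2 and 3: collapse consecutive repeats, then drop blanks.
    chars: List[str] = []
    previous: Optional[int] = None
    for k in best_path:
        if k != previous and k != BLANK:
            chars.append(ALPHABET[k])
        previous = k
    return "".join(chars)


def one_hot(symbol: str, sharpness: float = 0.9) -> List[float]:
    """Hypothetical helper: a frame distribution that strongly prefers one symbol.

    "_" stands for the blank; any other symbol must be in ALPHABET.
    """
    k = BLANK if symbol == "_" else ALPHABET.index(symbol)
    rest = (1.0 - sharpness) / (len(ALPHABET) - 1)
    return [sharpness if i == k else rest for i in range(len(ALPHABET))]


# Frame-level path for the GRID-style words "bin blue"; "_" marks blank frames.
path = "bb_ii_nn  _bll_uue"
frames = [one_hot(s) for s in path]
print(ctc_greedy_decode(frames))  # bin blue
```

A blank between two identical symbols is what lets the decoder emit genuine double letters: the path `ss_oo_o_nn` decodes to "soon", whereas `ssoooonn` would collapse to "son". Under this assumed setup, the decoded string is what would then be handed to the T5 grammar-correction stage.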

Keywords:

dysarthria, lip movement recognition, transformers, speech processing, human-computer interaction, BiLSTM
