A Multimodal Transformer-LSTM Framework for Cross-Lingual Lexical Alignment of Indonesian Regional Languages
Received: 21 December 2025 | Revised: 7 February 2026 and 28 February 2026 | Accepted: 2 March 2026 | Online: 31 March 2026
Corresponding author: Ema Utami
Abstract
Research on cross-lingual lexical alignment for Indonesian regional languages remains limited, particularly in the use of multimodal data and in the systematic evaluation of multilingual Transformer models. Most previous studies have relied on text-only data and semantic-similarity approaches, without adequate empirical evidence on the performance of literal lexical alignment derived from multimodal transcription. These limitations underscore the need to examine multimodal approaches to cross-lingual lexical alignment in Indonesian regional languages. This study contributes the development and evaluation of a multimodal framework based on a Transformer and Long Short-Term Memory (LSTM) architecture for cross-lingual lexical alignment. Its novelty lies in the use of primary field data from native speakers, curated into a multimodal corpus of Indonesian regional languages comprising 6,000 parallel multimodal utterances, and in the comparative evaluation of multilingual Transformer models for literal lexical alignment based on audio and visual transcription. Audio and visual data were transcribed with Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) to produce cross-modality text representations, which were then processed by the mBERT, DistilBERT, XLM-R Large, and LaBSE models, supported by sequential modeling with an LSTM. Experimental results show that multilingual Transformer models are effective for cross-lingual lexical alignment, with LaBSE delivering the most consistent performance on the Accuracy, Mean Reciprocal Rank (MRR), and BLEU metrics. These findings provide a substantial empirical contribution to the study of multimodal cross-lingual lexical alignment for Indonesian regional languages.
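The retrieval-style evaluation the abstract describes (ranking candidate target-language items by embedding similarity and scoring with Accuracy and MRR) can be sketched as follows. This is a minimal illustration only: the toy vectors stand in for sentence embeddings such as those produced by LaBSE, and nothing here reproduces the authors' actual pipeline or data.

```python
import numpy as np

def accuracy_at_1(sim: np.ndarray, gold: np.ndarray) -> float:
    """Fraction of source items whose top-ranked candidate is the gold target."""
    return float(np.mean(np.argmax(sim, axis=1) == gold))

def mean_reciprocal_rank(sim: np.ndarray, gold: np.ndarray) -> float:
    """MRR over a similarity matrix: sim[i, j] scores source i against
    target candidate j; gold[i] is the index of the correct target."""
    order = np.argsort(-sim, axis=1)  # candidates, best first
    ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                      for i in range(sim.shape[0])])
    return float(np.mean(1.0 / ranks))

# Toy L2-normalized embeddings standing in for multilingual sentence vectors.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = src + 0.1 * rng.normal(size=(4, 8))  # noisy "translations" of src
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

sim = src @ tgt.T                # cosine similarity (rows are unit vectors)
gold = np.arange(4)              # item i aligns with candidate i
print(accuracy_at_1(sim, gold), mean_reciprocal_rank(sim, gold))
```

With real models, `src` and `tgt` would be embeddings of the ASR/OCR transcriptions in the source and target languages; the ranking and scoring logic is unchanged.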
Keywords:
SaSaMbo, multimodal, transformer, LSTM, ASR, OCR
References
R. Pramana, M. Jonathan, H. S. Yani, and R. Sutoyo, "A Comparison of BiLSTM, BERT, and Ensemble Method for Emotion Recognition on Indonesian Product Reviews," Procedia Computer Science, vol. 245, pp. 399–408, Jan. 2024. DOI: https://doi.org/10.1016/j.procs.2024.10.266
M. Lupaşcu, A. C. Rogoz, M. S. Stupariu, and R. T. Ionescu, "Large multimodal models for low-resource languages: A survey," Information Fusion, vol. 131, July 2026, Art. no. 104189. DOI: https://doi.org/10.1016/j.inffus.2026.104189
P. Pakray, A. Gelbukh, and S. Bandyopadhyay, "Natural language processing applications for low-resource languages," Natural Language Processing, vol. 31, no. 2, pp. 183–197, Mar. 2025. DOI: https://doi.org/10.1017/nlp.2024.33
D. D. Baishya, D. R. Baruah, D. M. Bora, and B. Sarma, "Processing Low-Resource Languages: A Review Of Challenges And Strategies For Inclusive NLP And Sustainable Environment," International Journal of Environmental Sciences, pp. 7730–7739, Sept. 2025. DOI: https://doi.org/10.64252/w55rwj24
S. Raharjo, E. Utami, E. Sutanta, and N. R. Az–Zahra Raharema, "Korpus.id: A Database-Driven Approach to Linguistic Annotation and Analysis," in 2025 Eighth International Women in Data Science Conference at Prince Sultan University (WiDS PSU), Apr. 2025, pp. 151–156. DOI: https://doi.org/10.1109/WiDS-PSU64963.2025.00040
T. Dalai, A. Das, T. K. Mishra, and P. K. Sa, "OdNER: NER resource creation and system development for low-resource Odia language," Natural Language Processing Journal, vol. 11, June 2025, Art. no. 100139. DOI: https://doi.org/10.1016/j.nlp.2025.100139
A. S. Ekakristi, A. F. Wicaksono, and R. Mahendra, "Intermediate-task transfer learning for Indonesian NLP tasks," Natural Language Processing Journal, vol. 12, Sept. 2025, Art. no. 100161. DOI: https://doi.org/10.1016/j.nlp.2025.100161
A. Maesya, Y. Arifin, A. Zahra, and W. Budiharto, "AMSunda: A novel dataset for Sundanese information retrieval," Data in Brief, vol. 61, Aug. 2025, Art. no. 111796. DOI: https://doi.org/10.1016/j.dib.2025.111796
Z. Zainuddin, Mudassir, and Z. Tahir, "Entity Extraction in Indonesian Online News Using Named Entity Recognition (NER) with Hybrid Method Transformer, Word2Vec, Attention and Bi-LSTM," JOIV : International Journal on Informatics Visualization, vol. 9, no. 3, pp. 964–973, May 2025. DOI: https://doi.org/10.62527/joiv.9.3.2902
A. Banerjee and D. Banik, "A Comprehensive Survey on Transformer-Based Machine Translation: Identifying Research Gaps and Solutions for Large Language Models," ACM Computing Surveys, vol. 58, no. 5, Art. no. 124, Sept. 2025. DOI: https://doi.org/10.1145/3773076
L. Qin et al., "A survey of multilingual large language models," Patterns, vol. 6, no. 1, Jan. 2025. DOI: https://doi.org/10.1016/j.patter.2024.101118
H. Boutouta, A. Lakhfif, F. Senator, and C. Mediani, "A Transformer-based Hybrid Model for Implicit Emotion Recognition in Arabic Text," Engineering, Technology & Applied Science Research, vol. 15, no. 3, pp. 23834–23839, June 2025. DOI: https://doi.org/10.48084/etasr.10261
H. Zhang and M. O. Shafiq, "Survey of transformers and towards ensemble learning using transformers for natural language processing," Journal of Big Data, vol. 11, no. 1, Feb. 2024, Art. no. 25. DOI: https://doi.org/10.1186/s40537-023-00842-0
Z. Chen et al., "Evolution and Prospects of Foundation Models: From Large Language Models to Large Multimodal Models," Computers, Materials & Continua, vol. 80, no. 2, pp. 1753–1808, 2024. DOI: https://doi.org/10.32604/cmc.2024.052618
I. D. Mienye, T. G. Swart, and G. Obaido, "Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications," Information, vol. 15, no. 9, Aug. 2024. DOI: https://doi.org/10.3390/info15090517
N. Khan et al., "Systematic Literature Review of Machine Learning Models and Applications for Text Recognition," IEEE Access, vol. 13, pp. 177647–177670, 2025. DOI: https://doi.org/10.1109/ACCESS.2025.3618109
S. Bhushan, V. Prakash Mishra, V. Rishiwal, S. Arunkumar, and U. Agarwal, "Advancing Text-to-Speech Systems for Low-Resource Languages: Challenges, Innovations, and Future Directions," IEEE Access, vol. 13, pp. 155729–155758, 2025. DOI: https://doi.org/10.1109/ACCESS.2025.3605236
X. Dong, "Computer-Assisted Multimodal Translanguaging Analysis in English Classrooms: A Deep-Learning and NLP Framework," Informatica, vol. 49, no. 37, Dec. 2025. DOI: https://doi.org/10.31449/inf.v49i37.10365
T. Seng, "Analysis of multimodal data recorded during videoconferences," Ph.D. dissertation, Université de Toulouse, France, 2025.
T. Baltrusaitis, C. Ahuja, and L.-P. Morency, "Multimodal Machine Learning: A Survey and Taxonomy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, Feb. 2019. DOI: https://doi.org/10.1109/TPAMI.2018.2798607
Y. Hao and B. Zhou, "Multi-modal smart classroom topic segmentation," in Proceedings of the 2025 5th International Conference on Applied Mathematics, Modelling and Intelligent Computing, Mar. 2025, pp. 262–268. DOI: https://doi.org/10.1145/3745533.3745577
E. Marevac, E. Kadušić, N. Živić, N. Buzađija, E. Tabak, and S. Velić, "Multimodal Video Summarization Using Machine Learning: A Comprehensive Benchmark of Feature Selection and Classifier Performance," Algorithms, vol. 18, no. 9, Sept. 2025, Art. no. 572. DOI: https://doi.org/10.3390/a18090572
S. Wahyuni and E. Utami, "SaSamBo Corpus Dataset." Mendeley Data, Feb. 10, 2026.
M. I. Ragab, E. H. Mohamed, and W. Medhat, "Multilingual Propaganda Detection: Exploring Transformer-Based Models mBERT, XLM-RoBERTa, and mT5," in Proceedings of the first International Workshop on Nakba Narratives as Language Resources, Jan. 2025.
H. Wei et al., "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model." arXiv, Sept. 03, 2024.
S. N. Endah, Suprapto, and Y. Suyanto, "Enhancing Low-Resource Dialectal ASR in Indonesian Using Speech-Transformer Models and Data Augmentation," Engineering, Technology & Applied Science Research, vol. 15, no. 5, pp. 28095–28101, Oct. 2025. DOI: https://doi.org/10.48084/etasr.12734
P. T. Krishnan, A. N. Joseph Raj, and V. Rajangam, "Emotion classification from speech signal based on empirical mode decomposition and non-linear features: Speech emotion recognition," Complex & Intelligent Systems, vol. 7, no. 4, pp. 1919–1934, Aug. 2021. DOI: https://doi.org/10.1007/s40747-021-00295-z
H. Kashid and P. Bhattacharyya, "RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages," in Proceedings of the 21st International Conference on Natural Language Processing (ICON), Sept. 2024, pp. 274–284.
A. P. Bhopale and A. Tiwari, "Transformer based contextual text representation framework for intelligent information retrieval," Expert Systems with Applications, vol. 238, Mar. 2024, Art. no. 121629. DOI: https://doi.org/10.1016/j.eswa.2023.121629
S. Islam et al., "A comprehensive survey on applications of transformers for deep learning tasks," Expert Systems with Applications, vol. 241, May 2024, Art. no. 122666. DOI: https://doi.org/10.1016/j.eswa.2023.122666
Y. Ma, "Cross-language Text Generation Using mBERT and XLM-R: English-Chinese Translation Task," in Proceedings of the 2024 International Conference on Machine Intelligence and Digital Applications, May 2024, pp. 602–608. DOI: https://doi.org/10.1145/3662739.3672320
V. Dhananjaya, S. Ranathunga, and S. Jayasena, "Lexicon-based fine-tuning of multilingual language models for low-resource language sentiment analysis," CAAI Transactions on Intelligence Technology, vol. 9, no. 5, pp. 1116–1125, 2024. DOI: https://doi.org/10.1049/cit2.12333
B. Li, "A Study of DistilBERT-Based Answer Extraction Machine Reading Comprehension Algorithm," in Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy, Mar. 2024, pp. 261–268. DOI: https://doi.org/10.1145/3672919.3672968
K. R. Mabokela, T. Celik, and M. Raborife, "Multilingual Sentiment Analysis for Under-Resourced Languages: A Systematic Review of the Landscape," IEEE Access, vol. 11, pp. 15996–16020, 2023. DOI: https://doi.org/10.1109/ACCESS.2022.3224136
F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, "Language-agnostic BERT Sentence Embedding," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 878–891. DOI: https://doi.org/10.18653/v1/2022.acl-long.62
M. I. Salih, S. M. Mohammed, A. K. Ibrahim, O. M. Ahmed, and L. M. Haji, "Fine-Tuning BERT for Automated News Classification," Engineering, Technology & Applied Science Research, vol. 15, no. 3, pp. 22953–22959, June 2025. DOI: https://doi.org/10.48084/etasr.10625
P. Přibáň, J. Šmíd, J. Steinberger, and A. Mištera, "A comparative study of cross-lingual sentiment analysis," Expert Systems with Applications, vol. 247, Aug. 2024, Art. no. 123247. DOI: https://doi.org/10.1016/j.eswa.2024.123247
E. Hashmi, S. Y. Yayilgan, and S. Shaikh, "Augmenting sentiment prediction capabilities for code-mixed tweets with multilingual transformers," Social Network Analysis and Mining, vol. 14, no. 1, Apr. 2024, Art. no. 86. DOI: https://doi.org/10.1007/s13278-024-01245-6
C. Kaoutar, L. Yasser, and S. Maha, "Machine Translation with Neural Networks Based on a Transformer," International Journal For Multidisciplinary Research, vol. 6, no. 5, Sept. 2024, Art. no. 26674. DOI: https://doi.org/10.36948/ijfmr.2024.v06i05.26674
Z. Li and Z. Ke, "Cross-Modal Augmentation for Low-Resource Language Understanding and Generation," in Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), 2025, pp. 90–99. DOI: https://doi.org/10.18653/v1/2025.magmar-1.9
İ. Ü. Oğul, F. Soygazi, and B. E. Bostanoğlu, "TurkMedNLI: a Turkish medical natural language inference dataset through large language model based translation," PeerJ Computer Science, vol. 11, Jan. 2025, Art. no. e2662. DOI: https://doi.org/10.7717/peerj-cs.2662
License
Copyright (c) 2026 Ema Utami, Sri Ngudi Wahyuni, Mulia Sulistiyono, Suwanto Raharjo, Anggit Dwi Hartanto, Arif Nur Rohman, Bambang Krismono Triwijoyo, Titik Ceriyani Miswaty, Elyakim Nova Supriyedi Patty, Fahry

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
