Transformer Hyperparameter Tuning for Madurese-Indonesian Machine Translation
Received: 6 December 2024 | Revised: 28 January 2025 | Accepted: 5 March 2025 | Online: 3 April 2025
Corresponding author: Fika Hastarita Rachman
Abstract
The main obstacle to applying Neural Machine Translation (NMT) to the Madurese language is the scarcity of training data, since no adequate parallel corpus is available. In addition, the model must handle the vocabulary differences that arise from the politeness levels of Madurese (coarse, moderate, and smooth). A rule-based approach requires many rules to represent these differences, whereas a statistical approach relies on word frequencies in the training data and cannot accurately capture variations in politeness level. To address these problems, a parallel corpus was created to provide adequate training data, and an embedding matrix based on Skip-Gram with Negative Sampling (SGNS) was used to produce better word representations for processing with a Transformer. The study evaluates model configurations along two dimensions: dataset size (large and small) and tokenization method (word level and subword level). The best results were obtained with the large dataset and word-level tokenization, achieving 0.70% accuracy for entirely correct text, 78.87% for partially correct text, and BLEU scores ranging from 4.76 to 27.63 as the maximum n-gram order varied from 1 to 4. This approach improved translation accuracy and shows significant potential for developing NMT systems for low-resource languages such as Madurese.
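To illustrate what a BLEU score "with a maximum n-gram value from 1 to 4" means, the following is a minimal sketch of unsmoothed sentence-level BLEU with uniform weights, written from the standard definition. It is not the evaluation code used in the study; the function names, the `max_n` parameter, and the two example sentences are assumptions made purely for illustration, and the score is returned on a 0 to 1 scale rather than as a percentage.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU over orders 1..max_n, in [0, 1]."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions with uniform weights.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical Indonesian sentences, invented for this example only.
reference = "saya pergi ke pasar pagi ini".split()
candidate = "saya pergi ke pasar sore ini".split()

scores = {n: bleu(candidate, reference, max_n=n) for n in range(1, 5)}
for n, score in scores.items():
    print(f"max n-gram order {n}: BLEU = {score:.4f}")
```

Higher n-gram orders demand longer exact matches, so the score typically falls as the maximum order rises, which is why a single system can report a range of BLEU values.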
Keywords:
machine translation, neural machine translation, transformers, Madurese, Indonesian, subword tokenization, word pieces
License
Copyright (c) 2025 Fika Hastarita Rachman, M. Wildan Mubarok Asy Syauqi, Noor Ifada, Imamah, Sri Wahyuni

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.