Exploring the Impact of Annotation Schemes on Arabic Named Entity Recognition across General and Specific Domains
Received: 11 January 2025 | Revised: 11 February 2025, 15 February 2025, and 18 February 2025 | Accepted: 21 February 2025 | Online: 14 March 2025
Corresponding author: Taoufiq El Moussaoui
Abstract
Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP) that involves identifying and classifying entities into predefined categories. Despite its importance, the impact of annotation schemes and their interaction with domain types on NER performance, particularly for Arabic, remains underexplored. This study examines the influence of seven annotation schemes (IO, BIO, IOE, BIOES, BI, IE, and BIES) on Arabic NER performance using the general-domain ANERCorp dataset and a domain-specific Moroccan legal corpus. Three models were evaluated: Logistic Regression (LR), Conditional Random Fields (CRF), and the transformer-based Arabic Bidirectional Encoder Representations from Transformers (AraBERT) model. Results show that the impact of annotation schemes on performance is independent of domain type. Traditional Machine Learning (ML) models such as LR and CRF perform best with simpler annotation schemes like IO due to their computational efficiency and balanced precision-recall metrics. On the other hand, AraBERT excels with more complex schemes (BIOES, BIES), achieving superior performance in tasks requiring nuanced contextual understanding and intricate entity relationships, though at the cost of higher computational demands and execution time. These findings underscore the trade-offs between annotation scheme complexity and computational requirements, offering valuable insights for designing NER systems tailored to both general and domain-specific Arabic NLP applications.
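To illustrate how the annotation schemes compared in the study differ, the sketch below converts token-level entity spans into IO, BIO, and BIOES label sequences. This is a minimal, hypothetical example: the function name, the span format, and the sample sentence are illustrative assumptions, not taken from the paper's corpora or code.

```python
def tag_tokens(n_tokens, spans, scheme="BIO"):
    """Label n_tokens tokens under a given annotation scheme.

    spans: list of (start, end_exclusive, entity_type) tuples.
    Supported schemes here: IO, BIO, BIOES (illustrative subset
    of the seven schemes compared in the paper).
    """
    tags = ["O"] * n_tokens  # non-entity tokens stay "O"
    for start, end, etype in spans:
        for i in range(start, end):
            if scheme == "IO":
                # IO marks every entity token the same way
                tags[i] = f"I-{etype}"
            elif scheme == "BIO":
                # BIO distinguishes the Beginning token from Inside tokens
                tags[i] = ("B-" if i == start else "I-") + etype
            elif scheme == "BIOES":
                # BIOES adds End and Single tags for finer boundaries
                if end - start == 1:
                    tags[i] = f"S-{etype}"
                elif i == start:
                    tags[i] = f"B-{etype}"
                elif i == end - 1:
                    tags[i] = f"E-{etype}"
                else:
                    tags[i] = f"I-{etype}"
    return tags

# Tokens: ["Rabat", "is", "in", "Morocco"] with two single-token LOC entities
spans = [(0, 1, "LOC"), (3, 4, "LOC")]
print(tag_tokens(4, spans, "IO"))     # ['I-LOC', 'O', 'O', 'I-LOC']
print(tag_tokens(4, spans, "BIO"))    # ['B-LOC', 'O', 'O', 'B-LOC']
print(tag_tokens(4, spans, "BIOES"))  # ['S-LOC', 'O', 'O', 'S-LOC']
```

The richer schemes carry more boundary information per label but enlarge the tag set, which matches the paper's finding that simple schemes suit lightweight models while AraBERT can exploit the extra structure of BIOES-style labels.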
Keywords:
Arabic named entity recognition, annotation schemes, general-domain NER, domain-specific NER, AraBERT
License
Copyright (c) 2025 Taoufiq El Moussaoui, Chakir Loqman, Jaouad Boumhidi

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.