NeoAraBERT: A Modern Foundation Model for Arabic Embeddings with Diacritics-Aware Tokenization and POS-Targeted Masking

Chadi Abou Chakra, Hadi Hamoud, Osama Rakan Al Mraikhat, Qusai Abu Obaida, Mohamad Ballout, Fadi A. Zaraket

Abstract

We present NeoAraBERT, a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrain NeoAraBERT on diverse open-source and internal datasets covering Modern Standard, Classical, and dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies covering text normalization, light stemming, and diacritics handling in tokenization, and we ran further ablation studies on POS-targeted token masking and learning-rate scheduling. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including Muradif, a novel synonym-based task that directly assesses embedding quality without additional fine-tuning. NeoAraBERT variants rank first in 18 of the 23 tasks and improve average performance across the full benchmark suite.
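As an illustration of the POS-targeted masking idea — raising the masking rate on semantically rich tags such as nouns, verbs, and adjectives — the sketch below shows one way such a scheme can work. The tag set, base rate, and boost factor are assumptions for demonstration, not the paper's actual settings.

```python
import random

# Hypothetical POS-targeted masking sketch: content-word tags are
# masked at a higher rate than function words. Tags and rates here
# are illustrative assumptions, not NeoAraBERT's exact configuration.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ"}

def mask_probabilities(pos_tags, base_rate=0.15, content_boost=2.0):
    """Return a per-token masking probability, boosted for content words."""
    return [base_rate * content_boost if tag in CONTENT_TAGS else base_rate
            for tag in pos_tags]

def sample_mask(pos_tags, seed=0):
    """Sample a boolean mask over tokens using the POS-weighted rates."""
    rng = random.Random(seed)
    return [rng.random() < p for p in mask_probabilities(pos_tags)]

tags = ["NOUN", "ADP", "VERB", "DET", "ADJ"]
print(mask_probabilities(tags))  # → [0.3, 0.15, 0.3, 0.15, 0.3]
```

With this weighting, nouns, verbs, and adjectives are twice as likely to be masked during pretraining, which pushes the model to predict the tokens that carry most of the sentence's meaning.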

Main Contributions

  • A modern Arabic encoder architecture built on NeoBERT.
  • Diacritics-aware tokenization designed to preserve lexical identity while retaining useful diacritic information.
  • POS-targeted masking that emphasizes semantically rich token groups.
  • A new synonym benchmark, Muradif, for direct embedding evaluation without task-specific fine-tuning.
  • Strong benchmark gains across dialectal Arabic, Modern Standard Arabic, and Classical Arabic.
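To make the diacritics-aware tokenization contribution concrete, the sketch below shows one plausible mechanism: splitting Arabic diacritic marks (tashkeel, U+064B–U+0652) off the base letters, so the undiacritized word keeps a single lexical identity while the diacritic information is preserved separately. This is an illustrative assumption, not NeoAraBERT's exact tokenizer.

```python
# Hedged sketch of diacritics-aware handling: separate tashkeel marks
# from base letters so "كَتَبَ" and "كتب" share one lexical form while
# the diacritics survive as positioned annotations.
TASHKEEL = {chr(c) for c in range(0x064B, 0x0653)}  # fathatan … sukun

def split_diacritics(word):
    """Return (base_form, [(letter_index, diacritic), ...]) for a word."""
    base, marks = [], []
    for ch in word:
        if ch in TASHKEEL:
            marks.append((len(base) - 1, ch))  # attach to preceding letter
        else:
            base.append(ch)
    return "".join(base), marks

base, marks = split_diacritics("كَتَبَ")  # "kataba" with fatha marks
print(base)        # → كتب
print(len(marks))  # → 3
```

Keeping the base form stable avoids fragmenting the vocabulary across diacritized and undiacritized spellings of the same word, while the positioned marks remain available to the model.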

Model Variants

  • NeoAraBERT_Mix: the most balanced checkpoint across MSA, dialectal Arabic, and Classical Arabic.
  • NeoAraBERT_MSA: strongest on Modern Standard Arabic tasks.
  • NeoAraBERT_DA: strongest on dialect-focused tasks.
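Once a variant's checkpoint is loaded, a common way to turn an encoder's token-level outputs into a single sentence embedding is attention-mask-weighted mean pooling. The sketch below uses plain NumPy arrays standing in for the model's last hidden states and the tokenizer's attention mask; it is a generic pooling recipe, not the paper's prescribed one.

```python
import numpy as np

# Masked mean pooling over token embeddings — padding positions
# (mask == 0) are excluded from the average.
def mean_pool(hidden, mask):
    """hidden: (seq_len, dim) float array; mask: (seq_len,) 0/1 array."""
    mask = mask[:, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=0)
    count = np.clip(mask.sum(), 1e-9, None)  # avoid division by zero
    return summed / count

hidden = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])  # third position is padding
print(mean_pool(hidden, mask))  # → [2. 3.]
```

The same function applies unchanged to any of the three variants, since they share the NeoBERT encoder interface.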

Benchmark Comparison

| Benchmark | Metric | Type | Mix | MSA | DA | AraBERTv2 | ARBERTv2 | MARBERTv2 | AraModernBERT | CAMeLBERT-mix |
|---|---|---|---|---|---|---|---|---|---|---|
| Average | Score | Overall | 83.79 | 83.30 | 83.44 | 80.75 | 80.31 | 80.45 | 81.04 | 80.04 |
| Muradif (synonyms) | ACC | MSA | 87.03 | 86.32 | 82.64 | 64.56 | 73.41 | 67.15 | 77.33 | 67.52 |
| Wiki_news (diacritics) | ACC | MSA | 94.84 | 94.74 | 94.21 | 89.13 | 89.38 | 84.50 | 89.42 | 94.35 |
| ANTv2 Text | F1 | MSA | 88.31 | 88.71 | 88.47 | 88.13 | 88.45 | 87.97 | 87.16 | 88.09 |
| ANTv2 Title | F1 | MSA | 81.85 | 82.97 | 82.65 | 82.48 | 82.35 | 82.37 | 81.27 | 81.62 |
| ANTv2 Text + Title | F1 | MSA | 87.71 | 88.70 | 88.05 | 88.17 | 88.30 | 87.91 | 87.44 | 87.34 |
| Al Khaleej | F1 | MSA | 95.39 | 95.42 | 94.79 | 94.89 | 95.23 | 95.18 | 94.81 | 94.91 |
| ANERcorp | μF1 | MSA | 81.42 | 80.12 | 80.76 | 82.10 | 82.23 | 78.88 | 73.41 | 80.83 |
| WoojoodNER | μF1 | MSA | 91.36 | 91.90 | 90.93 | 90.91 | 84.72 | 89.12 | 91.08 | 88.43 |
| Q2Q (STS) | F1 | MSA | 95.26 | 95.45 | 94.75 | 96.29 | 95.48 | 95.21 | 96.01 | 95.04 |
| XNLI | F1 | MSA | 80.84 | 80.08 | 80.66 | 79.28 | 76.40 | 74.90 | 78.53 | 73.36 |
| Woojood_hadath | F1 | MSA | 89.67 | 90.53 | 89.57 | 90.53 | 89.69 | 90.10 | 92.37 | 89.00 |
| ArabicSense (reason) | F1 | MSA | 98.82 | 98.23 | 98.23 | 97.76 | 96.46 | 96.35 | 92.80 | 96.57 |
| SALMA (POS) | μF1 | MSA | 97.26 | 97.02 | 96.67 | 94.59 | 97.04 | 96.10 | 93.70 | 96.57 |
| UD (POS) | μF1 | MSA | 97.07 | 96.89 | 96.84 | 95.97 | 96.85 | 96.56 | 95.77 | 95.83 |
| WSD | F1 | MSA | 83.51 | 82.18 | 81.57 | 83.48 | 80.76 | 79.74 | 79.74 | 79.98 |
| AraSarcasm (Sarc) | F1 | DA | 73.48 | 74.19 | 75.24 | 74.42 | 74.97 | 76.48 | 72.27 | 71.91 |
| AraSarcasm (Sent) | F1 | DA | 73.36 | 70.47 | 73.09 | 73.68 | 71.29 | 73.89 | 70.78 | 72.92 |
| MAWQIF (Stance) | F1 | DA | 67.64 | 66.81 | 70.94 | 65.74 | 65.13 | 70.10 | 65.92 | 66.35 |
| MAWQIF (Sent) | F1 | DA | 69.23 | 66.09 | 69.56 | 68.54 | 64.73 | 69.54 | 65.66 | 65.84 |
| Arabic Dialects | F1 | DA | 79.18 | 75.32 | 78.20 | 77.36 | 79.02 | 78.39 | 75.89 | 76.96 |
| APCD_meter | F1 | CA | 85.34 | 85.34 | 84.94 | 77.69 | 77.31 | 77.61 | 83.09 | 77.70 |
| APCD_era | F1 | CA | 53.71 | 52.97 | 50.85 | 26.70 | 25.59 | 28.26 | 46.21 | 26.21 |
| Poem_emotion | F1 | CA | 74.87 | 75.54 | 75.60 | 74.82 | 72.25 | 74.11 | 73.36 | 73.50 |

Mix, MSA, and DA columns are the NeoAraBERT variants. Type: MSA = Modern Standard Arabic, DA = dialectal Arabic, CA = Classical Arabic.

Citation

If you use the code, the models, or the Muradif benchmark, please cite:

@inproceedings{abou-chakra-etal-2026-neoarabert,
  title = "{NeoAraBERT}: A Modern Foundation Model for Arabic Embeddings with Diacritics-Aware Tokenization and POS-Targeted Masking",
  author = "Abou Chakra, Chadi and
            Hamoud, Hadi and
            Rakan Al Mraikhat, Osama and
            Abu Obaida, Qusai and
            Ballout, Mohamad and
            Zaraket, Fadi A.",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
  address = "San Diego, California, United States",
  year = "2026",
  note = "Accepted paper",
  url = "https://acr.ps/neoarabert",
  abstract = {We present NeoAraBERT, a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pre-train NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization handling. We also performed more general POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a novel synonym-based task, ``Muradif'', that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants (MSA, dialectal, and mixed) rank first in 18 tasks, second in two, third in two, and fourth in one task. They show strong performance on classical and modern standard Arabic, substantial margins of improvement ($>$7\%) in two tasks, and a $+$2.75\% improvement on average across all tasks. Our code and links to checkpoints for our model variants are available on our website: \url{https://acr.ps/neoarabert}}
}

Acknowledgements

We would like to acknowledge Ahmad Talal Salman from Assafir and Professor Amer Abdo Mouawad from the American University of Beirut (AUB) for sharing Assafir data, which was instrumental to the work presented in this paper.