The DALLA Suite: Open Tools for Arabic Large Language Model Development
Large Language Models (LLMs) are rapidly advancing and now power many of the technologies we use. However, most progress has centered on high-resource languages, especially English, making these models more effective for those languages and more reflective of their cultural contexts. In contrast, lower-resource languages still face significant gaps in data, technology, and research attention.
For Arabic, these gaps are even more pronounced. Its already limited resources, combined with rich morphology, complex grammar, and the phenomenon of diglossia (the coexistence of Modern Standard Arabic and diverse dialects), make building strong Arabic-centric LLMs especially challenging. Although recent Arabic open-weight model releases have helped, replicating their full training process remains difficult due to the lack of end-to-end, open-source tooling specifically designed for building Arabic LLMs.
In an effort to fill this gap, we are excited to introduce the DALLA suite, a comprehensive, fully open set of tools dedicated to developing Arabic Large Language Models. The DALLA suite supports the entire adaptation workflow for open-weight LLMs, including:
- a robust Arabic data-processing pipeline,
- practical tokenizer-adaptation tools that enhance Arabic coverage and model efficiency, and
- an easy-to-use training recipe that ties all components together.
To demonstrate the usage of the DALLA suite, we are also releasing two DALLA open-weight Arabic-adapted models, based on Gemma 2 9B and Llama 3.1 8B.
1. Data Processing
High-quality Arabic data is essential for building effective Arabic LLMs, yet most available corpora have duplicates, inconsistent quality, and varying levels of linguistic complexity. The DALLA suite includes a complete data-processing pipeline designed to address these challenges efficiently and transparently.
Deduplication
We provide a deduplication module built on a modified version of the Onion (ONe Instance ONly) deduplication tool. Our version can process documents from multiple directories rather than requiring a single combined file, which makes it easy to clean large, multi-source Arabic datasets. It also improves duplicate-source detection: it reports exactly which files are duplicated across directories, helping users make informed decisions about which sources to keep.
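To make the idea of cross-directory duplicate reporting concrete, here is a minimal sketch. It is not the module's code: it swaps Onion's n-gram matching for exact hashing of whitespace-normalized text, and the function names and the `.txt` file layout are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def fingerprint(text):
    """Hash whitespace-normalized text (an exact-match stand-in for n-gram comparison)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def iter_documents(directories):
    """Yield (path, text) pairs for every .txt file under the given directories (assumed layout)."""
    for directory in directories:
        for path in sorted(Path(directory).rglob("*.txt")):
            yield str(path), path.read_text(encoding="utf-8")


def find_duplicates(documents):
    """Group document sources by fingerprint and keep only groups with more than one source."""
    groups = defaultdict(list)
    for source, text in documents:
        groups[fingerprint(text)].append(source)
    return [sources for sources in groups.values() if len(sources) > 1]
```

Calling `find_duplicates(iter_documents(["news/", "books/"]))` would return lists of file paths that share the same normalized content, which is the kind of report a user needs when deciding which source to keep.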
Quality Checking with Morphological Analysis
To help users filter low-quality or noisy data, the suite includes a document-level quality checker that uses CAMeL Tools morphological disambiguation. It identifies issues such as words with no morphological analysis and foreign words whose gloss matches the word itself. For each document, it computes a quality score from the total word count and the total error count, with higher values indicating cleaner text. The system also tracks all problematic words and their frequencies for later inspection, and it ignores numbers and punctuation to avoid false positives. It supports both the MLE and BERT disambiguation models and appends all error counts and percentages as metadata alongside the original text.
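The scoring logic can be sketched roughly as follows. The exact formula is not given here, so this sketch assumes the score is one minus the error rate. The `is_problematic` callable is a hypothetical stand-in for the CAMeL Tools checks, and whitespace tokenization is a simplification.

```python
import re
from collections import Counter

# Tokens made up only of digits, punctuation, or underscores are ignored.
_NON_WORD = re.compile(r"^[\W\d_]+$")


def score_document(text, is_problematic):
    """Return the text plus error metadata; is_problematic(word) is a hypothetical checker."""
    words = [tok for tok in text.split() if not _NON_WORD.match(tok)]
    problems = Counter(word for word in words if is_problematic(word))
    total, errors = len(words), sum(problems.values())
    return {
        "text": text,
        "total_words": total,
        "total_errors": errors,
        "error_pct": 100.0 * errors / total if total else None,
        # Assumed formula: higher means cleaner, 1.0 means no flagged words.
        "quality_score": 1.0 - errors / total if total else None,
        "problem_words": dict(problems),
    }
```

In a real pipeline, `is_problematic` would wrap the disambiguator and flag words with no analysis or with a gloss identical to the word.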
Stemming with Light Morphological Segmentation
The DALLA suite includes a stemming module that performs light morphological segmentation using CAMeL Tools. It breaks words into their morphological components and reconstructs them using the most reliable of three segmentation schemes (d3tok, bwtok, and d3seg). A separator token, which defaults to <+>, is used to mark the boundaries between morphological components, ensuring clear and consistent segmentation across the dataset. Stemming is especially useful for Arabic, where rich morphology creates many surface forms of the same word; reducing this variation helps the model learn more effectively.
To improve coverage, we expanded CAMeL’s morphology database by manually annotating the top 5,000 words in our corpus that CAMeL could not analyze. When these words appear during processing, the system applies the corresponding verified segmentation pattern automatically.
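The interaction between the manual annotations, the three schemes, and the separator token can be sketched as below. This is an illustration under assumptions: the `analyze` callable is a hypothetical wrapper that returns segments per scheme, the fixed preference order stands in for whatever "most reliable" means in the module, and joining segments with the separator is one plausible reading of how boundaries are marked.

```python
# Assumed preference order; the module's actual reliability criterion is not specified here.
SCHEME_PREFERENCE = ("d3tok", "bwtok", "d3seg")


def segment_word(word, analyze, overrides, sep="<+>"):
    """Segment one word, preferring manually verified patterns over analyzer output."""
    if word in overrides:
        return sep.join(overrides[word])
    analysis = analyze(word) or {}
    for scheme in SCHEME_PREFERENCE:
        segments = analysis.get(scheme)
        if segments:
            return sep.join(segments)
    return word  # No analysis available: leave the word unchanged.


def segment_text(text, analyze, overrides, sep="<+>"):
    """Apply segment_word to every whitespace-separated token."""
    return " ".join(segment_word(w, analyze, overrides, sep) for w in text.split())
```

Here `overrides` plays the role of the manually annotated word list: a mapping from a surface form to its verified list of segments.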
The module also supports optional diacritic preservation and text normalization. Depending on configuration, it can output stemmed text with the original diacritics preserved or a fully dediacritized version. Additional normalization steps, such as unifying alef variants and normalizing taa marbouta, help reduce vocabulary size.
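For readers unfamiliar with these normalization steps, the following standalone sketch shows what they typically look like. It uses only the standard library rather than the suite's own code, and mapping taa marbouta to haa is a common convention assumed here, not a confirmed detail of the module.

```python
import re

# Arabic short vowels, tanween, shadda, sukun (U+064B-U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Alef with madda, hamza above, and hamza below.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def dediacritize(text):
    """Remove Arabic diacritic marks."""
    return _DIACRITICS.sub("", text)


def normalize(text):
    """Unify alef variants to bare alef and map taa marbouta to haa (assumed convention)."""
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return text.replace("\u0629", "\u0647")
```

Both functions are independent, so a configuration flag could apply either, both, or neither.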
Readability Scoring with Dual Metrics
To support dataset curation and curriculum-based training, the suite includes a readability scoring module that scores each document using both the Flesch Reading Ease score and the Osman score. Documents are then ranked and assigned difficulty levels ranging from “very easy” to “very difficult” under each score. We also provide a final readability level that combines both metrics, enabling users to filter or structure data based on linguistic complexity.
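One way to picture the ranking and combination step is the sketch below. It makes several assumptions that are not stated above: both scores are oriented so that higher means easier, levels are assigned by rank quintile, the three intermediate level names are invented, and the combined level is the rounded average of the two per-metric levels.

```python
# Only the first and last labels come from the text; the middle three are assumed.
LEVELS = ["very easy", "easy", "medium", "difficult", "very difficult"]


def rank_levels(scores):
    """Assign each document a level index 0..4 by rank (highest score is easiest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    levels = [0] * len(scores)
    for rank, doc_index in enumerate(order):
        levels[doc_index] = min(rank * len(LEVELS) // len(scores), len(LEVELS) - 1)
    return levels


def combined_levels(flesch_scores, osman_scores):
    """Average the two per-metric level indices, rounding halves toward 'harder'."""
    flesch = rank_levels(flesch_scores)
    osman = rank_levels(osman_scores)
    return [LEVELS[(f + o + 1) // 2] for f, o in zip(flesch, osman)]
```

The resulting labels can be used directly as a filter or as a sort key when ordering data for curriculum-style training.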
2. Tokenizer Adaptation
Most LLM tokenizers are optimized for high-resource languages, which leaves languages like Arabic underrepresented. As a result, common Arabic words often get split into many subword units, increasing sequence lengths and hurting model performance. To address this, models are typically adapted to target languages through vocabulary extension or full vocabulary replacement. While they can be effective, both approaches increase model size and/or computational cost.
In the DALLA suite, we introduce a lightweight alternative: token reuse. Instead of expanding the vocabulary, token reuse repurposes token IDs belonging to user-excluded languages and reassigns them to new Arabic tokens. User-excluded languages are languages that the model is not intended to support in deployment. For example, if a model will only be used for Arabic and English, token IDs associated with all other languages can be safely reused for new Arabic tokens instead. This preserves the original tokenizer structure and embedding size while improving Arabic coverage. This method is based on two papers from our team, one introducing token reuse for SentencePiece tokenizers and another for BPE tokenizers.
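The core idea can be illustrated on a toy vocabulary. The sketch below is not the released tooling: it operates on a plain token-to-ID dictionary, identifies excluded languages by Unicode script name (an assumed heuristic), and leaves out everything a real SentencePiece or BPE tokenizer would also need, such as merge rules, piece scores, and embedding re-initialization.

```python
import unicodedata


def scripts_of(token):
    """Return the set of coarse script labels (first word of the Unicode name) of a token's letters."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    }


def reuse_tokens(vocab, new_tokens, excluded_scripts):
    """Return a copy of vocab where IDs of excluded-script tokens are reassigned to new tokens."""
    excluded = set(excluded_scripts)
    candidates = sorted(
        (token_id, token)
        for token, token_id in vocab.items()
        if scripts_of(token) and scripts_of(token) <= excluded
    )
    # Skip duplicates and tokens the vocabulary already contains.
    to_add = [t for t in dict.fromkeys(new_tokens) if t not in vocab]
    if len(to_add) > len(candidates):
        raise ValueError("not enough reusable token IDs for the new tokens")
    adapted = dict(vocab)
    for (token_id, old_token), new_token in zip(candidates, to_add):
        del adapted[old_token]
        adapted[new_token] = token_id
    return adapted
```

Because every new token takes over an existing ID, the vocabulary size, and therefore the embedding matrix shape, stays unchanged.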
We provide two easy-to-use tools implementing token reuse, one for SentencePiece and one for BPE tokenizers, making it simple to adapt HuggingFace-based models to Arabic without increasing computational costs.
3. Dataset Packing for High-Efficiency Training
During LLM training, packing multiple tokenized samples into a single sequence that fills the model's maximum sequence length is essential for reducing training time. However, most existing packing implementations, including those in common frameworks like HuggingFace, split or truncate samples across sequence boundaries. This can reduce the quality and consistency of the training data.
The DALLA suite includes a simple and reliable packing module that works seamlessly with any HuggingFace-based tokenizer, including the token reuse-based tokenizers we introduced. The module keeps every sample intact by filling each training sequence with as many complete samples as possible. When a sample does not fit in the remaining space, the sequence is padded and the sample is moved to the next one. No samples are ever cut or split. This approach leads to faster training and fully reproducible dataset shards across experiments.
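The packing rule described above amounts to a greedy, order-preserving fill. A minimal sketch follows, assuming samples are already tokenized into lists of IDs. The function name, the `pad_id` parameter, and the choice to raise an error for a sample longer than the maximum length are assumptions, and attention masks and label masking are left out.

```python
def pack_samples(samples, max_len, pad_id):
    """Pack whole samples into fixed-length sequences, padding instead of splitting."""
    sequences, current = [], []
    for sample in samples:
        if len(sample) > max_len:
            # Assumed behavior: the text does not say how oversized samples are handled.
            raise ValueError("sample is longer than max_len and cannot be packed intact")
        if len(current) + len(sample) > max_len:
            # The sample does not fit: pad the current sequence and start a new one.
            sequences.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current = current + list(sample)
    if current:
        sequences.append(current + [pad_id] * (max_len - len(current)))
    return sequences
```

Because the procedure is deterministic and keeps samples in their input order, the same input always yields the same packed sequences, which is what makes the resulting shards reproducible.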
4. Training Recipe
To make the entire pipeline easy to use end-to-end, we are also releasing the [training recipe](https://github.com/U4RASD/dalla-model-training) used to build the DALLA models. The recipe integrates all components of the suite, including the data-processing pipeline, the tokenizer-adaptation tools, the packing module, and customizable layer-freezing options.
DALLA: Doha Arabic Large Language Models
With all the components of the DALLA suite working together, from data cleaning and tokenizer adaptation to packing and training, we built the first generation of DALLA models by adapting two open-weight models: Gemma 2 9B and Llama 3.1 8B. These models illustrate how the suite enables reproducible, end-to-end pipelines for adapting open-weight LLMs to handle Arabic more efficiently and with better cultural alignment to Arab communities.
1. Training Data Corpus
To build and adapt the DALLA models, we curated a collection of high-quality Arabic datasets tailored for each stage of the pipeline: tokenizer training, continued pretraining, and supervised fine-tuning. While the combined dataset sizes remain relatively small compared to the massive corpora used to train most general-purpose LLMs, this initial release focuses on curated, culturally meaningful Arabic content.
Tokenizer Training
We built a tokenizer-training corpus from three sources: Middle Eastern school curricula; Arab Center for Research and Policy Studies (ACRPS) book publications (2012–2024), which cover topics in society, governance, culture, economics, and public policy; and Arabic Wikipedia. This data mix provides a broad and representative foundation for Arabic tokenization. The tokenizer-training corpus contains 447M tokens, which expand to 721M tokens after stemming.
Continued Pretraining
For continued pretraining, we expanded the tokenizer-training dataset with the Doha Historical Dictionary of Arabic Language (DHD) corpus, the public Hindawi books dataset, the Open Islamicate Texts Initiative corpus, and a proprietary dataset focused on Islamic studies and contemporary social issues. Together, these sources form a culturally rich and diverse pretraining dataset with 3.7B tokens, or 5.8B tokens after stemming.
Supervised Fine-Tuning
For supervised fine-tuning, we created question–answer datasets from ACRPS publications, biographies, and school curricula. We also developed datasets from DHD, focusing on meanings, roots, lemmas, and parts of speech. In addition, we incorporated public datasets such as CIDAR, filtered Arabic-OpenHermes-2.5, and a translated version of everyday-conversations-llama3.1-2k. The final supervised fine-tuning dataset contains 2.4M examples.
2. Cultural Alignment
To assess how well the adapted models reflect values closer to those of Arab communities, we used an evaluation inspired by the World Values Survey (WVS), a global study of cultural and social values. We adapted the ten WVS questions used in the Inglehart–Welzel Cultural Map so they could be answered by language models while preserving the two core value dimensions they measure: Traditional vs. Secular-Rational and Survival vs. Self-Expression.
Using these adapted questions, we measured where each model falls on the same two value dimensions defined in the WVS. This evaluation enables us to see not only which cultures the base models resemble, but also how the DALLA-adapted versions shift toward value positions that are more aligned with Arab communities.
The figure illustrates how the DALLA models reposition the base models on the cultural map. After adaptation, the models move away from the value regions associated with Western countries and closer to the cluster of Arab countries, indicating improved cultural alignment and value consistency with Arab communities.
Final Thoughts
Our goal with the DALLA suite is to provide a practical, end-to-end foundation for developing Arabic-centric LLMs. Each component is designed to be modular, efficient, reproducible, and aware of the linguistic characteristics of Arabic.
We hope the suite serves as a useful starting point for researchers, developers, and anyone interested in advancing Arabic NLP. We will continue improving and expanding both the tools and the DALLA models themselves, and we warmly welcome collaboration and community feedback as we move forward.