NAKBA NLP 2026: Arabic Manuscript Understanding Shared Task
Introduction
Automated understanding of Arabic manuscripts is a cornerstone for opening up historical archives, personal memoirs, and cultural heritage collections to large-scale search, analysis, and digital humanities research. Yet, robust OCR and transcription for Arabic manuscripts remain challenging due to handwritten variation, page degradation, complex layouts, and the rich morphology of the Arabic language.
This shared task advances research on Arabic manuscript OCR and transcription by providing an open, curated benchmark of high-resolution manuscript page images paired with expert-verified line-level transcriptions. Participants will receive a dataset from the Omar Al-Saleh Memoir Collection, spanning 16 documents from 1951 to 1965, with approximately 1,597,025 words, 50,685 sentences, 50,672 paragraphs, and an estimated 6,395 pages.
Teams will compete in two complementary tracks:
- A Transcription Track, which enriches the corpus with high-quality manual transcriptions for unseen pages
- A Systems Track, which develops and evaluates automatic transcription models for Arabic manuscripts
The task encourages collaboration between digital humanities, manuscript studies, and Arabic NLP, and aims to release methods, results, and datasets openly in support of sustainable progress in Arabic manuscript OCR.
Subtask 1: Transcription Track (Manual Ground Truth Enrichment)
Objective
This subtask focuses on creating expert-quality, line-by-line transcriptions for a subset of manuscript pages that are not yet fully transcribed. The goal is to enrich the benchmark with reliable ground truth that can be used to train and evaluate OCR/HTR (Handwritten Text Recognition) systems for Arabic manuscripts.
Dataset
Participating teams will be provided with high-resolution page images sampled from the Omar Al-Saleh memoir collection. The full collection comprises 16 documents covering the years 1951–1965, with an estimated 6,395 pages, 1,597,025 words, 50,685 sentences, and 50,672 paragraphs.
For Subtask 1, each team will receive:
- One assigned batch (mandatory): A batch folder containing ~500 line images that must be transcribed
- Access to additional batches (bonus): Teams who wish to contribute more transcriptions may request access to additional batches beyond their assigned one
Each batch folder contains:
- images/ folder with cropped line images requiring transcription
- annotations.csv template file to be filled with transcriptions
Additionally, the full unannotated dataset with complete page images will be provided, allowing participants to view the full page context when transcribing individual lines. The source_image column in the CSV references which full page each line comes from.
Detailed guidelines and examples (including complex lines and corner cases) will be provided to participating teams along with the data.
Transcription Requirements
- Transcriptions must be manual and line-aligned (one text line per image filename)
- Participants should follow the provided orthographic conventions (e.g., normalization rules, punctuation handling); an illustrative sketch of such rules follows this list
- No generative AI tools may be used to automatically generate or “correct” transcriptions, fully or partially, directly or indirectly
- Ambiguities (unclear characters, damaged areas) should be handled according to each team’s own guidelines and should be fully described in their system paper
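The official orthographic conventions will be distributed with the data and take precedence over anything shown here. Purely as an illustration of the kind of normalization rules involved, a minimal Python sketch (assuming, hypothetically, that alef-variant unification, tatweel removal, and diacritic stripping are among the conventions) might look like this:

```python
import re

# Illustrative normalization only; the official conventions shipped with the
# data take precedence over anything shown here.
ALEF_VARIANTS = "\u0622\u0623\u0625"         # آ أ إ  -> ا
TATWEEL = "\u0640"                            # ـ (kashida)
DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan .. sukun

def normalize_line(text: str) -> str:
    """Apply example normalization rules to a transcribed line."""
    for ch in ALEF_VARIANTS:
        text = text.replace(ch, "\u0627")     # unify alef forms
    text = text.replace(TATWEEL, "")          # drop tatweel/kashida
    text = DIACRITICS.sub("", text)           # strip short vowels
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```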
Submission Format
Participants will receive an annotations.csv template with the following columns:
- filename: Image filename (e.g., 1960_p149_l0055.png)
- text: Transcription (empty in the template; participants fill this in)
- source_image: Reference to the full page image for context (e.g., 1b363a3319eb499aab16e5460d6115ed-0149.jpg)
- year: Document year (e.g., 1960, 1953a)
- page: Page number
- line: Line number
To submit: Fill in the text column with your manual transcription for each line, then upload the completed CSV to CodaBench.
Submissions will be automatically validated to ensure all required lines are transcribed. Incomplete submissions (empty text fields) will be flagged.
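Before uploading, teams may wish to run a quick local check that no text field is left empty. A minimal sketch using Python's standard csv module, assuming the column names listed above:

```python
import csv

def check_submission(path: str = "annotations.csv") -> None:
    """Flag rows whose `text` field is still empty before uploading to CodaBench."""
    missing = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not row.get("text", "").strip():
                missing.append(row.get("filename", "<unknown>"))
    if missing:
        print(f"{len(missing)} lines still missing a transcription, e.g. {missing[:5]}")
    else:
        print("All lines transcribed.")

if __name__ == "__main__":
    check_submission()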
Evaluation
Submissions will be evaluated on:
- Coverage – Number of pages and lines transcribed (more coverage is better)
- Accuracy – Agreement with an internal, expert-verified ground truth using character-level and word-level edit distance metrics (e.g., CER, WER)
- Guidelines Document – Each team must submit a short guidelines document describing:
  - Main and corner cases (with examples)
  - How ambiguous or damaged text is handled
  - Consistency measures across annotators
  - Ethical and historical considerations
  - Training/support procedures
The more robust and clearly documented the transcription guidelines, the higher the score on the guidelines criterion.
Subtask 2: Systems Track (Automatic Manuscript OCR/HTR)
Objective
The goal of this subtask is to develop automatic systems that transcribe Arabic manuscript page images into machine-readable text. Participants may use purely visual models, sequence models, or multimodal approaches, and may fine-tune on the provided training data or explore zero- and few-shot strategies.
Dataset
For Subtask 2, we provide pre-split datasets sampled from the Omar Al-Saleh manuscript collection:
- Training set: ~15,962 line images with gold transcriptions for model training
- Development set: ~1,774 line images with gold transcriptions for local validation
- Test set: ~2,095 line images (transcriptions held out for evaluation)
Each split contains:
- images/ folder with cropped line images (.jpg format)
- annotations.csv file with columns:
  - image: Image filename (e.g., 1b363a3319eb499aab16e5460d6115ed-0004-22.jpg)
  - text: Gold transcription (provided for training/dev, empty for test submissions)
Additionally, the full unannotated dataset with complete page images will be available, allowing participants to leverage full page context or train on additional data.
Data will be shared with participants via email (Dropbox, Google Drive, or S3 links). Other options include gated HuggingFace repositories or distribution through CodaBench itself (after the participant is approved).
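For teams that want a quick way to iterate over a split locally, a minimal loading sketch is shown below; it assumes pandas and Pillow are installed and follows the split layout described above:

```python
from pathlib import Path

import pandas as pd
from PIL import Image

def load_split(split_dir: str):
    """Yield (PIL image, gold text) pairs for one split (train/dev).

    Assumes the layout described above: <split_dir>/annotations.csv with
    `image` and `text` columns, and line images under <split_dir>/images/.
    """
    split = Path(split_dir)
    df = pd.read_csv(split / "annotations.csv")
    for _, row in df.iterrows():
        image = Image.open(split / "images" / row["image"]).convert("RGB")
        yield image, row.get("text", "")

# Example: iterate over the training split
# for img, text in load_split("train"):
#     ...
```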
Submission Format
Participants should submit a CSV file containing the automatically generated transcriptions for the test set, with the following columns:
- image or filename: Image filename (must match the provided test images exactly)
- text: Model-generated transcription
The CSV file should be uploaded to CodaBench. CodaBench competition links will be shared here.
Evaluation Metrics
- Character Error Rate (CER) – Primary metric, computed as normalized edit distance between predicted and gold strings
- Word Error Rate (WER) – Secondary metric, capturing word-level transcription quality
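The official scoring will be run on CodaBench; for local validation, a minimal pure-Python sketch of CER and WER as normalized edit distances (assuming whitespace tokenization for WER) is:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word edit distance over number of reference words."""
    ref_w, hyp_w = ref.split(), hyp.split()
    return edit_distance(ref_w, hyp_w) / max(len(ref_w), 1)
```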
Baseline Systems
We run a zero-shot baseline with an open-source vision-language OCR model, using DeepSeek OCR and/or Qwen3 VL on line images without any fine-tuning. For each line, we pass the cropped line image to the model and record the raw text output, without post-correction. We report Character Error Rate as the primary metric and Word Error Rate as the secondary metric, computed against the gold transcriptions. This baseline measures the out-of-the-box performance of newly available open-source vision models on Arabic manuscript HTR.
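A structural sketch of this baseline loop is given below. The transcribe_line function is a placeholder: the actual inference call depends on the chosen model (DeepSeek OCR or Qwen3 VL) and its released API, and is not shown here.

```python
import csv
from pathlib import Path

from PIL import Image

def transcribe_line(image: Image.Image) -> str:
    """Placeholder for the zero-shot model call (DeepSeek OCR or Qwen3 VL).

    Replace with the inference code of the chosen open-source VLM; no
    fine-tuning and no post-correction are applied in the baseline.
    """
    raise NotImplementedError

def run_baseline(test_dir: str = "test", out_path: str = "predictions.csv") -> None:
    """Transcribe every test line image and write a CodaBench-ready CSV."""
    test = Path(test_dir)
    with open(test / "annotations.csv", newline="", encoding="utf-8") as f:
        filenames = [row["image"] for row in csv.DictReader(f)]
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["image", "text"])
        for name in filenames:
            image = Image.open(test / "images" / name).convert("RGB")
            writer.writerow([name, transcribe_line(image)])
```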
Guidelines for Participating Teams
- Participants may choose to participate in one or both subtasks (Transcription Track and Systems Track)
- All participants must register through the official website to receive access to datasets and updates
- Upon requesting access to the data, participants must agree to submit a system description paper (4 pages) detailing their approach, methodology, data usage (including any external data), and findings
- Submissions will be peer-reviewed, and selected papers will be published in the proceedings/venue
- Participants are required to create an account on the selected submission platform (e.g., OpenReview) for paper submission and review processes
- All contributed transcriptions from all participants (where licenses permit) will be published in a shared GitHub repository under a license (e.g., CC-BY-4.0)
Important Dates
- January 1: Call for Participation (Subtask 1 & Subtask 2)
- January 10: Release of Transcription Data and Evaluation Phase Begins (Subtask 1); Release of Training Data (Subtask 2)
- February 10: Evaluation Period Begins (Subtask 2 – test set released, blinded samples)
- February 17: Evaluation Phase Ends (Subtask 1); Evaluation Period Ends (Subtask 2)
- February 21: Release of Results (Subtask 1 & Subtask 2)
- March 1: System Paper Submission Deadline (Participating Teams)
- March 15: Notification of Acceptance of System Papers
- March 21: Camera-Ready Paper Deadline
Organizers
- Fadi Zaraket · fadi.zaraket@dohainstitute.edu.qa · Arab Center for Research and Policy Studies / American University of Beirut
- Bilal Shalash · bilal.shalash@dohainstitute.org · Arab Center for Research and Policy Studies
- Hadi Hamoud · hhamoud@dohainstitute.edu.qa · Arab Center for Research and Policy Studies
- Ahmad Chamseddine · achamsed@dohainstitute.edu.qa · Arab Center for Research and Policy Studies
- Firas Ben Abid · firas.ben.abid@zinki.ai · Zinki AI
- Mustafa Jarrar · mjarrar@birzeit.edu · Hamad Bin Khalifa University / Birzeit University
- Chadi Abou Chakra · cabouchakr@dohainstitute.edu.qa · Arab Center for Research and Policy Studies
- Andrew Naaem · anaaem@dohainstitute.edu.qa · Arab Center for Research and Policy Studies
- Bernard Ghanem · bernard.ghanem@kaust.edu.sa · King Abdullah University of Science and Technology
- Monther Salahat · msalahat@birzeit.edu · Birzeit University
Contact Us
For inquiries, please contact us at: ar-ms@dohainstitute.edu.qa