NAKBA NLP 2026: Arabic Manuscript Understanding Shared Task

Apply to Task  ·  Subtask 1 CodaBench  ·  Subtask 2 CodaBench  ·  Baseline GitHub  ·  Baseline HuggingFace  ·  Contact Us

Introduction

Automated understanding of Arabic manuscripts is a cornerstone for opening up historical archives, personal memoirs, and cultural heritage collections to large-scale search, analysis, and digital humanities research. Yet, robust OCR and transcription for Arabic manuscripts remain challenging due to handwritten variation, page degradation, complex layouts, and the rich morphology of the Arabic language.

This shared task advances research on Arabic manuscript OCR and transcription by providing an open, curated benchmark of high-resolution manuscript page images paired with expert-verified line-level transcriptions. Participants will receive a dataset drawn from the Omar Al-Saleh Memoir Collection, which spans 16 documents from 1951 to 1965 and contains approximately 1,597,025 words, 50,685 sentences, 50,672 paragraphs, and an estimated 6,395 pages.

Teams will compete in two complementary tracks:

  1. A Transcription Track, in which teams enrich the corpus with high-quality manual transcriptions for unseen pages
  2. A Systems Track, in which teams develop and evaluate automatic transcription models for Arabic manuscripts

The task encourages collaboration between digital humanities, manuscript studies, and Arabic NLP, and aims to release methods, results, and datasets openly in support of sustainable progress in Arabic manuscript OCR.

Subtask 1: Transcription Track (Manual Ground Truth Enrichment)

Objective

This subtask focuses on creating expert-quality, line-by-line transcriptions for a subset of manuscript pages that are not yet fully transcribed. The goal is to enrich the benchmark with reliable ground truth that can be used to train and evaluate OCR/HTR (Handwritten Text Recognition) systems for Arabic manuscripts.

Dataset

Participating teams will be provided with high-resolution page images sampled from the Omar Al-Saleh memoir collection. The full collection comprises 16 documents covering the years 1951–1965, with an estimated 6,395 pages, 1,597,025 words, 50,685 sentences, and 50,672 paragraphs.

For Subtask 1, each team will receive:

  • One assigned batch (mandatory): A batch folder containing ~500 line images that must be transcribed
  • Access to additional batches (bonus): Teams who wish to contribute more transcriptions may request access to additional batches beyond their assigned one

Each batch folder contains:

  • images/ folder with cropped line images requiring transcription
  • annotations.csv template file to be filled with transcriptions

Additionally, the full unannotated dataset with complete page images will be provided, allowing participants to view the full page context when transcribing individual lines. The source_image column in the CSV references which full page each line comes from.
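For example, line images can be paired with their full-page context programmatically. The following is a minimal sketch, assuming the batch layout described above; batch_01 and full_pages are illustrative local paths, not official folder names, and the CSV columns are described under Submission Format below.

    import csv
    from pathlib import Path

    BATCH_DIR = Path("batch_01")      # placeholder path to the assigned batch folder
    PAGES_DIR = Path("full_pages")    # placeholder path to the unannotated full-page images

    with open(BATCH_DIR / "annotations.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Show the first few line images together with the page each one was cropped from.
    for row in rows[:5]:
        line_image = BATCH_DIR / "images" / row["filename"]
        page_image = PAGES_DIR / row["source_image"]
        print(f"{line_image}  <-  {page_image}")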

Detailed guidelines and examples (including complex lines and corner cases) will be provided to participating teams.

Transcription Requirements

  • Transcriptions must be manual and line-aligned (one text line per image filename)
  • Participants should follow the provided orthographic conventions (e.g., normalization rules, punctuation handling)
  • No generative AI tools may be used to automatically generate or “correct” transcriptions, fully or partially, directly or indirectly
  • Ambiguities (unclear characters, damaged areas) should be handled according to the team’s own guidelines and fully described in the team’s system paper

Submission Format

Participants will receive an annotations.csv template with the following columns:

  • filename: Image filename (e.g., 1960_p149_l0055.png)
  • text: Transcription (empty in template – participants fill this)
  • source_image: Reference to the full page image for context (e.g., 1b363a3319eb499aab16e5460d6115ed-0149.jpg)
  • year: Document year (e.g., 1960, 1953a)
  • page: Page number
  • line: Line number

To submit: Fill in the text column with your manual transcription for each line, then upload the completed CSV to CodaBench.

Submissions will be automatically validated to ensure all required lines are transcribed. Incomplete submissions (empty text fields) will be flagged.
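Teams can replicate this check locally before uploading. A minimal sketch, assuming the completed template is saved as annotations.csv:

    import csv

    with open("annotations.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Flag rows whose text field was left empty.
    empty = [row["filename"] for row in rows if not row["text"].strip()]
    if empty:
        print(f"{len(empty)} lines still missing a transcription, e.g. {empty[:3]}")
    else:
        print(f"All {len(rows)} lines transcribed – ready to upload to CodaBench.")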

Evaluation

Submissions will be evaluated on:

  1. Coverage – Number of pages and lines transcribed (more coverage is better)
  2. Accuracy – Agreement with an internal, expert-verified ground truth using character-level and word-level edit distance metrics (e.g., CER, WER)
  3. Guidelines Document – Each team must submit a short guidelines document describing:
    • Main and corner cases (with examples)
    • How ambiguous or damaged text is handled
    • Consistency measures across annotators
    • Ethical and historical considerations
    • Training/support procedures

The more robust and clearly documented the transcription guidelines, the higher the score on the guidelines criterion.

Subtask 2: Systems Track (Automatic Manuscript OCR/HTR)

Objective

The goal of this subtask is to develop automatic systems that transcribe Arabic manuscript page images into machine-readable text. Participants may use purely visual models, sequence models, or multimodal approaches, and may fine-tune on the provided training data or explore zero- and few-shot strategies.

Dataset

For Subtask 2, we provide pre-split datasets sampled from the Omar Al-Saleh manuscript collection:

  • Training set: ~15,962 line images with gold transcriptions for model training
  • Development set: ~1,774 line images with gold transcriptions for local validation
  • Test set: ~2,095 line images (transcriptions held out for evaluation)

Each split contains:

  • images/ folder with cropped line images (.jpg format)
  • annotations.csv file with columns:
    • image: Image filename (e.g., 1b363a3319eb499aab16e5460d6115ed-0004-22.jpg)
    • text: Gold transcription (provided for training/dev, empty for test submissions)

Additionally, the full unannotated dataset with complete page images will be available, allowing participants to leverage full page context or train on additional data.

Data will be shared with participants via email links (Dropbox, Google Drive, or S3). Alternative distribution options include gated HuggingFace repositories or CodaBench itself (once a participant is approved).
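As an illustration of how a split can be consumed for training, the sketch below wraps it in a PyTorch Dataset. This is an assumption about one possible setup, not part of the official release; any framework may be used.

    import csv
    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset

    class LineOcrDataset(Dataset):
        """Pairs each cropped line image with its gold transcription."""

        def __init__(self, split_dir, transform=None):
            split_dir = Path(split_dir)
            self.images_dir = split_dir / "images"
            with open(split_dir / "annotations.csv", newline="", encoding="utf-8") as f:
                self.rows = list(csv.DictReader(f))
            self.transform = transform

        def __len__(self):
            return len(self.rows)

        def __getitem__(self, idx):
            row = self.rows[idx]
            image = Image.open(self.images_dir / row["image"]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            return image, row["text"]

    train_set = LineOcrDataset("train")   # placeholder path to the training split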

Submission Format

Participants should submit a CSV file containing the automatically generated transcriptions for the test set, with the following columns:

  • image or filename: Image filename (must match the provided test images exactly)
  • text: Model-generated transcription

The CSV file should be uploaded to CodaBench. CodaBench competition links will be shared here.
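A minimal helper for producing this file is sketched below. It assumes predictions is a dict mapping each test image filename to the model output; the helper name and the choice of the image column are illustrative.

    import csv

    def write_submission(predictions, path="submission.csv"):
        """Write a CodaBench submission CSV.

        predictions: dict mapping test image filename -> model-generated transcription.
        """
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["image", "text"])
            writer.writeheader()
            for filename, text in predictions.items():
                writer.writerow({"image": filename, "text": text})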

Evaluation Metrics

  1. Character Error Rate (CER) – Primary metric, computed as normalized edit distance between predicted and gold strings
  2. Word Error Rate (WER) – Secondary metric, capturing word-level transcription quality
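For local validation, these metrics can be approximated as in the minimal sketch below (the official CodaBench scorer may differ in normalization and tokenization details). CER operates on characters and WER on whitespace-separated tokens, both normalized by the length of the reference.

    def edit_distance(ref, hyp):
        """Levenshtein distance between two sequences (characters or tokens)."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i]
            for j, h in enumerate(hyp, start=1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1]

    def cer(references, hypotheses):
        return sum(edit_distance(r, h) for r, h in zip(references, hypotheses)) / \
               sum(len(r) for r in references)

    def wer(references, hypotheses):
        return sum(edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses)) / \
               sum(len(r.split()) for r in references)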

Baseline Systems

We run a zero-shot baseline with an open-source vision-language OCR model, using DeepSeek OCR and/or Qwen3 VL on line images without any fine-tuning. For each line image, we pass the cropped image to the model and record the raw text output without post-correction. We report Character Error Rate as the primary metric and Word Error Rate as the secondary metric, computed against the gold transcriptions. This baseline measures the out-of-the-box performance of newly available open-source vision models on Arabic manuscript HTR.
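A minimal sketch of such a zero-shot loop is shown below, assuming a recent version of the transformers library that provides the "image-text-to-text" pipeline; the model identifier, prompt, message format, and test-split path are placeholders and should be replaced with the actual baseline configuration (see the Baseline GitHub and HuggingFace links above).

    from pathlib import Path
    from transformers import pipeline

    # Model checkpoint and prompt are assumptions, not the official baseline configuration.
    pipe = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

    predictions = {}
    for image_path in sorted(Path("test/images").glob("*.jpg")):   # placeholder test split path
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": str(image_path)},
                {"type": "text", "text": "Transcribe the Arabic text in this line image."},
            ],
        }]
        out = pipe(text=messages, max_new_tokens=128, return_full_text=False)
        predictions[image_path.name] = out[0]["generated_text"]

The resulting predictions dict can then be written out with the write_submission helper sketched under Submission Format above.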

Guidelines for Participating Teams

  • Participants may choose to participate in one or both subtasks (Transcription Track and Systems Track)
  • All participants must register through the official website to receive access to datasets and updates
  • Upon requesting access to the data, participants must agree to submit a system description paper (4 pages) detailing their approach, methodology, data usage (including any external data), and findings
  • Submissions will be peer-reviewed, and selected papers will be published in the proceedings/venue
  • Participants are required to create an account on the selected submission platform (e.g., OpenReview) for paper submission and review processes
  • All contributed transcriptions from all participants (where licenses permit) will be published in a shared GitHub repository under an open license (e.g., CC-BY-4.0)

Important Dates

  • January 1
    Call for Participation (Subtask 1 & Subtask 2)

  • January 10
    Release of Transcription Data and Evaluation Phase Begins (Subtask 1)
    Release of Training Data (Subtask 2)

  • February 10
    Evaluation Period Begins (Subtask 2 – test set released with transcriptions withheld)

  • February 17
    Evaluation Phase Ends (Subtask 1)
    Evaluation Period Ends (Subtask 2)

  • February 21
    Release of Results (Subtask 1 & Subtask 2)

  • March 1
    System Paper Submission Deadline (Participating Teams)

  • March 15
    Notification of Acceptance of System Papers

  • March 21
    Camera-Ready Paper Deadline

Organizers

Contact Us

For inquiries, please contact us at: ar-ms@dohainstitute.edu.qa