NAKBA NLP 2026: Arabic Manuscript Understanding Shared Task


Introduction

Automated understanding of Arabic manuscripts is a cornerstone for opening up historical archives, personal memoirs, and cultural heritage collections to large-scale search, analysis, and digital humanities research. Yet, robust OCR and transcription for Arabic manuscripts remain challenging due to handwritten variation, page degradation, complex layouts, and the rich morphology of the Arabic language.

This shared task, organized as part of LREC-COLING 2026, advanced research on Arabic manuscript OCR and transcription by providing an open, curated benchmark of high-resolution manuscript page images paired with expert-verified line-level transcriptions. The task used the Omar Al-Saleh Memoir Collection, spanning 16 documents from 1951 to 1965, with approximately 1,597,025 words, 50,685 sentences, 50,672 paragraphs, and 6,395 pages.

Teams competed in two complementary tracks:

  1. A Transcription Track, in which teams enriched the corpus with high-quality manual transcriptions for unseen pages
  2. A Systems Track, in which teams developed and evaluated automatic transcription models for Arabic manuscripts

The task encourages collaboration between digital humanities, manuscript studies, and Arabic NLP, and releases methods, results, and datasets openly in support of sustainable progress in Arabic manuscript OCR.

Results, Leaderboard, and Winners

The shared task has now concluded.

Winners

Category                                            Winner      Notes
Subtask 1 (Transcription Track)                     PalNLP      CER 0.061, WER 0.285, 499/500 lines completed
Subtask 2 (Systems Track, official corpus ranking)  Misraj AI   CER 0.079, WER 0.244
Subtask 2 (best per-line result)                    Ketaba OCR  CER 0.082, WER 0.259

Subtask 1 Leaderboard — Transcription Track

Rank  Team         Valid Transcriptions  Completion  Hidden Test Samples  CER    WER
1     PalNLP       499                   99.8%       49/50                0.061  0.285
2     Independent  497                   99.4%       47/50                0.086  0.366
3     Sard         401                   80.0%       39/50                0.114  0.433

Subtask 2 Leaderboard — Systems Track (Corpus-Level Official Ranking)

Rank  Team               CER    WER
1     Misraj AI          0.079  0.244
2     Oblevit            0.093  0.327
3     Ketaba OCR         0.094  0.300
4     Latent Narratives  0.105  0.311
5     Al-Warraq          0.114  0.378
6     Not Gemma          0.122  0.306
7     Fahras             0.227  0.522
–     Baseline           0.368  0.691

Subtask 2 Per-Line Results

Rank  Team               CER    WER
1     Ketaba OCR         0.082  0.259
2     Misraj AI          0.090  0.252
3     Latent Narratives  0.100  0.285
4     Oblevit            0.105  0.332
5     Al-Warraq          0.106  0.347
6     Not Gemma          0.110  0.313
7     Fahras             0.182  0.430
–     Baseline           0.281  0.588

All contributed transcriptions and system outputs are released under CC-BY-4.0.

Datasets

Subtask 1: Transcription Track (Manual Ground Truth Enrichment)

Objective

This subtask focuses on creating expert-quality, line-by-line transcriptions for a subset of manuscript pages that are not yet fully transcribed. The goal is to enrich the benchmark with reliable ground truth that can be used to train and evaluate OCR/HTR (Handwritten Text Recognition) systems for Arabic manuscripts.

Dataset

Participating teams will be provided with high-resolution page images sampled from the Omar Al-Saleh memoir collection. The full collection comprises 16 documents covering the years 1951–1965, with an estimated 6,395 pages, 1,597,025 words, 50,685 sentences, and 50,672 paragraphs.

For Subtask 1, each team will receive:

  • One assigned batch (mandatory): A batch folder containing ~500 line images that must be transcribed
  • Access to additional batches (bonus): Teams who wish to contribute more transcriptions may request access to additional batches beyond their assigned one

Each batch folder contains:

  • images/ folder with cropped line images requiring transcription
  • annotations.csv template file to be filled with transcriptions

Additionally, the full unannotated dataset with complete page images will be provided, allowing participants to view the full page context when transcribing individual lines. The source_image column in the CSV references which full page each line comes from.

Detailed guidelines and examples (including complex lines and corner cases) are provided to participants.

Transcription Requirements

  • Transcriptions must be manual and line-aligned (one text line per image filename)
  • Participants should follow the provided orthographic conventions (e.g., normalization rules, punctuation handling)
  • No generative AI tools may be used to automatically generate or “correct” transcriptions, fully or partially, directly or indirectly
  • Ambiguities (unclear characters, damaged areas) should be handled according to the team’s own guidelines and fully described in their system paper

Submission Format

Participants will receive an annotations.csv template with the following columns:

  • filename: Image filename (e.g., 1960_p149_l0055.png)
  • text: Transcription (empty in template – participants fill this)
  • source_image: Reference to the full page image for context (e.g., 1b363a3319eb499aab16e5460d6115ed-0149.jpg)
  • year: Document year (e.g., 1960, 1953a)
  • page: Page number
  • line: Line number

To submit: Fill in the text column with your manual transcription for each line, then upload the completed CSV to CodaBench.

Submissions will be automatically validated to ensure all required lines are transcribed. Incomplete submissions (empty text fields) will be flagged.
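The completeness check described above can be approximated locally before uploading. The sketch below is illustrative, not the official CodaBench validator; it assumes only the `filename` and `text` columns from the template described earlier.

```python
import csv

def find_empty_lines(csv_path):
    """Return filenames of rows in annotations.csv whose `text` field is still empty."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not row.get("text", "").strip():
                missing.append(row["filename"])
    return missing  # an empty list means every line has a transcription
```

Running this before submission lets a team catch accidentally skipped lines, since submissions with empty `text` fields are flagged.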

Evaluation

Submissions will be evaluated on:

  1. Coverage – Number of pages and lines transcribed (more coverage is better)
  2. Accuracy – Agreement with an internal, expert-verified ground truth using character-level and word-level edit distance metrics (e.g., CER, WER)
  3. Guidelines Document – Each team must submit a short guidelines document describing:
    • Main and corner cases (with examples)
    • How ambiguous or damaged text is handled
    • Consistency measures across annotators
    • Ethical and historical considerations
    • Training/support procedures

The more robust and clearly documented the transcription guidelines, the higher the score on the guidelines criterion.

Subtask 2: Systems Track (Automatic Manuscript OCR/HTR)

Objective

The goal of this subtask is to develop automatic systems that transcribe Arabic manuscript page images into machine-readable text. Participants may use purely visual models, sequence models, or multimodal approaches, and may fine-tune on the provided training data or explore zero- and few-shot strategies.

Dataset

For Subtask 2, we provide pre-split datasets sampled from the Omar Al-Saleh manuscript collection:

  • Training set: ~15,962 line images with gold transcriptions for model training
  • Development set: ~1,774 line images with gold transcriptions for local validation
  • Test set: ~2,671 line images (transcriptions held out for evaluation)

Each split contains:

  • images/ folder with cropped line images (.jpg format)
  • annotations.csv file with columns:
    • image: Image filename (e.g., 1b363a3319eb499aab16e5460d6115ed-0004-22.jpg)
    • text: Gold transcription (provided for training/dev, empty for test submissions)

Additionally, the full unannotated dataset with complete page images will be available, allowing participants to leverage full page context or train on additional data.

Data will be shared with participants via links sent by email (Dropbox, Google Drive, or S3). Other options include gated HuggingFace repositories, or distribution through CodaBench itself once a participant is approved.

Submission Format

Participants should submit a CSV file containing the automatically generated transcriptions for the test set, with the following columns:

  • image or filename: Image filename (must match the provided test images exactly)
  • text: Model-generated transcription

The CSV file should be uploaded to CodaBench. CodaBench competition links will be shared here.
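Because filenames must match the released test images exactly, a quick local check can catch missing or mismatched rows before uploading. This is a hypothetical pre-flight script, not the official validator; it handles the spec's allowance of either an `image` or a `filename` column.

```python
import csv

def check_test_submission(csv_path, expected_images):
    """Compare a submission CSV against the set of released test image names.

    Returns (missing, unexpected): test images without a row, and rows whose
    filename does not correspond to any released test image.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # The submission format allows either an `image` or a `filename` column.
        key = "image" if "image" in reader.fieldnames else "filename"
        submitted = {row[key] for row in reader}
    return expected_images - submitted, submitted - expected_images
```

Both returned sets should be empty for a valid submission.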

Evaluation Metrics

  1. Character Error Rate (CER) – Primary metric, computed as normalized edit distance between predicted and gold strings
  2. Word Error Rate (WER) – Secondary metric, capturing word-level transcription quality
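Concretely, CER is the Levenshtein edit distance between the predicted and gold strings, normalized by the reference length, and WER is the same computation over whitespace-separated tokens. A minimal reference implementation is sketched below; the official scorer may apply additional normalization (e.g., whitespace or diacritic handling) not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: char-level edit distance / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)
```

Corpus-level scores aggregate edit operations and reference lengths over all lines, whereas per-line scores average the per-line ratios, which is why the two rankings above can differ.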

Baseline Systems

The organizer baseline uses Qwen3-VL-8B-Instruct with LoRA fine-tuning on the shared-task training split. On the held-out test set, the baseline achieves 0.368 CER and 0.691 WER at corpus level. All seven submitted systems outperformed this baseline, with Misraj AI achieving the best official corpus-level result at 0.079 CER and 0.244 WER.

Guidelines for Participating Teams

  • Participants may choose to participate in one or both subtasks (Transcription Track and Systems Track)
  • All participants must register through the official website to receive access to datasets and updates
  • Upon requesting access to the data, participants must agree to submit a system description paper (4 pages) detailing their approach, methodology, data usage (including any external data), and findings
  • Submissions will be peer-reviewed, and selected papers will be published in the proceedings/venue
  • Participants are required to create an account on the selected submission platform (e.g., OpenReview) for paper submission and review processes
  • All contributed transcriptions from all participants (where licenses permit) will be published in a shared GitHub repository under a license (e.g., CC-BY-4.0)

Important Dates

  • January 1
    Call for Participation (Subtask 1 & Subtask 2)

  • January 10
    Release of Transcription Data and Evaluation Phase Begins (Subtask 1)
    Release of Training Data (Subtask 2)

  • February 10
    Evaluation Period Begins (Subtask 2 – Test set released, blinded samples)

  • February 17
    Evaluation Phase Ends (Subtask 1)
    Evaluation Period Ends (Subtask 2)

  • February 21
    Release of Results (Subtask 1 & Subtask 2)

  • March 1
    System Paper Submission Deadline (Participating Teams)

  • March 15
    Notification of Acceptance of System Papers

  • March 21
    Camera-Ready Paper Deadline

Organizers

Contact Us

For inquiries, please contact us at: ar-ms@dohainstitute.edu.qa