NAKBA NLP 2026: Arabic Manuscript Understanding Shared Task

Apply to Task  ·  Subtask 1 CodaBench  ·  Subtask 2 CodaBench  ·  Baseline GitHub  ·  Baseline HuggingFace  ·  Contact Us

Introduction

Automated understanding of Arabic manuscripts is a cornerstone for opening up historical archives, personal memoirs, and cultural heritage collections to large-scale search, analysis, and digital humanities research. Yet, robust OCR and transcription for Arabic manuscripts remain challenging due to handwritten variation, page degradation, complex layouts, and the rich morphology of the Arabic language.

This shared task advances research on Arabic manuscript OCR and transcription by providing an open, curated benchmark of high-resolution manuscript page images paired with expert-verified line-level transcriptions. Participants will receive a dataset drawn from the Omar Al-Saleh Memoir Collection, which spans 16 documents from 1951 to 1965 and contains approximately 1,597,025 words, 50,685 sentences, 50,672 paragraphs, and an estimated 6,395 pages.

Teams will compete in two complementary tracks:

  1. A Transcription Track, in which teams enrich the corpus with high-quality manual transcriptions for unseen pages
  2. A Systems Track, in which teams develop and evaluate automatic transcription models for Arabic manuscripts

The task encourages collaboration between digital humanities, manuscript studies, and Arabic NLP, and aims to release methods, results, and datasets openly in support of sustainable progress in Arabic manuscript OCR.

Subtask 1: Transcription Track (Manual Ground Truth Enrichment)

Objective

This subtask focuses on creating expert-quality, line-by-line transcriptions for a subset of manuscript pages that are not yet fully transcribed. The goal is to enrich the benchmark with reliable ground truth that can be used to train and evaluate OCR/HTR (Handwritten Text Recognition) systems for Arabic manuscripts.

Dataset

Participating teams will be provided with high-resolution page images sampled from the Omar Al-Saleh memoir collection. The full collection comprises 16 documents covering the years 1951–1965, with an estimated 6,395 pages, 1,597,025 words, 50,685 sentences, and 50,672 paragraphs.

For Subtask 1, each team will receive:

  • One assigned batch (mandatory): A batch folder containing ~500 line images that must be transcribed
  • Access to additional batches (bonus): Teams who wish to contribute more transcriptions may request access to additional batches beyond their assigned one

Each batch folder contains:

  • images/ folder with cropped line images requiring transcription
  • annotations.csv template file to be filled with transcriptions

Additionally, the full unannotated dataset with complete page images will be provided, allowing participants to view the full page context when transcribing individual lines. The source_image column in the CSV references which full page each line comes from.
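For example, line images can be paired with their full-page context programmatically. The following is a minimal sketch, assuming the batch layout described above; batch_01 and full_pages are illustrative local paths, not official folder names, and the CSV columns are described under Submission Format below.

    import csv
    from pathlib import Path

    BATCH_DIR = Path("batch_01")      # placeholder path to the assigned batch folder
    PAGES_DIR = Path("full_pages")    # placeholder path to the unannotated full-page images

    with open(BATCH_DIR / "annotations.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Show the first few line images together with the page each one was cropped from.
    for row in rows[:5]:
        line_image = BATCH_DIR / "images" / row["filename"]
        page_image = PAGES_DIR / row["source_image"]
        print(f"{line_image}  <-  {page_image}")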

Detailed guidelines and examples (including complex lines and corner cases) will be provided to participating teams.

Transcription Requirements

  • Transcriptions must be manual and line-aligned (one text line per image filename)
  • Participants should follow the provided orthographic conventions (e.g., normalization rules, punctuation handling)
  • No generative AI tools may be used to automatically generate or “correct” transcriptions, fully or partially, directly or indirectly
  • Ambiguities (unclear characters, damaged areas) should be handled according to the team’s own guidelines and fully described in the team’s system paper

Submission Format

Participants will receive an annotations.csv template with the following columns:

  • filename: Image filename (e.g., 1960_p149_l0055.png)
  • text: Transcription (empty in template – participants fill this)
  • source_image: Reference to the full page image for context (e.g., 1b363a3319eb499aab16e5460d6115ed-0149.jpg)
  • year: Document year (e.g., 1960, 1953a)
  • page: Page number
  • line: Line number

To submit: Fill in the text column with your manual transcription for each line, then upload the completed CSV to CodaBench.

Submissions will be automatically validated to ensure all required lines are transcribed. Incomplete submissions (empty text fields) will be flagged.
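Teams can replicate this check locally before uploading. A minimal sketch, assuming the completed template is saved as annotations.csv:

    import csv

    with open("annotations.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Flag rows whose text field was left empty.
    empty = [row["filename"] for row in rows if not row["text"].strip()]
    if empty:
        print(f"{len(empty)} lines still missing a transcription, e.g. {empty[:3]}")
    else:
        print(f"All {len(rows)} lines transcribed – ready to upload to CodaBench.")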

Evaluation

Submissions will be evaluated on:

  1. Coverage – Number of pages and lines transcribed (more coverage is better)
  2. Accuracy – Agreement with an internal, expert-verified ground truth using character-level and word-level edit distance metrics (e.g., CER, WER)
  3. Guidelines Document – Each team must submit a short guidelines document describing:
    • Main and corner cases (with examples)
    • How ambiguous or damaged text is handled
    • Consistency measures across annotators
    • Ethical and historical considerations
    • Training/support procedures

The more robust and clearly documented the transcription guidelines, the higher the score on the guidelines criterion.

Subtask 2: Systems Track (Automatic Manuscript OCR/HTR)

Objective

The goal of this subtask is to develop automatic systems that transcribe Arabic manuscript page images into machine-readable text. Participants may use purely visual models, sequence models, or multimodal approaches, and may fine-tune on the provided training data or explore zero- and few-shot strategies.

Dataset

For Subtask 2, we provide pre-split datasets sampled from the Omar Al-Saleh manuscript collection:

  • Training set: ~15,962 line images with gold transcriptions for model training
  • Development set: ~1,774 line images with gold transcriptions for local validation
  • Test set: ~2,095 line images (transcriptions held out for evaluation)

Each split contains:

  • images/ folder with cropped line images (.jpg format)
  • annotations.csv file with columns:
    • image: Image filename (e.g., 1b363a3319eb499aab16e5460d6115ed-0004-22.jpg)
    • text: Gold transcription (provided for training/dev, empty for test submissions)

Additionally, the full unannotated dataset with complete page images will be available, allowing participants to leverage full page context or train on additional data.

Data will be shared with participants via email links (Dropbox, Google Drive, or S3). Alternative distribution options include gated HuggingFace repositories or CodaBench itself (once a participant is approved).
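As an illustration of how a split can be consumed for training, the sketch below wraps it in a PyTorch Dataset. This is an assumption about one possible setup, not part of the official release; any framework may be used.

    import csv
    from pathlib import Path

    from PIL import Image
    from torch.utils.data import Dataset

    class LineOcrDataset(Dataset):
        """Pairs each cropped line image with its gold transcription."""

        def __init__(self, split_dir, transform=None):
            split_dir = Path(split_dir)
            self.images_dir = split_dir / "images"
            with open(split_dir / "annotations.csv", newline="", encoding="utf-8") as f:
                self.rows = list(csv.DictReader(f))
            self.transform = transform

        def __len__(self):
            return len(self.rows)

        def __getitem__(self, idx):
            row = self.rows[idx]
            image = Image.open(self.images_dir / row["image"]).convert("RGB")
            if self.transform is not None:
                image = self.transform(image)
            return image, row["text"]

    train_set = LineOcrDataset("train")   # placeholder path to the training split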

Submission Format

Participants should submit a CSV file containing the automatically generated transcriptions for the test set, with the following columns:

  • image or filename: Image filename (must match the provided test images exactly)
  • text: Model-generated transcription

The CSV file should be uploaded to CodaBench. CodaBench competition links will be shared here.
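A minimal helper for producing this file is sketched below. It assumes predictions is a dict mapping each test image filename to the model output; the helper name and the choice of the image column are illustrative.

    import csv

    def write_submission(predictions, path="submission.csv"):
        """Write a CodaBench submission CSV.

        predictions: dict mapping test image filename -> model-generated transcription.
        """
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["image", "text"])
            writer.writeheader()
            for filename, text in predictions.items():
                writer.writerow({"image": filename, "text": text})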

Evaluation Metrics

  1. Character Error Rate (CER) – Primary metric, computed as normalized edit distance between predicted and gold strings
  2. Word Error Rate (WER) – Secondary metric, capturing word-level transcription quality
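For local validation, these metrics can be approximated as in the minimal sketch below (the official CodaBench scorer may differ in normalization and tokenization details). CER operates on characters and WER on whitespace-separated tokens, both normalized by the length of the reference.

    def edit_distance(ref, hyp):
        """Levenshtein distance between two sequences (characters or tokens)."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            curr = [i]
            for j, h in enumerate(hyp, start=1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (r != h)))  # substitution
            prev = curr
        return prev[-1]

    def cer(references, hypotheses):
        return sum(edit_distance(r, h) for r, h in zip(references, hypotheses)) / \
               sum(len(r) for r in references)

    def wer(references, hypotheses):
        return sum(edit_distance(r.split(), h.split()) for r, h in zip(references, hypotheses)) / \
               sum(len(r.split()) for r in references)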

Baseline Systems

We run a zero-shot baseline with an open-source vision-language OCR model, using DeepSeek OCR and/or Qwen3 VL on line images without any fine-tuning. For each line image, we pass the cropped image to the model and record the raw text output without post-correction. We report Character Error Rate as the primary metric and Word Error Rate as the secondary metric, computed against the gold transcriptions. This baseline measures the out-of-the-box performance of newly available open-source vision models on Arabic manuscript HTR.
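A minimal sketch of such a zero-shot loop is shown below, assuming a recent version of the transformers library that provides the "image-text-to-text" pipeline; the model identifier, prompt, message format, and test-split path are placeholders and should be replaced with the actual baseline configuration (see the Baseline GitHub and HuggingFace links above).

    from pathlib import Path
    from transformers import pipeline

    # Model checkpoint and prompt are assumptions, not the official baseline configuration.
    pipe = pipeline("image-text-to-text", model="Qwen/Qwen2-VL-2B-Instruct")

    predictions = {}
    for image_path in sorted(Path("test/images").glob("*.jpg")):   # placeholder test split path
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": str(image_path)},
                {"type": "text", "text": "Transcribe the Arabic text in this line image."},
            ],
        }]
        out = pipe(text=messages, max_new_tokens=128, return_full_text=False)
        predictions[image_path.name] = out[0]["generated_text"]

The resulting predictions dict can then be written out with the write_submission helper sketched under Submission Format above.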

Guidelines for Participating Teams

  • Participants may choose to participate in one or both subtasks (Transcription Track and Systems Track)
  • All participants must register through the official website to receive access to datasets and updates
  • Upon requesting access to the data, participants must agree to submit a system description paper (4 pages) detailing their approach, methodology, data usage (including any external data), and findings
  • Submissions will be peer-reviewed, and selected papers will be published in the proceedings/venue
  • Participants are required to create an account on the selected submission platform (e.g., OpenReview) for paper submission and review processes
  • All contributed transcriptions from all participants (where licenses permit) will be published in a shared GitHub repository under an open license (e.g., CC-BY-4.0)

Important Dates

  • January 1
    Call for Participation (Subtask 1 & Subtask 2)

  • January 10
    Release of Transcription Data and Evaluation Phase Begins (Subtask 1)
    Release of Training Data (Subtask 2)

  • February 10
    Evaluation Period Begins (Subtask 2 – test set released with transcriptions withheld)

  • February 17
    Evaluation Phase Ends (Subtask 1)
    Evaluation Period Ends (Subtask 2)

  • February 21
    Release of Results (Subtask 1 & Subtask 2)

  • March 1
    System Paper Submission Deadline (Participating Teams)

  • March 15
    Notification of Acceptance of System Papers

  • March 21
    Camera-Ready Paper Deadline

Organizers

Contact Us

For inquiries, please contact us at: ar-ms@dohainstitute.edu.qa