# Araware
Araware is a culturally aware benchmarking suite for Arabic language models, built to evaluate both core language understanding and value-sensitive behavior in Arabic contexts.
The suite combines standardized task evaluation with in-house linguistic and cultural protocols, then scores models in reproducible settings (for example: zero-shot multiple choice, controlled candidate selection, balanced stance prompts, and structured judge-based review for generated content). We plan to share more details in an upcoming announcement.
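The "controlled candidate selection" setting mentioned above can be illustrated with a minimal sketch: score every answer candidate with the model and pick the highest-scoring one, rather than letting the model free-generate. The `score` callable and `toy_score` below are hypothetical stand-ins (in practice this would be a model log-likelihood), not part of Araware itself.

```python
from typing import Callable, Sequence

def pick_choice(score: Callable[[str, str], float],
                question: str,
                choices: Sequence[str]) -> int:
    """Controlled candidate selection: return the index of the
    candidate answer that the scorer rates highest for the question."""
    return max(range(len(choices)), key=lambda i: score(question, choices[i]))

# Toy scorer standing in for a model log-likelihood:
# counts shared words between question and candidate.
def toy_score(question: str, candidate: str) -> float:
    return float(len(set(question.split()) & set(candidate.split())))

idx = pick_choice(toy_score,
                  "capital of France",
                  ["Berlin city", "Paris capital France", "Rome"])
```

Because the model never emits free text in this setting, the evaluation is deterministic and directly reproducible across runs.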
## Benchmark Areas
- ArabicMMLU: zero-shot multiple-choice evaluation across broad academic and professional subjects (example domains: history, law, medicine, and economics).
- Morphosyntactic Analysis: tests Arabic morphology and syntax understanding for target words in context (for example: identifying the correct grammatical function or inflectional feature in a sentence).
- Word Sense Disambiguation: measures whether models choose the correct contextual meaning of polysemous Arabic words (for example: selecting the right dictionary sense of a word that has multiple meanings).
- Named Entity Recognition: evaluates recognition of entities in Arabic text as part of our linguistic suite (example entity types: person, organization, and location).
- World Values Alignment: maps model responses along established cultural value dimensions (for example: positioning responses on traditional/secular and survival/self-expression axes).
- Direct Bias (Polarization) Benchmark: compares model agreement across carefully balanced statements that represent opposing viewpoints (for instance, paired perspectives related to the Palestine-Israel issue) to track directional skew and response symmetry.
- Content Generation Bias Analysis: evaluates ideological lean in model-generated articles using structured judge questions (for example: generating a neutral-title article, then scoring yes/no indicators associated with opposing viewpoints).
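The directional skew tracked by the Direct Bias benchmark can be sketched as a signed comparison of agreement rates over balanced statement pairs. The function name and input shape below are illustrative assumptions, not the suite's actual API.

```python
def directional_skew(pair_agreement: dict) -> float:
    """Mean signed difference in agreement rates across opposing
    statement pairs. `pair_agreement` maps a (statement_a, statement_b)
    pair to (rate_a, rate_b), each in [0, 1]. A result of 0.0 indicates
    symmetric responses; the sign shows which side the model leans toward."""
    diffs = [rate_a - rate_b for rate_a, rate_b in pair_agreement.values()]
    return sum(diffs) / len(diffs)

skew = directional_skew({
    ("statement A1", "statement B1"): (0.9, 0.5),  # leans toward side A
    ("statement A2", "statement B2"): (0.6, 0.6),  # symmetric
})
```

Averaging over many balanced pairs separates a consistent directional lean from noise on any single statement.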
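For the judge-based review of generated articles, the yes/no indicators can be aggregated into a single lean score. This is a minimal sketch under assumed conventions (two opposing viewpoints labeled 'A' and 'B', one boolean judge answer per indicator); the names are hypothetical.

```python
def ideological_lean(indicators: list) -> float:
    """Aggregate judge answers into a lean score in [-1, 1].
    `indicators` is a list of (side, answered_yes) tuples, where
    `side` in {'A', 'B'} marks which viewpoint a yes-answer supports.
    Positive values lean toward 'A', negative toward 'B', 0.0 is balanced."""
    yes_a = sum(1 for side, yes in indicators if yes and side == 'A')
    yes_b = sum(1 for side, yes in indicators if yes and side == 'B')
    total = yes_a + yes_b
    return 0.0 if total == 0 else (yes_a - yes_b) / total

lean = ideological_lean([('A', True), ('A', True), ('B', True), ('B', False)])
```

Keeping the judge questions binary and pre-registered per viewpoint makes the aggregate score auditable: each contribution traces back to one explicit yes/no decision.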