# Araware
Araware is a culturally aware benchmarking suite for Arabic language models, built to evaluate both core language understanding and value-sensitive behavior in Arabic contexts.
The suite combines standardized task evaluation with in-house linguistic and cultural protocols, then scores models in reproducible settings (for example: zero-shot multiple choice, controlled candidate selection, balanced stance prompts, and structured judge-based review for generated content). We plan to share more details in an upcoming announcement.
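The "controlled candidate selection" setting mentioned above can be illustrated with a minimal sketch: score every answer candidate with the model and pick the highest-scoring one, rather than letting the model free-generate. The `score` callable and `toy_score` below are hypothetical stand-ins (in practice this would be a model log-likelihood), not part of Araware itself.

```python
from typing import Callable, Sequence

def pick_choice(score: Callable[[str, str], float],
                question: str,
                choices: Sequence[str]) -> int:
    """Controlled candidate selection: return the index of the
    candidate answer that the scorer rates highest for the question."""
    return max(range(len(choices)), key=lambda i: score(question, choices[i]))

# Toy scorer standing in for a model log-likelihood:
# counts shared words between question and candidate.
def toy_score(question: str, candidate: str) -> float:
    return float(len(set(question.split()) & set(candidate.split())))

idx = pick_choice(toy_score,
                  "capital of France",
                  ["Berlin city", "Paris capital France", "Rome"])
```

Because the model never emits free text in this setting, the evaluation is deterministic and directly reproducible across runs.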
## Benchmark Areas
- ArabicMMLU: zero-shot multiple-choice evaluation across broad academic and professional subjects (example domains: history, law, medicine, and economics).
- Morphosyntactic Analysis: tests Arabic morphology and syntax understanding for target words in context (for example: identifying the correct grammatical function or inflectional feature in a sentence).
- Word Sense Disambiguation: measures whether models choose the correct contextual meaning of polysemous Arabic words (for example: selecting the right dictionary sense of a word that has multiple meanings).
- Named Entity Recognition: evaluates recognition of entities in Arabic text as part of our linguistic suite (example entity types: person, organization, and location).
- World Values Alignment: maps model responses along established cultural value dimensions (for example: positioning responses on traditional/secular and survival/self-expression axes).
- Direct Bias (Polarization) Benchmark: compares model agreement across carefully balanced statements that represent opposing viewpoints (for instance, paired perspectives related to the Palestine-Israel issue) to track directional skew and response symmetry.
- Content Generation Bias Analysis: evaluates ideological lean in model-generated articles using structured judge questions (for example: generating a neutral-title article, then scoring yes/no indicators associated with opposing viewpoints).
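The directional skew tracked by the Direct Bias benchmark can be sketched as a signed comparison of agreement rates over balanced statement pairs. The function name and input shape below are illustrative assumptions, not the suite's actual API.

```python
def directional_skew(pair_agreement: dict) -> float:
    """Mean signed difference in agreement rates across opposing
    statement pairs. `pair_agreement` maps a (statement_a, statement_b)
    pair to (rate_a, rate_b), each in [0, 1]. A result of 0.0 indicates
    symmetric responses; the sign shows which side the model leans toward."""
    diffs = [rate_a - rate_b for rate_a, rate_b in pair_agreement.values()]
    return sum(diffs) / len(diffs)

skew = directional_skew({
    ("statement A1", "statement B1"): (0.9, 0.5),  # leans toward side A
    ("statement A2", "statement B2"): (0.6, 0.6),  # symmetric
})
```

Averaging over many balanced pairs separates a consistent directional lean from noise on any single statement.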
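For the judge-based review of generated articles, the yes/no indicators can be aggregated into a single lean score. This is a minimal sketch under assumed conventions (two opposing viewpoints labeled 'A' and 'B', one boolean judge answer per indicator); the names are hypothetical.

```python
def ideological_lean(indicators: list) -> float:
    """Aggregate judge answers into a lean score in [-1, 1].
    `indicators` is a list of (side, answered_yes) tuples, where
    `side` in {'A', 'B'} marks which viewpoint a yes-answer supports.
    Positive values lean toward 'A', negative toward 'B', 0.0 is balanced."""
    yes_a = sum(1 for side, yes in indicators if yes and side == 'A')
    yes_b = sum(1 for side, yes in indicators if yes and side == 'B')
    total = yes_a + yes_b
    return 0.0 if total == 0 else (yes_a - yes_b) / total

lean = ideological_lean([('A', True), ('A', True), ('B', True), ('B', False)])
```

Keeping the judge questions binary and pre-registered per viewpoint makes the aggregate score auditable: each contribution traces back to one explicit yes/no decision.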