Arabic RAG Leaderboard: A Comprehensive Framework for Evaluating Arabic Language Retrieval Systems

Community Article · Published February 9, 2025

Introduction

In the Arabic-speaking world, where information authenticity is paramount, the need for reliable information retrieval systems is critical. Retrieval-Augmented Generation (RAG) is transforming how we interact with large language models, and dynamic leaderboards such as MTEB and the Open LLM Leaderboard have emerged as essential benchmarking tools. However, Arabic models remain underrepresented in these evaluations, creating a critical gap in our ability to assess Arabic-specific RAG systems.

Our leaderboard project addresses this gap by evaluating both retrieval and re-ranking components, with plans to expand evaluations to additional components soon, aiming to become the ultimate hub for all retrieval needs. To ensure fairness and prevent overfitting, datasets remain private during the evaluation cycles.

Embedding Models is All You Need

Embedding models are the backbone of modern retrieval systems, enabling various applications beyond traditional search. These models transform text into dense vector representations, making it easier to find relevant information efficiently.
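As a minimal sketch of this idea, the snippet below embeds an Arabic query and two candidate passages and compares them by cosine similarity. It assumes the sentence-transformers library; the multilingual model named here is only an illustrative choice, not one the leaderboard endorses.

```python
# Minimal sketch: turn Arabic text into dense vectors and compare them.
# The model below is an illustrative multilingual choice, not a leaderboard recommendation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

query = "ما هي عاصمة مصر؟"  # "What is the capital of Egypt?"
documents = [
    "القاهرة هي عاصمة جمهورية مصر العربية.",
    "تشتهر المغرب بمدنها التاريخية مثل فاس ومراكش.",
]

query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(documents, convert_to_tensor=True)

# Cosine similarity scores: higher means more semantically related.
print(util.cos_sim(query_vec, doc_vecs))
```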

Versatile Applications of Embeddings

  1. Semantic Search: Embeddings allow search engines to retrieve relevant documents based on meaning rather than exact keyword matches.
  2. Recommendation Systems: Many platforms use embeddings to recommend content by analyzing user preferences and behavior.
  3. Clustering & Classification: Businesses leverage embeddings to categorize large-scale textual data for sentiment analysis, customer support automation, and fraud detection.
  4. Cross-Language Retrieval: Multilingual embeddings bridge the gap between different languages, enabling seamless cross-language search and translation.
  5. Knowledge Graph Augmentation: Embeddings enhance knowledge graphs by linking related concepts, improving contextual understanding in AI-driven applications.

The Future of Embedding Models

With continuous improvements in transformer-based architectures, embeddings are becoming even more efficient and adaptable. The Arabic RAG Leaderboard aims to highlight the best embedding models tailored for Arabic language retrieval, showcasing their impact across diverse real-world scenarios.

The Overall Leaderboard Framework

Purpose and Scope

In today's rapidly evolving landscape of Arabic NLP, building robust RAG pipelines hinges on the careful evaluation of both retrieval components and re-ranker models. Our leaderboard framework addresses this dual challenge by implementing a two-stream evaluation strategy. On one hand, it assesses retrieval performance across diverse datasets and task-specific domains; on the other, it evaluates the fine-grained ranking capabilities of re-rankers using established metrics (see the Metrics section).

The beauty of our approach lies in its unification: we aggregate results derived from different evaluation methodologies into one transparent, cohesive view. This holistic framework not only ensures that each component is rigorously assessed but also provides an end-to-end performance indicator for RAG systems—guiding practitioners toward optimal model selection for real-world applications.

Figure 1: Mind Map Overview of the Arabic RAG Leaderboard Framework

Key Contributions

Addressing a Critical Need

  • Our leaderboard fills a longstanding gap in Arabic NLP by offering a comprehensive benchmark for both retrieval and re-ranker components.
  • It provides transparent, multi-metric evaluations that empower model developers and end users to make informed decisions for building production-ready RAG pipelines.

Dual Evaluation Streams

  1. Retrieval Evaluation

    • Focuses on dataset diversity and task-specific capabilities by evaluating performance on a wide range of datasets—from general web search queries to domain-specific retrieval tasks.
  2. Re-Ranker Evaluation

    • Emphasizes fine-grained ranking quality using richly annotated datasets with graded relevance labels (for NDCG) and binary labels (for MRR and MAP).

Framework Methodology

Privacy for Fairness

  • To prevent overfitting and ensure unbiased evaluations, our datasets remain private during testing cycles.

Extendable to Reach Every Domain

  • Our framework aims to be adaptable across the various domains where RAG is applied, so that everyone can find the best model for their needs.

Dual Philosophies in RAG Assessment

The evaluation methodology for RAG systems presents two distinct philosophical approaches, each offering unique perspectives on assessment criteria:

  1. Metric-Centered Evaluation: This approach emphasizes numerical scores derived from evaluation metrics such as accuracy, recall, and precision. It is particularly useful for those who want a high-level overview of their model's performance, allowing for easy comparison across different implementations.

  2. Dataset-Centered Evaluation: This approach focuses on the specific datasets used in evaluation and the domains they represent. It is ideal for practitioners who build RAG systems tailored for particular use cases and want to ensure that their models perform well within those domains.

Given the complementary strengths of both approaches, our framework incorporates a balanced integration of these methodologies to provide comprehensive evaluation coverage.

Retrieval Evaluation

In a Retrieval-Augmented Generation (RAG) pipeline, the retrieval component is responsible for scanning vast corpora to extract candidate documents or contexts that may contain the information needed to answer a query. This first stage is crucial, as it determines the pool of candidates that the re-ranker will later refine when the pipeline uses a two-stage design.
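The sketch below illustrates this first stage under simple assumptions: the corpus has already been embedded, and cosine similarity over normalized vectors selects the top-k candidates. The function and variable names are illustrative only.

```python
# Minimal sketch of first-stage retrieval: rank a pre-embedded corpus against a
# query vector and keep the top-k candidates for an optional re-ranking stage.
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 5):
    """Return indices and scores of the k most similar corpus vectors."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    top_idx = np.argsort(-scores)[:k]
    return top_idx, scores[top_idx]

# Example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))
query = rng.normal(size=384)
indices, sims = retrieve_top_k(query, corpus, k=5)
print(indices, sims)
```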

Dataset Design for Retrieval Evaluation

The retrieval evaluation follows the Dataset-Centered Evaluation philosophy: each dataset in the leaderboard reflects a specific domain or task, ensuring that the evaluation is tailored to the needs of the community.

For now, we provide one dataset, the Web Search Dataset, which simulates a general web search scenario. It tests the retrieval component's ability to extract the correct contexts from web-style content, covering a wide range of topics and query types. The dataset was built from the ground up rather than derived from any prior work, and it is kept private to prevent overfitting.

Metrics for Retrieval Evaluation

  • MRR: Measures the average reciprocal rank of the first relevant document for each query.
  • nDCG(k=None): Evaluates ranking quality based on the graded relevance of the retrieved documents without a fixed cutoff.
  • Recall (k=5): Assesses the proportion of relevant documents retrieved within the top 5 results.

Re-Ranking Evaluation

The re-ranker evaluation focuses on the second stage of the RAG pipeline, where a fine-grained ranking mechanism is applied to the candidate contexts retrieved by the initial retrieval component. This stage is crucial for ensuring that the most relevant information is presented to the generation model, enhancing the overall quality of the final output.
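A minimal sketch of this stage is shown below, assuming a cross-encoder style re-ranker scored with the sentence-transformers CrossEncoder class; the model name is a placeholder, not a model the leaderboard prescribes.

```python
# Minimal sketch of second-stage re-ranking with a cross-encoder.
# The model name is a placeholder; substitute any Arabic-capable re-ranker.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("your-arabic-reranker-model")  # placeholder name

query = "ما هي عاصمة مصر؟"
candidates = [
    "القاهرة هي عاصمة جمهورية مصر العربية.",
    "تقع الإسكندرية على ساحل البحر المتوسط.",
    "تشتهر المغرب بمدنها التاريخية.",
]

# Score each (query, candidate) pair jointly and sort by relevance.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
```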

Dataset Design and Metrics for Reranker Evaluation

Our re-ranker evaluation dataset is built as a hybrid resource that leverages both real user queries and synthetically generated contexts. This design enables a targeted evaluation of ranking quality while preserving the natural realism that only genuine human queries can provide. Specifically, real queries are sourced from high-quality, human-annotated datasets such as:

  • TyDi QA: "TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages" (Clark et al., 2020)
  • MKQA: "MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering" (Longpre et al., 2021)

In parallel, synthetic contexts are generated using multiple large language models acting as agent writers. This controlled, reproducible approach lets us evaluate the re-ranker's ability to order candidate contexts precisely: the authenticity of real human queries is preserved, while the synthetic contexts allow us to tailor and stress-test the ranking capabilities.

Figure 2: Pie chart illustrating the re-ranker dataset design

Query Split

  • 20% Queries with short contexts:

    • These are queries with concise contexts—situations where the relevant information is brief. This split allows us to test the re-ranker's performance in scenarios where answers must be identified from limited context.
  • 80% Queries with long contexts:

    • These consist of detailed, context-rich interactions where candidate documents are more verbose. This majority split ensures that we evaluate the re-ranker's ability to sift through extensive information and correctly prioritize relevant content.

Metric-Specific Labeling

  • NDCG@10: Uses graded relevance labels where 3 indicates highly relevant, 2 moderately relevant, 1 marginally relevant, and 0 irrelevant.
  • MRR@10: Utilizes binary labels—only one candidate per query is marked "1."
  • MAP: Uses binary labels but allows multiple correct contexts among the top 10.
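To make the three labeling schemes above concrete, here is an illustrative record layout (not the leaderboard's actual schema) showing how graded and binary labels can coexist for one query's candidates.

```python
# Illustrative only: one query's candidates labeled for all three metrics.
example_record = {
    "query": "ما هي عاصمة مصر؟",
    "candidates": ["doc_01", "doc_02", "doc_03"],  # truncated for brevity
    "ndcg_labels": [3, 1, 0],   # graded relevance: 3 high ... 0 irrelevant
    "mrr_labels":  [1, 0, 0],   # binary, exactly one positive per query
    "map_labels":  [1, 1, 0],   # binary, multiple positives allowed
}
```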

Evaluation Metrics Explained

Mean Reciprocal Rank (MRR)

  • Definition: The average of the reciprocal ranks for the first relevant candidate across queries.
  • Formula: MRR = (1 / N) * Σ (1 / rank_i) where N is the number of queries, and rank_i is the rank position of the first correct candidate for query i.
  • Calculation Example: For a query where the correct candidate appears at rank 4, the reciprocal rank is 1/4 = 0.25.
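A minimal sketch of this computation over binary labels, with one list of ranked labels per query:

```python
# MRR over binary labels; each inner list is one query's ranked candidates
# (1 = relevant, 0 = not relevant).
def mean_reciprocal_rank(ranked_labels):
    total = 0.0
    for labels in ranked_labels:
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                total += 1.0 / rank
                break  # only the first relevant candidate counts
    return total / len(ranked_labels)

# Correct candidate at rank 4 -> reciprocal rank 1/4 = 0.25.
print(mean_reciprocal_rank([[0, 0, 0, 1, 0]]))  # 0.25
```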

Normalized Discounted Cumulative Gain (nDCG)

  • Definition: A metric that measures ranking quality by comparing the Discounted Cumulative Gain (DCG) of the re-ranked list to that of an ideal ranking using graded relevance scores.
  • Formula: DCG = Σ_i=1^k (2^rel_i - 1) / log_2(i + 1) and NDCG = DCG / IDCG, where k is the number of top results (here, 10), rel_i is the relevance score at position i, and IDCG is the maximum possible DCG for that query.
  • Calculation Example: For a given query with graded relevance scores yielding a DCG that is 68% of the ideal DCG, NDCG@10 = 0.68.
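A minimal sketch of NDCG@k using the gain formula above:

```python
# NDCG@k with graded relevance labels and the (2^rel - 1) gain.
import math

def ndcg_at_k(relevances, k=10):
    def dcg(rels):
        return sum((2**rel - 1) / math.log2(i + 1)
                   for i, rel in enumerate(rels[:k], start=1))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded labels of a re-ranked list (3 = highly relevant ... 0 = irrelevant).
print(ndcg_at_k([1, 3, 0, 2, 0], k=10))
```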

Mean Average Precision (MAP)

  • Definition: The mean of the average precision values computed over the top 10 candidates for each query, considering that there may be multiple correct responses.
  • Formula: For each query, AP = (Σ_i=1^k (Precision@i × rel_i)) / (Number of relevant documents), and MAP = (1 / N) * Σ_i=1^N AP_i, where k is 10, N is the number of queries, and rel_i is 1 if the document at rank i is relevant and 0 otherwise.
  • Calculation Example: For a query with relevant documents at positions 2, 5, and 7, AP is computed as (Precision@2 + Precision@5 + Precision@7) / 3.
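A minimal sketch of AP and MAP over binary labels, assuming all of a query's relevant documents appear in its label list:

```python
# AP over the top-k candidates, then MAP across queries (binary labels).
def average_precision(labels, k=10):
    hits, precision_sum = 0, 0.0
    for i, label in enumerate(labels[:k], start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / i  # Precision@i at each relevant position
    # Assumes all relevant documents for the query appear in `labels`.
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_labels, k=10):
    return sum(average_precision(lbls, k) for lbls in all_labels) / len(all_labels)

# Relevant documents at positions 2, 5, and 7 -> (1/2 + 2/5 + 3/7) / 3.
print(average_precision([0, 1, 0, 0, 1, 0, 1, 0, 0, 0]))
```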

Recall@k

  • Definition: The proportion of relevant documents retrieved within the top k results.
  • Formula: Recall@k = (Number of relevant documents in top k) / (Total number of relevant documents).
  • Calculation Example: If a query has 5 relevant documents, and the top 10 results contain 3 of them, Recall@10 = 3 / 5 = 0.6.
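A minimal sketch, given the binary labels of the ranked results and the query's total number of relevant documents:

```python
# Recall@k: share of all relevant documents that appear in the top k results.
def recall_at_k(labels, total_relevant, k=10):
    return sum(labels[:k]) / total_relevant if total_relevant else 0.0

# 3 of the query's 5 relevant documents appear in the top 10 -> 0.6.
print(recall_at_k([1, 0, 1, 0, 0, 1, 0, 0, 0, 0], total_relevant=5))
```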

Conclusion and Future Directions

Summary of Contributions

Our work provides several key contributions to the field of Arabic language RAG systems:

  1. Comprehensive Evaluation Framework

    • The first dedicated benchmark for Arabic RAG systems
    • Dual-stream evaluation approach covering both retrieval and re-ranking
    • Privacy-preserving dataset architecture preventing overfitting
    • Domain-specific categorization enabling targeted model selection
  2. Innovative Dataset Design

    • Hybrid architecture combining authentic user queries with synthetic contexts
    • Carefully curated query splits (20% short, 80% long) reflecting real-world scenarios
    • Integration of high-quality sources like TyDi QA and MKQA
    • Multi-layered labeling system supporting diverse evaluation metrics
  3. Robust Metric Implementation

    • Multi-dimensional scoring system incorporating NDCG@10, MRR@10, and MAP
    • Granular relevance assessments using both graded and binary labels
    • Transparent evaluation methodology enabling reproducible results
    • Adaptable framework supporting future metric additions

Future Work and Call to Action

Planned Developments

  • Integrate additional datasets into the retrieval category of the leaderboard
  • Extend the leaderboard with more RAG components
  • Automate evaluation cycles for faster feedback loops

Call to Action

We invite the community to:

  • Submit models for evaluation
  • Share any feedback on potential improvements or bugs
  • Collaborate on expanding the leaderboard to cover more RAG components
