REAL-MM-RAG
A Real-World Multi-Modal Retrieval Benchmark
¹IBM Research Israel  ²Weizmann Institute of Science
Proposed Real-MM-RAG Benchmark

Abstract
Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks, as currently designed, do not fully capture real-world retrieval challenges. We introduce REAL-MM-RAG, an automatically generated benchmark built around four properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) realistic RAG-style queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems, while also providing training data and models that address current limitations.
Benchmark Construction Pipeline

Manually creating high-quality benchmarks is time-consuming and error-prone. To address this, we introduce an automated pipeline for generating and verifying queries tailored for Retrieval-Augmented Generation (RAG) evaluation. Our benchmark focuses on long documents with multiple pages from the same sub-domain, ensuring realistic retrieval challenges. The process begins with query generation using a Vision-Language Model (VLM) to create natural user-like questions. These queries are then verified by a Large Language Model (LLM) to filter out those that are too broad, overly specific, or reference exact page locations. To enhance robustness, queries undergo multi-level rephrasing to reduce reliance on exact wording while preserving meaning. Finally, a false negative verification step ensures each query is correctly linked to its relevant pages, preventing errors and filtering out ambiguous cases. The result is a high-quality dataset of document pages, queries, and corresponding answers, designed for robust RAG evaluation.
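The four pipeline stages above can be sketched as a single loop over document pages. This is a minimal illustration only: `vlm_generate_queries`, `llm_verify_query`, `llm_rephrase`, and `vlm_answers_query` are hypothetical stand-ins for the VLM/LLM calls described in the text, not part of any released API.

```python
def build_benchmark(pages, vlm_generate_queries, llm_verify_query,
                    llm_rephrase, vlm_answers_query, n_levels=3):
    """Sketch of the automated benchmark-construction pipeline.

    pages: the pages of one long, same-sub-domain document.
    The four callables are hypothetical model wrappers (assumptions,
    not a real API): a VLM that proposes queries for a page, an LLM
    filter, an LLM rephraser, and a VLM relevance checker.
    """
    entries = []
    for page in pages:
        # 1) Query generation: the VLM proposes natural, user-like questions.
        for query in vlm_generate_queries(page):
            # 2) LLM verification: drop queries that are too broad, overly
            #    specific, or that reference exact page locations.
            if not llm_verify_query(query):
                continue
            # 3) Multi-level rephrasing to reduce reliance on exact wording
            #    while preserving meaning.
            rephrasings = [llm_rephrase(query, level=lvl)
                           for lvl in range(1, n_levels + 1)]
            # 4) False-negative verification: find every page that answers
            #    the query, so labels cover all relevant pages; queries whose
            #    source page fails the check are treated as ambiguous.
            relevant = [p for p in pages if vlm_answers_query(p, query)]
            if page not in relevant:
                continue
            entries.append({"query": query,
                            "rephrasings": rephrasings,
                            "relevant_pages": relevant})
    return entries
```

The key design point is step 4: relevance labels are recomputed over all pages rather than assumed to be only the generating page, which is what keeps false-negative rates low.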
Document Retrieval Benchmarks Comparison

Existing benchmarks lack essential properties for effective multi-modal retrieval evaluation. Most rely on short documents with limited sub-domain coverage, making retrieval too easy. Our benchmark addresses this by using long documents with extensive sub-domain coverage, ensuring the presence of many similar pages within the benchmark. We further increase difficulty with multi-level query rephrasing, which prevents keyword-based retrieval and requires true semantic understanding. Unlike most existing benchmarks, which use QA-derived queries ill-suited to RAG, we ensure queries resemble real user inquiries through a dedicated generation and filtering pipeline. Finally, to combat the high false-negative rates shown to be significant in prior benchmarks, we employ a VLM-based verification process for accurate labeling. The table highlights these improvements, making our benchmark the most reliable for real-world RAG evaluation.
Performance of Different Models on Our Benchmark

We evaluate various models, including text-based and vision-based approaches, across our four benchmarks. Results, measured using NDCG@5, are reported on our final benchmark with queries rephrased at the highest level (Level 3). We also present results for our fine-tuned models trained on our proposed datasets: Rob – trained on a rephrased dataset, Tab – trained on a table-heavy dataset, and RobTab – incorporating both.
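For reference, NDCG@5 (the metric reported above) can be computed per query as follows, assuming binary relevance labels (a page either answers the query or does not):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """NDCG@k for one query with binary relevance.

    ranked_ids: page ids ordered by retrieval score, best first.
    relevant_ids: set of page ids labeled relevant for the query.
    """
    # DCG: each relevant page in the top k contributes 1 / log2(rank + 2)
    # (ranks are 0-based, so the top result is discounted by log2(2) = 1).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, page_id in enumerate(ranked_ids[:k])
        if page_id in relevant_ids
    )
    # Ideal DCG: all relevant pages (up to k) ranked at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

The benchmark score is the mean of this value over all queries; a model that always ranks the relevant page first scores 1.0.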
Table-Focused Training Improves Financial Benchmarks

Fine-tuning with our proposed table-heavy training set, combined with the ColPali training set (both in their original and rephrased versions), significantly enhances performance on financial benchmarks (results shown for rephrasing level 3).
Fine-Tuning on Rephrased Training Set

We compare the NDCG@5 scores across rephrasing levels for baseline models (ColPali and ColQwen) against our fine-tuned models (RobCol). The results demonstrate that fine-tuning with our rephrased training data significantly enhances rephrasing robustness for both ColPali and ColQwen.
BibTeX
@misc{wasserman2025realmmragrealworldmultimodalretrieval,
  title={REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark},
  author={Navve Wasserman and Roi Pony and Oshri Naparstek and Adi Raz Goldfarb and Eli Schwartz and Udi Barzelay and Leonid Karlinsky},
  year={2025},
  eprint={2502.12342},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2502.12342},
}