vidore-benchmark

Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.

https://github.com/illuin-tech/vidore-benchmark

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.5%) to scientific vocabulary

Keywords

colpali rag retrieval search vision-language-model

Keywords from Contributors

cryptocurrency cryptography jax transformer
Last synced: 6 months ago

Repository

Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.

Basic Info
Statistics
  • Stars: 223
  • Watchers: 5
  • Forks: 29
  • Open Issues: 4
  • Releases: 17
Topics
colpali rag retrieval search vision-language-model
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License Citation

README.md

Vision Document Retrieval (ViDoRe): Benchmarks 👀


[Model card] [ViDoRe Leaderboard] [Demo] [Blog Post]

Approach

The Visual Document Retrieval Benchmarks (ViDoRe v1 and v2) are introduced to evaluate the performance of document retrieval systems on visually rich documents across various tasks, domains, languages, and settings. They were used to evaluate the ColPali model, a VLM-powered retriever that efficiently retrieves documents based on their visual content and textual queries using a late-interaction mechanism.
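As a rough illustration of the late-interaction idea (not the library's actual implementation), a ColPali-style score sums, over the query tokens, each token's best match against the document's patch embeddings:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one document.

    query_emb: (num_query_tokens, dim) multi-vector query embedding.
    doc_emb:   (num_doc_patches, dim) multi-vector document embedding.
    """
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_patches)
    return sim.max(dim=1).values.sum()     # best patch per query token, then sum

# Toy example with random embeddings.
print(maxsim_score(torch.randn(16, 128), torch.randn(1024, 128)))
```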

ViDoRe Examples

⚠️ Deprecation Warning: Moving from vidore-benchmark to mteb

Since mteb now supports image-text retrieval, we recommend using mteb to evaluate your retriever on the ViDoRe benchmark. We are deprecating vidore-benchmark to facilitate maintenance and have a single source of truth for the ViDoRe benchmark.

If you want your results to appear on the ViDoRe Leaderboard, you should add them to the results GitHub project. Check the Submit your model section of the ViDoRe Leaderboard for more information.

New Evaluation Process

Follow the instructions to set up mteb here. Then you have two options.

Option 1: CLI

```bash
mteb run -b "ViDoRe(v1)" -m "vidore/colqwen2.5-v0.2"
mteb run -b "ViDoRe(v2)" -m "vidore/colqwen2.5-v0.2"
```

Option 2: Python Script

```python
import mteb
from mteb.model_meta import ModelMeta
from mteb.models.colqwen_models import ColQwen2_5Wrapper

# === Configuration ===
MODEL_NAME = "johndoe/my_colqwen2.5"
BENCHMARKS = ["ViDoRe(v1)", "ViDoRe(v2)"]

# === Model Metadata ===
custom_model_meta = ModelMeta(
    loader=ColQwen2_5Wrapper,
    name=MODEL_NAME,
    modalities=["image", "text"],
    framework="Colpali",
    similarity_fn_name="max_sim",
    # Optional metadata (fill in if available, else None)
    # ...
)

# === Load Model ===
custom_model = custom_model_meta.load_model(MODEL_NAME)

# === Load Tasks ===
tasks = mteb.get_benchmarks(names=BENCHMARKS)
evaluator = mteb.MTEB(tasks=tasks)

# === Run Evaluation ===
results = evaluator.run(custom_model)
```

For custom models, you should implement your own wrapper. Check the ColPaliEngineWrapper for an example.

[Deprecated] Usage

This package comes with a Python API and a CLI to evaluate your own retriever on the ViDoRe benchmark. Both are compatible with Python >= 3.9.

CLI mode

```bash
pip install vidore-benchmark
```

To keep this package lightweight, only the essential dependencies are installed by default. Thus, you must specify the dependency groups for the models you want to evaluate with the CLI (see the list in pyproject.toml). For instance, if you are going to evaluate the ColVision models (e.g. ColPali, ColQwen2, ColSmol, ...), you should run:

bash pip install "vidore-benchmark[colpali-engine]"

[!WARNING] If possible, do not pip install colpali-engine directly in the environment dedicated to the CLI.

In particular, make sure not to install both vidore-benchmark[colpali-engine] and colpali-engine[train] simultaneously, as it will lead to a circular dependency conflict.

If you want to install all the dependencies for all the models, you can run:

bash pip install "vidore-benchmark[all-retrievers]"

Note that in order to use BM25Retriever, you will need to download the nltk resources too:

bash pip install "vidore-benchmark[bm25]" python -m nltk.downloader punkt punkt_tab stopwords

Library mode

Install the base package using pip:

```bash
pip install vidore-benchmark
```

Command-line usage

Evaluate a retriever on ViDoRe

You can evaluate any off-the-shelf retriever on the ViDoRe benchmark v1. For instance, you can evaluate the ColPali model on the ViDoRe benchmark v1 to reproduce the results from our paper.

```bash
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --collection-name vidore/vidore-benchmark-667173f98e70a1c0fa4db00d \
    --dataset-format qa \
    --split test
```

If you want to evaluate your models on the new ViDoRe benchmark v2 collection, a harder version of the previous benchmark, you can execute the following command:

```bash
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --collection-name vidore/vidore-benchmark-v2-67ae03e3924e85b36e7f53b0 \
    --dataset-format beir \
    --split test
```

Alternatively, you can evaluate your model on a single dataset. If your retriever uses visual embeddings, you can use any dataset path from the ViDoRe Benchmark v1 collection or the ViDoRe Benchmark v2 collection (BEIR format instead of QA), e.g.:

```bash
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --dataset-name vidore/docvqa_test_subsampled \
    --dataset-format qa \
    --split test
```

If you want to evaluate a retriever that relies on pure-text retrieval (no visual embeddings), you should use the datasets from the ViDoRe Chunk OCR (baseline) instead:

```bash
vidore-benchmark evaluate-retriever \
    --model-class bge-m3 \
    --model-name BAAI/bge-m3 \
    --dataset-name vidore/docvqa_test_subsampled_tesseract \
    --dataset-format qa \
    --split test
```

All the above scripts will generate a JSON file in outputs/{model_id}_metrics.json. Follow the instructions on the ViDoRe Leaderboard to learn how to publish your results on the leaderboard too!
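For instance, a quick way to inspect those output files (a minimal sketch; the exact JSON schema is not documented here, so this just loads whatever it finds and prints the top-level keys):

```python
import json
from pathlib import Path

# Each run writes a file following the outputs/{model_id}_metrics.json pattern.
for path in Path("outputs").glob("*_metrics.json"):
    with path.open() as f:
        metrics = json.load(f)
    print(path.name, list(metrics)[:5])
```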

[!NOTE] The vidore-benchmark package supports two formats of datasets:

  • QA: The dataset is formatted as a question-answering task, where the queries are questions and the passages are the image pages that provide the answers.
  • BEIR: Following the BEIR paper, the dataset is split into 3 sub-datasets: corpus, queries, and qrels. The corpus contains the documents, the queries sub-dataset contains the queries, and the qrels contains the relevance scores between the queries and the documents (a short loading sketch follows the table below).

In the first iteration of the ViDoRe benchmark, we arbitrarily chose to deduplicate the queries for the QA datasets. While this made sense given our data generation process, it wasn't suited to the ViDoRe benchmark v2, which aims to be broader and multilingual. We will release the ViDoRe benchmark v2 soon.

| Dataset | Dataset format | Deduplicate queries |
|---------|----------------|---------------------|
| ViDoRe benchmark v1 | QA | ✅ |
| ViDoRe benchmark v2 (harder/multilingual) | BEIR | ❌ |
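To make the distinction concrete, here is a minimal sketch of how each format is typically loaded with the datasets library; the dataset identifiers are simply reused from the Python examples further below.

```python
from datasets import load_dataset

# QA format: a single split in which each row pairs a query with its answer page.
qa_ds = load_dataset("vidore/tabfquad_test_subsampled", split="test")

# BEIR format: three sub-datasets (corpus, queries, qrels) loaded separately.
beir_repo = "vidore/synthetic_rse_restaurant_filtered_v1.0"
beir_ds = {
    "corpus": load_dataset(beir_repo, name="corpus", split="test"),
    "queries": load_dataset(beir_repo, name="queries", split="test"),
    "qrels": load_dataset(beir_repo, name="qrels", split="test"),
}
```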

Documentation

To have more control over the evaluation process (e.g. the batch size used at inference), read the CLI documentation using:

```bash
vidore-benchmark evaluate-retriever --help
```

In particular, feel free to play with the --batch-query, --batch-passage, --batch-score, and --num-workers inputs to speed up the evaluation process.

Python usage

Quickstart example

While the CLI can be used to evaluate a fixed list of models, you can also use the Python API to evaluate your own retriever. Here is an example of how to evaluate the ColPali model on the ViDoRe benchmark. Note that your processor must implement process_images and process_queries methods, similar to the ColVision processors.

```python
import torch
from colpali_engine.models import ColIdefics3, ColIdefics3Processor
from datasets import load_dataset
from tqdm import tqdm

from vidore_benchmark.evaluation.vidore_evaluators import ViDoReEvaluatorQA, ViDoReEvaluatorBEIR
from vidore_benchmark.retrievers import VisionRetriever
from vidore_benchmark.utils.data_utils import get_datasets_from_collection

model_name = "vidore/colSmol-256M"
processor = ColIdefics3Processor.from_pretrained(model_name)
model = ColIdefics3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()

# Get retriever instance
vision_retriever = VisionRetriever(model=model, processor=processor)

# Evaluate on a single BEIR format dataset (e.g. one of the ViDoRe benchmark v2 datasets)
vidore_evaluator_beir = ViDoReEvaluatorBEIR(vision_retriever)
ds = {
    "corpus": load_dataset("vidore/synthetic_rse_restaurant_filtered_v1.0", name="corpus", split="test"),
    "queries": load_dataset("vidore/synthetic_rse_restaurant_filtered_v1.0", name="queries", split="test"),
    "qrels": load_dataset("vidore/synthetic_rse_restaurant_filtered_v1.0", name="qrels", split="test"),
}
metrics_dataset_beir = vidore_evaluator_beir.evaluate_dataset(
    ds=ds,
    batch_query=4,
    batch_passage=4,
)
print(metrics_dataset_beir)

# Evaluate on a single QA format dataset
vidore_evaluator_qa = ViDoReEvaluatorQA(vision_retriever)
ds = load_dataset("vidore/tabfquad_test_subsampled", split="test")
metrics_dataset_qa = vidore_evaluator_qa.evaluate_dataset(
    ds=ds,
    batch_query=4,
    batch_passage=4,
)
print(metrics_dataset_qa)

# Evaluate on a local directory or a HuggingFace collection
dataset_names = get_datasets_from_collection("vidore/vidore-benchmark-667173f98e70a1c0fa4db00d")
metrics_collection = {}
for dataset_name in tqdm(dataset_names, desc="Evaluating dataset(s)"):
    metrics_collection[dataset_name] = vidore_evaluator_qa.evaluate_dataset(
        ds=load_dataset(dataset_name, split="test"),
        batch_query=4,
        batch_passage=4,
    )
print(metrics_collection)
```
Implement your own retriever

If you want to evaluate your own retriever with the CLI, you should clone the repository and add your own class that inherits from BaseVisionRetriever. You can find the detailed instructions here; a rough sketch of such a class is shown below.
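As a rough sketch only (the actual abstract interface of BaseVisionRetriever is defined in the repository; the method names and the use_visual_embedding flag used here are assumptions modeled on the VisionRetriever usage above), a custom retriever could look like this:

```python
from typing import Any, List

import torch

# Assumed import path; check the repository for the actual location.
from vidore_benchmark.retrievers import BaseVisionRetriever


class MyRetriever(BaseVisionRetriever):
    """Skeleton of a custom retriever; method names are assumptions."""

    def __init__(self):
        # use_visual_embedding flag assumed from the repository's retriever interface.
        super().__init__(use_visual_embedding=True)

    def forward_queries(self, queries: List[str], batch_size: int, **kwargs: Any) -> List[torch.Tensor]:
        # Embed the text queries with your own model here.
        raise NotImplementedError

    def forward_passages(self, passages: List[Any], batch_size: int, **kwargs: Any) -> List[torch.Tensor]:
        # Embed the page images (or OCR'd text) with your own model here.
        raise NotImplementedError

    def get_scores(self, query_embeddings, passage_embeddings, batch_size=None) -> torch.Tensor:
        # Return a (num_queries, num_passages) similarity matrix.
        raise NotImplementedError
```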

Compare retrievers using the EvalManager

To easily process, visualize and compare the evaluation metrics of multiple retrievers, you can use the EvalManager class. Assume you have a list of previously generated JSON metric files, e.g.:

```bash
data/metrics/
├── bisiglip.json
└── colpali.json
```

The data is stored in eval_manager.data as a multi-column DataFrame. Use the get_df_for_metric, get_df_for_dataset, and get_df_for_model methods to get the subset of the data you are interested in. For instance:

```python
from vidore_benchmark.evaluation import EvalManager

eval_manager = EvalManager.from_dir("data/metrics/")
df = eval_manager.get_df_for_metric("ndcg_at_5")
```
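The other two accessors named above can be used the same way; the dataset and model identifiers below are placeholders, and the expected argument format is an assumption:

```python
from vidore_benchmark.evaluation import EvalManager

eval_manager = EvalManager.from_dir("data/metrics/")
# Identifiers below are placeholders.
df_for_dataset = eval_manager.get_df_for_dataset("vidore/docvqa_test_subsampled")
df_for_model = eval_manager.get_df_for_model("vidore/colpali-v1.3")
```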

Citation

ColPali: Efficient Document Retrieval with Vision Language Models

Authors: Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution)

```latex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}

@misc{macé2025vidorebenchmarkv2raising,
  title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval},
  author={Quentin Macé and António Loison and Manuel Faysse},
  year={2025},
  eprint={2505.17166},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.17166},
}
```

If you want to reproduce the results from the ColPali paper, please read the REPRODUCIBILITY.md file for more information.

Owner

  • Name: ILLUIN Technology
  • Login: illuin-tech
  • Kind: organization
  • Email: contact@illuin.tech
  • Location: Paris, France

Illuin Technology is a team of makers motivated by the challenges of AI and the new user experiences this intelligence makes possible.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Faysse"
  given-names: "Manuel"
  email: "manuel.faysse@illuin.tech"
- family-names: "Sibille"
  given-names: "Hugues"
  email: "hugues.sibille@illuin.tech"
- family-names: "Wu"
  given-names: "Tony"
  email: "tony.wu@illuin.tech"
title: "Vision Document Retrieval (ViDoRe): Benchmark"
date-released: 2024-06-26
url: "https://github.com/illuin-tech/vidore-benchmark"
preferred-citation:
  type: article
  authors:
  - family-names: "Faysse"
    given-names: "Manuel"
  - family-names: "Sibille"
    given-names: "Hugues"
  - family-names: "Wu"
    given-names: "Tony"
  - family-names: "Omrani"
    given-names: "Bilel"
  - family-names: "Viaud"
    given-names: "Gautier"
  - family-names: "Hudelot"
    given-names: "Céline"
  - family-names: "Colombo"
    given-names: "Pierre"
  doi: "arXiv.2407.01449"
  month: 6
  title: "ColPali: Efficient Document Retrieval with Vision Language Models"
  year: 2024
  url: "https://arxiv.org/abs/2407.01449"

GitHub Events

Total
  • Create event: 50
  • Issues event: 26
  • Release event: 2
  • Watch event: 96
  • Delete event: 37
  • Member event: 2
  • Issue comment event: 33
  • Push event: 216
  • Pull request review comment event: 31
  • Pull request review event: 43
  • Pull request event: 97
  • Fork event: 19
Last Year
  • Create event: 50
  • Issues event: 26
  • Release event: 2
  • Watch event: 96
  • Delete event: 37
  • Member event: 2
  • Issue comment event: 33
  • Push event: 216
  • Pull request review comment event: 31
  • Pull request review event: 43
  • Pull request event: 97
  • Fork event: 19

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 266
  • Total Committers: 8
  • Avg Commits per committer: 33.25
  • Development Distribution Score (DDS): 0.15
Past Year
  • Commits: 266
  • Committers: 8
  • Avg Commits per committer: 33.25
  • Development Distribution Score (DDS): 0.15
Top Committers
Name Email Commits
Tony Wu 2****1 226
Hugues h****e@s****h 24
Manuel Faysse 4****y 7
QuentinJGMace 9****e 3
Hugues Sibille h****e@i****h 3
antonioloison 4****n 1
Daniel v****a 1
ByeongkiJeong j****8@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 19
  • Total pull requests: 116
  • Average time to close issues: 20 days
  • Average time to close pull requests: 6 days
  • Total issue authors: 16
  • Total pull request authors: 10
  • Average comments per issue: 1.37
  • Average comments per pull request: 0.16
  • Merged pull requests: 99
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 19
  • Pull requests: 95
  • Average time to close issues: 20 days
  • Average time to close pull requests: 7 days
  • Issue authors: 16
  • Pull request authors: 10
  • Average comments per issue: 1.37
  • Average comments per pull request: 0.19
  • Merged pull requests: 79
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • roipony (2)
  • yydxlv (2)
  • ManuelFay (2)
  • StupidBuluchacha (1)
  • XuHuang441 (1)
  • puar-playground (1)
  • tattrongvu (1)
  • jbellis (1)
  • bumblyowl (1)
  • canqin001 (1)
  • 921574602 (1)
  • SeanLee97 (1)
  • tonywu71 (1)
  • ashokrajab (1)
  • SpaceLearner (1)
Pull Request Authors
  • tonywu71 (120)
  • ManuelFay (16)
  • QuentinJGMace (8)
  • ByeongkiJeong (4)
  • github-bowen (2)
  • tattrongvu (2)
  • velaia (2)
  • antonioloison (2)
  • paultltc (2)
  • guenthermi (1)
Top Labels
Issue Labels
bug (2) enhancement (1)
Pull Request Labels
enhancement (29) bug (13) documentation (10)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,557 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 17
  • Total maintainers: 1
pypi.org: vidore-benchmark

Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.

  • Versions: 17
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,557 Last month
Rankings
Dependent packages count: 10.7%
Average: 35.5%
Dependent repos count: 60.4%
Maintainers (1)
Last synced: about 1 year ago

Dependencies

.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
  • GPUtil >=1.4.0,<2.0.0
  • datasets >=2.15.0,<3.0.0
  • einops >=0.8.0,<1.0.0
  • mteb >=1.12.47,<2.0.0
  • numpy >=1.21.2,<2.0.0
  • pdf2image >=1.17.0,<2.0.0
  • peft >=0.11.1,<1.0.0
  • pillow >=9.2.0,<11.0.0
  • python-dotenv >=1.0.1,<2.0.0
  • sentencepiece >=0.2.0,<1.0.0
  • torch >=2.0.0,<3.0.0
  • transformers >=4.41.1,<5.0.0
  • typer >=0.12.3,<1.0.0
requirements.txt pypi
  • FlagEmbedding ==1.2.10
  • Jinja2 ==3.1.4
  • MarkupSafe ==2.1.5
  • PyYAML ==6.0.1
  • Pygments ==2.18.0
  • accelerate ==0.30.1
  • aiofiles ==23.2.1
  • aiohttp ==3.9.5
  • aiosignal ==1.3.1
  • altair ==5.3.0
  • annotated-types ==0.7.0
  • anyio ==4.4.0
  • asttokens ==2.4.1
  • attrs ==23.2.0
  • black ==24.4.2
  • certifi ==2024.6.2
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • comm ==0.2.2
  • configue ==5.0.0
  • contourpy ==1.2.1
  • cycler ==0.12.1
  • datasets ==2.19.1
  • debugpy ==1.8.1
  • decorator ==5.1.1
  • dill ==0.3.8
  • diskcache ==5.6.3
  • dnspython ==2.6.1
  • einops ==0.8.0
  • email_validator ==2.2.0
  • eval_type_backport ==0.2.0
  • executing ==2.0.1
  • fastapi ==0.111.0
  • fastapi-cli ==0.0.4
  • ffmpy ==0.3.2
  • filelock ==3.15.3
  • fonttools ==4.53.0
  • frozenlist ==1.4.1
  • fsspec ==2024.3.1
  • gradio ==4.36.1
  • gradio_client ==1.0.1
  • h11 ==0.14.0
  • httpcore ==1.0.5
  • httptools ==0.6.1
  • httpx ==0.27.0
  • huggingface-hub ==0.23.4
  • idna ==3.7
  • importlib_resources ==6.4.0
  • iniconfig ==2.0.0
  • ipykernel ==6.29.4
  • ipython ==8.25.0
  • jedi ==0.19.1
  • joblib ==1.4.2
  • jsonlines ==4.0.0
  • jsonschema ==4.22.0
  • jsonschema-specifications ==2023.12.1
  • jupyter_client ==8.6.2
  • jupyter_core ==5.7.2
  • kiwisolver ==1.4.5
  • markdown-it-py ==3.0.0
  • matplotlib ==3.9.0
  • matplotlib-inline ==0.1.7
  • mdurl ==0.1.2
  • mpmath ==1.3.0
  • mteb ==1.12.47
  • multidict ==6.0.5
  • multiprocess ==0.70.16
  • mypy-extensions ==1.0.0
  • nest-asyncio ==1.6.0
  • networkx ==3.3
  • nltk ==3.8.1
  • numpy ==1.26.4
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu12 ==8.9.2.26
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-nccl-cu12 ==2.20.5
  • nvidia-nvjitlink-cu12 ==12.5.40
  • nvidia-nvtx-cu12 ==12.1.105
  • orjson ==3.10.5
  • packaging ==24.1
  • pandas ==2.2.2
  • parso ==0.8.4
  • pathspec ==0.12.1
  • pdf2image ==1.17.0
  • peft ==0.11.1
  • pexpect ==4.9.0
  • pillow ==10.3.0
  • platformdirs ==4.2.2
  • pluggy ==1.5.0
  • polars ==0.20.31
  • prompt_toolkit ==3.0.47
  • protobuf ==5.27.1
  • psutil ==6.0.0
  • ptyprocess ==0.7.0
  • pure-eval ==0.2.2
  • pyarrow ==16.1.0
  • pyarrow-hotfix ==0.6
  • pydantic ==2.7.4
  • pydantic_core ==2.18.4
  • pydub ==0.25.1
  • pyparsing ==3.1.2
  • pytesseract ==0.3.10
  • pytest ==8.2.2
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.0.1
  • python-multipart ==0.0.9
  • pytrec-eval-terrier ==0.5.6
  • pytz ==2024.1
  • pyzmq ==26.0.3
  • rank-bm25 ==0.2.2
  • referencing ==0.35.1
  • regex ==2024.5.15
  • requests ==2.32.3
  • rich ==13.7.1
  • rpds-py ==0.18.1
  • ruff ==0.4.10
  • safetensors ==0.4.3
  • scikit-learn ==1.5.0
  • scipy ==1.13.1
  • seaborn ==0.13.2
  • semantic-version ==2.10.0
  • sentence-transformers ==3.0.1
  • sentencepiece ==0.2.0
  • shellingham ==1.5.4
  • six ==1.16.0
  • sniffio ==1.3.1
  • stack-data ==0.6.3
  • starlette ==0.37.2
  • sympy ==1.12.1
  • threadpoolctl ==3.5.0
  • timm ==1.0.7
  • tokenizers ==0.19.1
  • tomlkit ==0.12.0
  • toolz ==0.12.1
  • torch ==2.3.1
  • torchvision ==0.18.1
  • tornado ==6.4.1
  • tqdm ==4.66.4
  • traitlets ==5.14.3
  • transformers ==4.41.2
  • triton ==2.3.1
  • typer ==0.12.3
  • typing_extensions ==4.12.2
  • tzdata ==2024.1
  • ujson ==5.10.0
  • urllib3 ==2.2.2
  • uvicorn ==0.30.1
  • uvloop ==0.19.0
  • vidore-benchmark ==1.0.0
  • watchfiles ==0.22.0
  • wcwidth ==0.2.13
  • websockets ==11.0.3
  • xxhash ==3.4.1
  • yarl ==1.9.4