How Good are LLM-based Rerankers? Accepted at EMNLP Findings 2025

https://github.com/datascienceuibk/llm-reranking-generalization-study

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

dataset leaderboard llm reranker reranking retrieval
Last synced: 5 months ago

Repository

How Good are LLM-based Rerankers? Accepted at EMNLP Findings 2025

Basic Info
  • Host: GitHub
  • Owner: DataScienceUIBK
  • License: apache-2.0
  • Default Branch: main
  • Homepage:
  • Size: 706 KB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
dataset leaderboard llm reranker reranking retrieval
Created 6 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models 🔍

🎉 News · 📖 Introduction · 📄 FutureQueryEval Dataset · 🚀 Quick Start · 📊 Results · 🎈 Citation

🎉 News

  • [2025-08-22] 🎯 FutureQueryEval Dataset Released! - The first temporal IR benchmark with queries from April 2025+
  • [2025-08-22] 🔧 Comprehensive evaluation framework released - 22 reranking methods, 40 variants tested
  • [2025-08-22] 📊 Integrated with RankArena leaderboard. You can view and interact with RankArena through this link
  • [2025-08-20] 📝 Paper accepted at EMNLP Findings 2025

📖 Introduction

We present the most comprehensive empirical study of reranking methods to date, systematically evaluating 22 state-of-the-art approaches across 40 variants. Our key contribution is FutureQueryEval - the first temporal benchmark designed to test reranker generalization on truly novel queries unseen during LLM pretraining.

Performance Overview

Performance comparison across pointwise, pairwise, and listwise reranking paradigms

Key Findings 🔍

  • Temporal Performance Gap: 5-15% performance drop on novel queries compared to standard benchmarks
  • Listwise Superiority: Best generalization to unseen content (8% avg. degradation vs 12-15% for others)
  • Efficiency Trade-offs: Comprehensive runtime analysis reveals optimal speed-accuracy combinations
  • Domain Vulnerabilities: All methods struggle with argumentative and informal content

📄 FutureQueryEval Dataset

Overview

FutureQueryEval is a novel IR benchmark comprising 148 queries with 2,938 query-document pairs across 7 topical categories, designed to evaluate reranker performance on temporal novelty.

🎯 Why FutureQueryEval?

  • Zero Contamination: All queries refer to events after April 2025
  • Human Annotated: 4 expert annotators with quality control
  • Diverse Domains: Technology, Sports, Politics, Science, Health, Business, Entertainment
  • Real Events: Based on actual news and developments, not synthetic data

📊 Dataset Statistics

| Metric | Value |
|--------|-------|
| Total Queries | 148 |
| Total Documents | 2,787 |
| Query-Document Pairs | 2,938 |
| Avg. Relevant Docs per Query | 6.54 |
| Languages | English |
| License | Apache-2.0 |
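The headline statistics can be derived from a qrels-style relevance mapping. A minimal sketch with toy data (the actual release format has not been published yet, so the in-memory dict layout here is an assumption):

```python
# Sketch: computing dataset statistics from a qrels-style mapping
# of query -> {document -> relevance label}. The toy data below is
# illustrative only; the real FutureQueryEval files are not yet released.

qrels = {
    "q1": {"d1": 1, "d2": 0, "d3": 1},
    "q2": {"d2": 1, "d4": 1, "d5": 1},
}

total_queries = len(qrels)
pairs = sum(len(docs) for docs in qrels.values())          # query-document pairs
documents = len({d for docs in qrels.values() for d in docs})  # unique documents
avg_relevant = sum(
    sum(1 for rel in docs.values() if rel > 0) for docs in qrels.values()
) / total_queries

print(total_queries, documents, pairs, avg_relevant)  # 2 5 6 2.5
```

On the released data, the same computation should reproduce the table above (148 queries, 2,787 documents, 2,938 pairs, 6.54 relevant docs per query on average).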

🌍 Category Distribution

  • Technology: 25.0% (37 queries)
  • Sports: 20.9% (31 queries)
  • Science & Environment: 13.5% (20 queries)
  • Business & Finance: 12.8% (19 queries)
  • Health & Medicine: 10.8% (16 queries)
  • World News & Politics: 9.5% (14 queries)
  • Entertainment & Culture: 7.4% (11 queries)

📝 Example Queries

```
🌍 World News & Politics: "What specific actions has Egypt taken to support injured Palestinians from Gaza, as highlighted during the visit of Presidents El-Sisi and Macron to Al-Arish General Hospital?"

⚽ Sports: "Which teams qualified for the 2025 UEFA European Championship playoffs in June 2025?"

💻 Technology: "What are the key features of Apple's new Vision Pro 2 announced at WWDC 2025?"
```

Data Collection Methodology

  1. Source Selection: Major news outlets, official sites, sports organizations
  2. Temporal Filtering: Events after April 2025 only
  3. Query Creation: Manual generation by domain experts
  4. Novelty Validation: Tested against GPT-4 knowledge cutoff
  5. Quality Control: Multi-annotator review with senior oversight
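Step 2, temporal filtering, can be sketched as a simple date cutoff over candidate event records (the record fields and cutoff convention here are assumptions; the paper's pipeline may differ):

```python
# Sketch of the temporal-filtering step: keep only events dated
# April 2025 or later, so the resulting queries cannot have appeared
# in LLM pretraining data. Event records here are hypothetical.
from datetime import date

CUTOFF = date(2025, 4, 1)

events = [
    {"title": "WWDC 2025 keynote", "date": date(2025, 6, 9)},
    {"title": "2024 election recap", "date": date(2024, 11, 6)},
]

novel_events = [e for e in events if e["date"] >= CUTOFF]
print([e["title"] for e in novel_events])  # ['WWDC 2025 keynote']
```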

🚀 Quick Start

The code and dataset will be released soon.

📊 Evaluation Results

Top Performers on FutureQueryEval

| Method Category | Best Model | NDCG@10 | Runtime (s) |
|----------------|------------|---------|-------------|
| Listwise | Zephyr-7B | 62.65 | 1,240 |
| Pointwise | MonoT5-3B | 60.75 | 486 |
| Setwise | Flan-T5-XL | 56.57 | 892 |
| Pairwise | EchoRank-XL | 54.97 | 2,158 |
| Tournament | TourRank-GPT4o | 62.02 | 3,420 |

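NDCG@10, the metric reported in the table above, discounts relevance gains by rank position and normalizes by the ideal ordering. A minimal self-contained sketch:

```python
# Sketch: NDCG@10 for a single ranked list of relevance labels.
import math

def dcg(rels):
    # Discounted cumulative gain: relevance discounted by log2(rank + 1).
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

# A reranker that places both relevant docs first scores a perfect 1.0.
print(ndcg_at_k([1, 1, 0, 0]))            # 1.0
print(round(ndcg_at_k([0, 1, 0, 1]), 3))  # 0.651
```

In practice the paper's per-query scores would be averaged over all 148 FutureQueryEval queries; libraries such as pytrec_eval compute the same quantity from qrels and run files.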
Performance Insights

  • 🏆 Best Overall: Zephyr-7B (62.65 NDCG@10)
  • ⚡ Best Efficiency: FlashRank-MiniLM (55.43 NDCG@10, 195s)
  • 🎯 Best Balance: MonoT5-3B (60.75 NDCG@10, 486s)

Efficiency Analysis

Runtime vs. Performance trade-offs across reranking methods

🔧 Supported Methods

We evaluate 22 reranking approaches across multiple paradigms:

Pointwise Methods

  • MonoT5, RankT5, InRanker, TWOLAR
  • FlashRank, Transformer Rankers
  • UPR, MonoBERT, ColBERT

Listwise Methods

  • RankGPT, ListT5, Zephyr, Vicuna
  • LiT5-Distill, InContext Rerankers

Pairwise Methods

  • PRP (Pairwise Ranking Prompting)
  • EchoRank

Advanced Methods

  • Setwise (Flan-T5 variants)
  • TourRank (Tournament-based)
  • RankLLaMA (Task-specific fine-tuned)
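The paradigms above differ mainly in how many documents the model considers at once: pointwise methods score each (query, document) pair independently, while listwise methods rank candidates jointly. A minimal sketch of the generic pointwise loop, with a toy lexical scorer standing in for a neural model (real systems such as MonoT5 derive this score from a cross-encoder):

```python
# Sketch: the generic pointwise reranking loop. `score` is a toy
# stand-in for a model like MonoT5 that scores each (query, document)
# pair independently; listwise methods rank all candidates jointly.

def score(query: str, doc: str) -> float:
    # Hypothetical lexical-overlap scorer, for illustration only.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def pointwise_rerank(query: str, docs: list[str]) -> list[str]:
    # Score every candidate independently, then sort by score.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)

docs = ["UEFA playoff teams June 2025", "Apple Vision Pro 2 features"]
print(pointwise_rerank("Which teams qualified for the UEFA playoffs?", docs)[0])
# UEFA playoff teams June 2025
```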

🔄 Dataset Updates

FutureQueryEval will be updated every 6 months with new queries about recent events to maintain temporal novelty. Subscribe to releases for notifications!

Upcoming Updates

  • Version 1.1 (December 2025): +100 queries from July-September 2025 events
  • Version 1.2 (June 2026): +100 queries from October 2025-March 2026 events

📋 Leaderboard

Submit your reranking method results to appear on our leaderboard! See SUBMISSION.md for guidelines.

Current standings available at: RankArena

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for:

  • Adding new reranking methods
  • Improving evaluation metrics
  • Dataset quality improvements
  • Bug fixes and optimizations

🎈 Citation

If you use FutureQueryEval or our evaluation framework, please cite:

```bibtex
@misc{abdallah2025good,
  title={How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models},
  author={Abdelrahman Abdallah and Bhawna Piryani and Jamshid Mozafari and Mohammed Ali and Adam Jatowt},
  year={2025},
  eprint={2508.16757},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

📞 Contact


⭐ Star this repo if you find it helpful! ⭐

📧 Questions? Open an issue or contact the authors

Owner

  • Name: DataScienceUIBK
  • Login: DataScienceUIBK
  • Kind: organization

GitHub Events

Total
  • Watch event: 6
  • Member event: 1
  • Push event: 10
  • Create event: 2
Last Year
  • Watch event: 6
  • Member event: 1
  • Push event: 10
  • Create event: 2