lightning-ir

One-stop shop for running and fine-tuning transformer-based language models for retrieval

https://github.com/webis-de/lightning-ir

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.4%) to scientific vocabulary

Repository


Statistics
  • Stars: 59
  • Watchers: 16
  • Forks: 17
  • Open Issues: 11
  • Releases: 5
Created about 2 years ago · Last pushed 10 months ago

README.md

Lightning IR


Your one-stop shop for fine-tuning and running neural ranking models.


Lightning IR is a library for fine-tuning and running neural ranking models. It is built on top of PyTorch Lightning to provide a simple and flexible interface to interact with neural ranking models.

Want to:

  • fine-tune your own cross- or bi-encoder models?
  • index and search through a collection of documents with ColBERT or SPLADE?
  • re-rank documents with state-of-the-art models?

Lightning IR has you covered!

Installation

Lightning IR can be installed using pip:

pip install lightning-ir

Getting Started

See the Quickstart guide for an introduction to Lightning IR. The Documentation provides a detailed overview of the library's functionality.

The easiest way to use Lightning IR is via the CLI. It uses the PyTorch Lightning CLI and adds additional options to provide a unified interface for fine-tuning and running neural ranking models.

The behavior of the CLI can be customized using YAML configuration files. See the configs directory for several example configuration files. For example, the following command re-ranks the official TREC DL 19/20 re-ranking set with a pre-finetuned cross-encoder model. It automatically downloads the model and data, runs the re-ranking, writes the results to a TREC-style run file, and reports the nDCG@10 score.

```bash
lightning-ir re_rank \
  --config ./configs/trainer/inference.yaml \
  --config ./configs/callbacks/rank.yaml \
  --config ./configs/data/re-rank-trec-dl.yaml \
  --config ./configs/models/monoelectra.yaml
```

For more details, see the Usage section.

Usage

Command Line Interface

The CLI offers four subcommands:

```
$ lightning-ir -h
Lightning Trainer command line tool

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    fit                 Runs the full optimization routine.
    index               Index a collection of documents.
    search              Search for relevant documents.
    re_rank             Re-rank a set of retrieved documents.
```

Configuration files need to be provided to specify model, data, and fine-tuning/inference parameters. See the configs directory for examples. Four types of configuration exist:

  • trainer: Specifies the fine-tuning/inference parameters and callbacks.
  • model: Specifies the model to use and its parameters.
  • data: Specifies the dataset(s) to use and their parameters.
  • optimizer: Specifies the optimizer parameters (only needed for fine-tuning).
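
When several configuration files are passed to the CLI, their sections end up in one combined configuration. The following is a minimal sketch of that idea, assuming that files are merged key by key and that values from later files win; it is not the CLI's actual implementation, and `merge_configs` as well as the example dictionaries are hypothetical names used only for illustration.

```python
from copy import deepcopy


def merge_configs(*configs: dict) -> dict:
    """Recursively merge dictionaries; values from later configs win.

    Nested dictionaries are merged key by key, every other value
    (including lists) is replaced as a whole.
    """
    merged: dict = {}
    for config in configs:
        for key, value in config.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_configs(merged[key], value)
            else:
                merged[key] = deepcopy(value)
    return merged


# Hypothetical, already-parsed contents of three configuration files.
trainer_cfg = {"trainer": {"max_epochs": 1, "max_steps": 100000}}
model_cfg = {
    "model": {
        "class_path": "BiEncoderModule",
        "init_args": {"model_name_or_path": "bert-base-uncased"},
    }
}
override_cfg = {"trainer": {"max_steps": 1000}}

config = merge_configs(trainer_cfg, model_cfg, override_cfg)
print(config["trainer"])  # max_epochs is kept, max_steps is overridden
print(config["model"]["class_path"])
```

Keeping trainer, model, data, and optimizer settings in separate files makes it possible to swap one of them, for example the model, without touching the others.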

Example

The following example demonstrates how to fine-tune a BERT-based single-vector bi-encoder model using the official MS MARCO triples. The fine-tuned model is then used to index the MS MARCO passage collection and search for relevant passages. Finally, we show how to re-rank the retrieved passages.

Fine-tuning

To fine-tune a bi-encoder model on the MS MARCO triples dataset, use the following configuration file and command:

bi-encoder-fit.yaml

```yaml
trainer:
  callbacks:
  - class_path: ModelCheckpoint
  max_epochs: 1
  max_steps: 100000
data:
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
    train_dataset:
      class_path: TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: BiEncoderConfig
    loss_functions:
    - class_path: RankNet
optimizer:
  class_path: AdamW
  init_args:
    lr: 1e-5
```

```bash
lightning-ir fit --config bi-encoder-fit.yaml
```

The fine-tuned model is saved in the directory lightning_logs/version_X/huggingface_checkpoint/.
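
The configuration above selects RankNet as the loss function. As a rough illustration of what a RankNet-style pairwise loss computes on a (query, positive passage, negative passage) triple, here is a minimal sketch in plain Python. It shows the standard pairwise logistic formulation and is not taken from the Lightning IR source; the function names and the example scores are made up for illustration.

```python
import math


def ranknet_loss(pos_score: float, neg_score: float) -> float:
    """Pairwise logistic loss: log(1 + exp(-(pos_score - neg_score))).

    The loss is small when the positive passage is scored well above
    the negative one and grows when the order is reversed.
    """
    margin = pos_score - neg_score
    # Numerically stable form of softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def batch_ranknet_loss(score_pairs: list[tuple[float, float]]) -> float:
    """Mean pairwise loss over a batch of (positive, negative) score pairs."""
    return sum(ranknet_loss(pos, neg) for pos, neg in score_pairs) / len(score_pairs)


# Hypothetical model scores for three training triples.
pairs = [(4.2, 1.0), (0.5, 0.7), (2.0, 2.0)]
for pos, neg in pairs:
    print(f"pos={pos:.1f} neg={neg:.1f} loss={ranknet_loss(pos, neg):.4f}")
print(f"batch loss={batch_ranknet_loss(pairs):.4f}")
```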

Indexing

We now assume the model from the previous fine-tuning step was moved to the directory models/bi-encoder. To index the MS MARCO passage collection with faiss using the fine-tuned model, use the following configuration file and command:

bi-encoder-index.yaml

```yaml
trainer:
  callbacks:
  - class_path: IndexCallback
    init_args:
      index_config:
        class_path: FaissFlatIndexConfig
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 256
    inference_datasets:
    - class_path: DocDataset
      init_args:
        doc_dataset: msmarco-passage
```

```bash
lightning-ir index --config bi-encoder-index.yaml
```

The index is saved in the directory models/bi-encoder/indexes/msmarco-passage.

Searching

To search for relevant documents in the MS MARCO passage collection using the bi-encoder and index, use the following configuration file and command:

bi-encoder-search.yaml

```yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
    index_dir: models/bi-encoder/indexes/msmarco-passage
    search_config:
      class_path: FaissFlatSearchConfig
      init_args:
        k: 100
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged
```

```bash
lightning-ir search --config bi-encoder-search.yaml
```

The run files are saved as models/bi-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
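
To give an intuition for what searching a flat index involves, the sketch below scores a query vector against every document vector and keeps the k best matches. It is a toy illustration in plain Python, assuming dot-product scoring of single-vector embeddings; it does not use faiss or Lightning IR, and `flat_search` and the two-dimensional vectors are hypothetical.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(
    query_vec: list[float], doc_vecs: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Exhaustively score all documents and return the k highest-scoring ones."""
    scored = ((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    top = heapq.nlargest(k, scored)  # sorted by score, best first
    return [(doc_id, score) for score, doc_id in top]


# Toy "index" with made-up 2-dimensional document embeddings.
doc_vecs = {
    "d1": [1.0, 0.0],
    "d2": [0.5, 0.5],
    "d3": [0.0, 1.0],
}
query_vec = [1.0, 0.2]

for rank, (doc_id, score) in enumerate(flat_search(query_vec, doc_vecs, k=2), start=1):
    print(rank, doc_id, round(score, 3))
```

A flat index compares the query with every document, so it is exact but scales linearly with the collection size; the `k: 100` setting in the configuration above corresponds to the `k` argument in this sketch.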

Re-ranking

Assuming we've also fine-tuned a cross-encoder that is saved in the directory models/cross-encoder, we can re-rank the retrieved documents using the following configuration file and command:

cross-encoder-re-rank.yaml

```yaml
trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: CrossEncoderModule
  init_args:
    model_name_or_path: models/cross-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2019.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2020.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
```

```bash
lightning-ir re_rank --config cross-encoder-re-rank.yaml
```

The run files are saved as models/cross-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
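
To inspect a run file by hand, it can help to see how nDCG@10 is derived from it. The sketch below parses TREC-style run lines (`query_id Q0 doc_id rank score tag`), keeps the top documents per query, and computes nDCG@10 with linear gain. It is an illustration under stated assumptions rather than the evaluation code Lightning IR uses: the gain variant is an assumption, and the run lines and relevance judgments are invented.

```python
import math
from collections import defaultdict


def parse_run(lines: list[str], depth: int = 100) -> dict[str, list[str]]:
    """Parse TREC-style run lines into {query_id: [doc_id, ...]}.

    Documents are ordered by descending score and cut off at `depth`.
    """
    run: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _, doc_id, _, score, _ = parts
        run[query_id].append((doc_id, float(score)))
    return {
        query_id: [doc_id for doc_id, _ in sorted(docs, key=lambda d: -d[1])[:depth]]
        for query_id, docs in run.items()
    }


def ndcg_at_k(ranking: list[str], qrels: dict[str, int], k: int = 10) -> float:
    """nDCG@k with linear gain: sum(rel_i / log2(i + 1)) divided by the ideal DCG."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranking[:k], start=1)
    )
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Invented run lines and graded relevance judgments for a single query.
run_lines = [
    "q1 Q0 d2 1 9.1 demo",
    "q1 Q0 d1 2 7.3 demo",
    "q1 Q0 d3 3 1.0 demo",
]
qrels = {"q1": {"d1": 3, "d2": 0, "d3": 1}}

run = parse_run(run_lines)
for query_id, ranking in run.items():
    print(query_id, f"nDCG@10 = {ndcg_at_k(ranking, qrels[query_id]):.4f}")
```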

Owner

  • Name: Webis
  • Login: webis-de
  • Kind: organization
  • Location: Halle / Leipzig / Paderborn / Weimar

Web Technology & Information Systems Group (Webis Group)

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use Lightning IR in your research, please cite as below."
authors:
  - family-names: "Schlatt"
    given-names: "Ferdinand"
    orcid: 0000-0002-6032-909X
  - family-names: "Fröbe"
    given-names: "Maik"
    orcid: 0000-0002-1003-981X
  - family-names: "Hagen"
    given-names: "Matthias"
    orcid: 0000-0002-9733-2890
title: "Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval"
version: 0.0.5
preferred-citation:
  type: conference-paper
  title: "Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval"
  authors:
  - family-names: "Schlatt"
    given-names: "Ferdinand"
    orcid: 0000-0002-6032-909X
  - family-names: "Fröbe"
    given-names: "Maik"
    orcid: 0000-0002-1003-981X
  - family-names: "Hagen"
    given-names: "Matthias"
    orcid: 0000-0002-9733-2890
  doi: "10.1145/3701551.3704118"
  year: 2025
  collection-title: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM '25)
  conference:
    name: Eighteenth ACM International Conference on Web Search and Data Mining
    city: Hannover
    country: Germany
    date-start: 2025-03-10
    date-end: 2025-03-14

GitHub Events

Total
  • Create event: 15
  • Release event: 4
  • Issues event: 18
  • Watch event: 41
  • Delete event: 9
  • Issue comment event: 44
  • Push event: 155
  • Pull request event: 106
  • Fork event: 13
Last Year
  • Create event: 15
  • Release event: 4
  • Issues event: 18
  • Watch event: 41
  • Delete event: 9
  • Issue comment event: 44
  • Push event: 155
  • Pull request event: 106
  • Fork event: 13

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 13
  • Total pull requests: 59
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 6 hours
  • Total issue authors: 3
  • Total pull request authors: 8
  • Average comments per issue: 1.92
  • Average comments per pull request: 0.37
  • Merged pull requests: 56
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 13
  • Pull requests: 59
  • Average time to close issues: 2 months
  • Average time to close pull requests: about 6 hours
  • Issue authors: 3
  • Pull request authors: 8
  • Average comments per issue: 1.92
  • Average comments per pull request: 0.37
  • Merged pull requests: 56
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • fschlatt (13)
  • NielsRogge (1)
  • techthiyanes (1)
Pull Request Authors
  • fschlatt (86)
  • RaykKretzschmar (19)
  • dependabot[bot] (4)
  • hscells (2)
  • eltociear (2)
  • TheMrSheldon (2)
  • janheinrichmerker (2)
  • samiki-hub (2)
Top Labels
Issue Labels
enhancement (5) bug (1) documentation (1)
Pull Request Labels
dependencies (4) documentation (2) enhancement (2) python (2) github_actions (2) help wanted (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 57 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 5
  • Total maintainers: 1
pypi.org: lightning-ir


  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 57 Last month
Rankings
Dependent packages count: 10.3%
Average: 34.2%
Dependent repos count: 58.0%
Maintainers (1)
Last synced: 10 months ago

Dependencies

setup.py pypi