adaptive-relevance-margin-loss

Code repository for the paper "Learning Effective Representations for Retrieval using Self-Distillation with Adaptive Relevance Margins".

https://github.com/webis-de/adaptive-relevance-margin-loss

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Code repository for the paper "Learning Effective Representations for Retrieval using Self-Distillation with Adaptive Relevance Margins".

Basic Info
  • Host: GitHub
  • Owner: webis-de
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 26.4 KB
Statistics
  • Stars: 1
  • Watchers: 14
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Learning Effective Representations for Retrieval using Self-Distillation with Adaptive Relevance Margins


Overview

This is the code repository for the paper "Learning Effective Representations for Retrieval using Self-Distillation with Adaptive Relevance Margins".

Representation-based retrieval models, so-called bi-encoders, estimate the relevance of a document to a query by calculating the similarity of their respective embeddings. Current state-of-the-art bi-encoders are trained using an expensive training regime involving knowledge distillation from a teacher model and extensive batch-sampling techniques. Instead of relying on a teacher model, we contribute a novel parameter-free loss function for self-supervision that exploits the pre-trained text similarity capabilities of the encoder model as a training signal, eliminating the need for batch sampling by performing implicit hard negative mining. We explore the capabilities of our proposed approach through extensive ablation studies, demonstrating that self-distillation can match the effectiveness of teacher-distillation approaches while requiring only a fraction of the data and compute.
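The idea in the abstract — using the encoder's own pre-trained similarity scores as the training signal, with in-batch negatives and per-pair adaptive margins — can be sketched as follows. This is a minimal, illustrative implementation assuming cosine similarity and a frozen copy of the initial checkpoint as the self-distillation target; all function and variable names are hypothetical, and the repository's actual loss may differ in detail.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(q_emb, d_emb, q_emb_init, d_emb_init):
    """Sketch of a self-distilled adaptive relevance-margin loss.

    q_emb, d_emb: query/document embeddings from the model being trained.
    q_emb_init, d_emb_init: embeddings from the frozen pre-trained
    checkpoint, whose similarities serve as the supervision signal.
    All shapes are (batch, dim); row i of d_emb is the paired (positive)
    document for query i, and the other rows act as in-batch negatives.
    """
    # In-batch similarity matrix: entry (i, j) = sim(query i, document j).
    sim = F.cosine_similarity(q_emb.unsqueeze(1), d_emb.unsqueeze(0), dim=-1)
    with torch.no_grad():
        target = F.cosine_similarity(
            q_emb_init.unsqueeze(1), d_emb_init.unsqueeze(0), dim=-1)
        # Adaptive margin: the pre-trained model's score gap between the
        # paired document (diagonal) and each in-batch negative.
        margin = target.diagonal().unsqueeze(1) - target
    pos = sim.diagonal().unsqueeze(1)  # similarity to the paired document
    # Hinge loss: penalize negatives whose current score gap falls short
    # of the pre-trained margin. The diagonal contributes zero (margin 0).
    # Hard negatives (small pre-trained gap, high current similarity)
    # dominate the loss, giving implicit hard-negative mining.
    return F.relu(margin - (pos - sim)).mean()
```

Because the margins come from the frozen checkpoint rather than a separate teacher model or tuned hyperparameter, the loss is parameter-free in the sense described above.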

Supplementary data (TREC-format run files for all final trained models) is hosted on Zenodo.

Project Organization

```
├── Dockerfile         <- Dockerfile with all dependencies for reproducible execution
├── LICENSE            <- License file
├── Makefile           <- Makefile with commands to reproduce artifacts (data + models)
├── README.md          <- The top-level README for the project
├── configs            <- Configuration files for model and sweep parameters
├── data               <- Data folder; will be populated by data scripts
├── main.py            <- Main Lightning CLI entrypoint
├── requirements.txt   <- Dependencies
├── scripts            <- Scripts to automate single tasks (data parsing, sweep agents, ...)
├── setup.py           <- Makes project pip installable (pip install -e .) so src can be imported
└── src                <- Model source code
```

Replication

Data preparation, model training, and evaluation are replicable with make targets:

```
$ make
Available rules:

requirements   Install Python Dependencies
data-train     Download and preprocess train dataset
data-eval      Download and preprocess eval datasets
fit            Run the training process
eval           Run eval process
clean          Delete all compiled Python files
```

These can be run in the given order to fully replicate the experimental pipeline.
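Assuming the targets behave as named above, a full replication run might look like this (an illustrative sketch; exact target behavior is defined in the repository's Makefile):

```shell
# Install dependencies, build datasets, then train and evaluate.
make requirements
make data-train
make data-eval
make fit
make eval
# Optionally clean up compiled Python files afterwards.
make clean
```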

Each training run from the paper can be executed with its given config file in `configs` with the following command:

```sh
python3 main.py fit -c <path-to-config-file>
```

Citation

If you use this code in your research, please cite:

```bib
@InProceedings{gienapp:2025b,
  address   = {New York},
  author    = {Lukas Gienapp and Niklas Deckers and Martin Potthast and Harrisen Scells},
  booktitle = {15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025)},
  doi       = {10.1145/3731120.3744594},
  editor    = {Hamed Zamani and Laura Dietz and Benjamin Piwowarski and Sebastian Bruch},
  isbn      = {979-8-4007-1861-8/2025/07},
  month     = jul,
  pages     = {275--285},
  publisher = {ACM},
  site      = {Padua, Italy},
  title     = {{Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins}},
  year      = 2025
}
```


Owner

  • Name: Webis
  • Login: webis-de
  • Kind: organization
  • Location: Halle / Leipzig / Paderborn / Weimar

Web Technology & Information Systems Group (Webis Group)

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Learning Effective Representations for Retrieval using
  Self-Distillation with Adaptive Relevance Margins
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Lukas
    family-names: Gienapp
    affiliation: Leipzig University & ScaDS.AI
    orcid: 'https://orcid.org/0000-0001-5707-3751'
  - given-names: Niklas
    family-names: Deckers
    affiliation: University of Kassel & ScaDS.AI & hessian.AI
    orcid: 'https://orcid.org/0000-0001-6803-1223'
  - given-names: Harrisen
    family-names: Scells
    affiliation: Leipzig University
    orcid: 'https://orcid.org/0000-0001-9578-7157'
  - given-names: Martin
    family-names: Potthast
    affiliation: University of Kassel & ScaDS.AI & hessian.AI
    orcid: 'https://orcid.org/0000-0003-2451-0665'
abstract: >
  Representation-based retrieval models, so-called
  bi-encoders, estimate the relevance of a document to a
  query by calculating the similarity of their respective
  embeddings. Current state-of-the-art bi-encoders are
  trained using an expensive training regime involving
  knowledge distillation from a teacher model and extensive
  batch-sampling techniques. Instead of relying on a teacher
  model, we contribute a novel parameter-free loss function
  for self-supervision that exploits the pre-trained text
  similarity capabilities of the encoder model as a training
  signal, eliminating the need for batch sampling by
  performing implicit hard negative mining. We explore the
  capabilities of our proposed approach through extensive
  ablation studies, demonstrating that self-distillation can
  match the effectiveness of teacher-distillation approaches
  while requiring only a fraction of the data and compute.
license: MIT

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
Last Year
  • Watch event: 1
  • Push event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 14
  • Average time to close issues: N/A
  • Average time to close pull requests: 26 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.07
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 14
Past Year
  • Issues: 0
  • Pull requests: 14
  • Average time to close issues: N/A
  • Average time to close pull requests: 26 days
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.07
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 14
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (10)
Top Labels
Issue Labels
Pull Request Labels
dependencies (10)

Dependencies

Dockerfile docker
  • pytorch/pytorch 2.1.0-cuda11.8-cudnn8-runtime build
requirements.txt pypi
  • GitPython ==3.1.41
  • Jinja2 ==3.1.3
  • MarkupSafe ==2.1.3
  • PyJWT ==2.8.0
  • PyYAML ==6.0.1
  • Pygments ==2.17.2
  • Send2Trash ==1.8.2
  • aiobotocore ==2.7.0
  • aiohttp ==3.9.1
  • aioitertools ==0.11.0
  • aiosignal ==1.3.1
  • annotated-types ==0.6.0
  • antlr4-python3-runtime ==4.9.3
  • anyio ==4.2.0
  • appdirs ==1.4.4
  • arrow ==1.3.0
  • async-timeout ==4.0.3
  • attrs ==23.2.0
  • backoff ==2.2.1
  • beautifulsoup4 ==4.12.2
  • bitsandbytes ==0.42.0
  • blessed ==1.20.0
  • boto3 ==1.28.64
  • botocore ==1.31.64
  • certifi ==2023.11.17
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • contourpy ==1.2.0
  • croniter ==1.4.1
  • cycler ==0.12.1
  • datasets ==2.16.1
  • dateutils ==0.6.12
  • deepdiff ==6.7.1
  • dill ==0.3.7
  • docker ==6.1.3
  • docker-pycreds ==0.4.0
  • docstring-parser ==0.15
  • editor ==1.6.5
  • exceptiongroup ==1.2.0
  • faiss-gpu ==1.7.2
  • fastapi ==0.109.0
  • filelock ==3.13.1
  • fonttools ==4.47.2
  • frozenlist ==1.4.1
  • fsspec ==2023.10.0
  • gitdb ==4.0.11
  • h11 ==0.14.0
  • huggingface-hub ==0.20.2
  • hydra-core ==1.3.2
  • idna ==3.6
  • importlib-resources ==6.1.1
  • inquirer ==3.2.1
  • jmespath ==1.0.1
  • jsonargparse ==4.27.1
  • kiwisolver ==1.4.5
  • lightning ==2.1.3
  • lightning-api-access ==0.0.5
  • lightning-cloud ==0.5.57
  • lightning-fabric ==2.1.3
  • lightning-utilities ==0.10.0
  • markdown-it-py ==3.0.0
  • matplotlib ==3.8.2
  • mdurl ==0.1.2
  • mpmath ==1.3.0
  • multidict ==6.0.4
  • multiprocess ==0.70.15
  • networkx ==3.2.1
  • numpy ==1.26.3
  • nvidia-cublas-cu12 ==12.1.3.1
  • nvidia-cuda-cupti-cu12 ==12.1.105
  • nvidia-cuda-nvrtc-cu12 ==12.1.105
  • nvidia-cuda-runtime-cu12 ==12.1.105
  • nvidia-cudnn-cu12 ==8.9.2.26
  • nvidia-cufft-cu12 ==11.0.2.54
  • nvidia-curand-cu12 ==10.3.2.106
  • nvidia-cusolver-cu12 ==11.4.5.107
  • nvidia-cusparse-cu12 ==12.1.0.106
  • nvidia-nccl-cu12 ==2.18.1
  • nvidia-nvjitlink-cu12 ==12.3.101
  • nvidia-nvtx-cu12 ==12.1.105
  • omegaconf ==2.3.0
  • ordered-set ==4.1.0
  • packaging ==23.2
  • pandas ==2.1.4
  • pillow ==10.2.0
  • protobuf ==4.25.2
  • psutil ==5.9.7
  • pyarrow ==14.0.2
  • pyarrow-hotfix ==0.6
  • pydantic ==2.5.3
  • pydantic_core ==2.14.6
  • pyparsing ==3.1.1
  • python-dateutil ==2.8.2
  • python-git ==2018.2.1
  • python-multipart ==0.0.6
  • pytorch-lightning ==2.1.3
  • pytz ==2023.3.post1
  • readchar ==4.0.5
  • redis ==5.0.1
  • regex ==2023.12.25
  • requests ==2.31.0
  • rich ==13.7.0
  • runs ==1.2.0
  • s3fs ==2023.10.0
  • s3transfer ==0.7.0
  • safetensors ==0.4.1
  • scipy ==1.11.4
  • sentry-sdk ==1.39.2
  • setproctitle ==1.3.3
  • six ==1.16.0
  • smmap ==5.0.1
  • sniffio ==1.3.0
  • soupsieve ==2.5
  • starlette ==0.35.1
  • sympy ==1.12
  • tensorboardX ==2.6.2.2
  • tokenizers ==0.15.0
  • torch ==2.1.2
  • torchaudio ==2.1.2
  • torchmetrics ==1.3.0
  • torchvision ==0.16.2
  • tqdm ==4.66.1
  • traitlets ==5.14.1
  • transformers ==4.36.2
  • triton ==2.1.0
  • types-python-dateutil ==2.8.19.20240106
  • typeshed-client ==2.4.0
  • typing_extensions ==4.9.0
  • tzdata ==2023.4
  • urllib3 ==2.0.7
  • uvicorn ==0.26.0
  • wandb ==0.16.2
  • wcwidth ==0.2.13
  • websocket-client ==1.7.0
  • websockets ==11.0.3
  • wrapt ==1.16.0
  • xmod ==1.8.1
  • xxhash ==3.4.1
  • yarl ==1.9.4
setup.py pypi