gpl

Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577

https://github.com/ukplab/gpl

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Keywords

bert domain-adaptation information-retrieval nlp transformers vector-search
Last synced: 6 months ago

Repository

Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval" https://arxiv.org/abs/2112.07577

Basic Info
  • Host: GitHub
  • Owner: UKPLab
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 402 KB
Statistics
  • Stars: 336
  • Watchers: 5
  • Forks: 37
  • Open Issues: 27
  • Releases: 6
Topics
bert domain-adaptation information-retrieval nlp transformers vector-search
Created about 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme License

README.md

Generative Pseudo Labeling (GPL)

GPL is an unsupervised domain adaptation method for training dense retrievers. It is based on query generation and pseudo labeling with powerful cross-encoders. To train a domain-adapted model, it needs only the unlabeled target corpus and can achieve significant improvement over zero-shot models.

For more information, check out our publication: GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval (NAACL 2022).

For reproduction, please refer to this snapshot branch.

Installation

One can install GPL either via pip:

```bash
pip install gpl
```

or via git clone:

```bash
git clone https://github.com/UKPLab/gpl.git && cd gpl
pip install -e .
```

Please also make sure that the installed PyTorch version matches your CUDA version.
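As a quick sanity check (not part of the GPL package, just a minimal sketch), you can verify which CUDA build of PyTorch is active before training:

```python
# Minimal sanity check (not part of GPL): confirm that PyTorch sees your GPU
# and that its CUDA build matches the toolkit installed on the machine.
import torch

print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)      # None for CPU-only builds
print("GPU available:", torch.cuda.is_available())
```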

Usage

GPL accepts data in the BeIR format. For example, we can download the FiQA dataset hosted by BeIR:

```bash
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip
unzip fiqa.zip
head -n 2 fiqa/corpus.jsonl  # One can check the data format this way. GPL actually only needs this `corpus.jsonl` as data input for training.
```

Then we can either use the python -m function to run GPL training directly:

```bash
export dataset="fiqa"
python -m gpl.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "distilbert-base-uncased" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen" \
    --do_evaluation
    # --use_amp  # Use this for efficient training if the machine supports AMP
```
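For orientation, each line of a BeIR corpus.jsonl is a JSON object with _id, title, and text fields. A minimal sketch for inspecting the downloaded file (illustrative only; the field names follow the BeIR format, not GPL-specific code):

```python
# Minimal sketch (not part of GPL): peek at the BeIR-format corpus file.
# Each line is a JSON object; BeIR corpora use the fields "_id", "title" and "text".
import json

with open("fiqa/corpus.jsonl") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(doc["_id"], doc.get("title", ""), doc["text"][:80])
        if i == 1:  # only show the first two documents, like `head -n 2`
            break
```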

One can run python -m gpl.train --help for information on all the arguments.

To reproduce the experiments in the paper, set base_ckpt to "GPL/msmarco-distilbert-margin-mse" (https://huggingface.co/GPL/msmarco-distilbert-margin-mse)

Or import GPL's training method in a Python script:

```python
import gpl

dataset = 'fiqa'
gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    base_ckpt="distilbert-base-uncased",
    # base_ckpt='GPL/msmarco-distilbert-margin-mse',  # The starting checkpoint of the experiments in the paper
    gpl_score_function="dot",
    # Note that GPL uses MarginMSE loss, which works with dot-product
    batch_size_gpl=32,
    gpl_steps=140000,
    new_size=-1,
    # Resize the corpus to `new_size` (|corpus|) if needed. When set to None (by default), the |corpus| will be the full size. When set to -1, the |corpus| will be set automatically: if QPP * |corpus| <= 250K, |corpus| will be the full size; else QPP will be set to 3 and |corpus| will be set to 250K / 3
    queries_per_passage=-1,
    # Number of Queries Per Passage (QPP) in the query generation step. When set to -1 (by default), the QPP will be chosen automatically: if QPP * |corpus| <= 250K, then QPP will be set to 250K / |corpus|; else QPP will be set to 3 and |corpus| will be set to 250K / 3
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    # Note that these two retriever models work with cosine-similarity
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    # This prefix will appear as part of the (folder/file) names for the query-generation results: for example, we will have "qgen-qrels/" and "qgen-queries.jsonl" by default.
    do_evaluation=True,
    # use_amp=True  # One can use this flag for enabling the efficient float16 precision
)
```

One can also refer to this toy example on Google Colab for a better understanding of how the code works.

How does GPL work?

The workflow of GPL is as follows:

  1. GPL first uses a seq2seq model (we use BeIR/query-gen-msmarco-t5-base-v1 by default) to generate queries_per_passage queries for each passage in the unlabeled corpus. The query-passage pairs are viewed as positive examples for training.
     > Result files (under path $path_to_generated_data): (1) ${qgen}-qrels/train.tsv, (2) ${qgen}-queries.jsonl and also (3) corpus.jsonl (copied from $evaluation_data/);
  2. Then, it runs negative mining with the generated queries as input on the target corpus. The mined passages will be viewed as negative examples for training. One can specify any dense retrievers (SBERT or Huggingface/transformers checkpoints; we use msmarco-distilbert-base-v3 + msmarco-MiniLM-L-6-v3 by default) or BM25 via the argument retrievers as the negative miner.
     > Result file (under path $path_to_generated_data): hard-negatives.jsonl;
  3. Finally, it does pseudo labeling with the powerful cross-encoders (we use cross-encoder/ms-marco-MiniLM-L-6-v2 by default) on the query-passage pairs that we have so far (for both positive and negative examples); a sketch of this step is shown after the list.
     > Result file (under path $path_to_generated_data): gpl-training-data.tsv. It contains (gpl_steps * batch_size_gpl) tuples in total.
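To make the pseudo-labeling step concrete, here is a minimal sketch using the sentence-transformers CrossEncoder API; the query and passage strings are made-up placeholders, and GPL's pipeline performs this step for you internally:

```python
# Sketch of step 3 (pseudo labeling), not GPL's internal code:
# score (query, positive) and (query, negative) pairs with a cross-encoder teacher.
from sentence_transformers import CrossEncoder

teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Are business trip costs tax deductible?"                       # generated query (placeholder)
positive = "Travel expenses for business trips are usually deductible."  # source passage (placeholder)
negative = "Credit card interest rates vary widely between issuers."     # mined hard negative (placeholder)

pos_score, neg_score = teacher.predict([(query, positive), (query, negative)])
margin = pos_score - neg_score  # this margin is the pseudo label used for MarginMSE training
print(f"CE margin: {margin:.4f}")
```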

Up to now, we have the actual training data ready. One can look at sample-data/generated/fiqa for a quick example of the data format. The very last step is to apply the MarginMSE loss to teach the student retriever to mimic the margin scores, CE(query, positive) - CE(query, negative), labeled by the teacher model (Cross-Encoder, CE). And of course, this MarginMSE step is included in GPL and will be done automatically:). Note that MarginMSE works with dot-product and thus the final models trained with GPL work with dot-product.
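For intuition, the MarginMSE objective can be written as a mean-squared error between the student's dot-product margin and the teacher's cross-encoder margin. The following is a hedged sketch in plain PyTorch, not the exact training code used by GPL:

```python
# Sketch of the MarginMSE objective (illustrative, not GPL's exact implementation):
# the student should reproduce the teacher margin CE(q, pos) - CE(q, neg)
# with its own dot-product margin dot(q, pos) - dot(q, neg).
import torch
import torch.nn.functional as F

def margin_mse_loss(q_emb, pos_emb, neg_emb, teacher_margin):
    # q_emb, pos_emb, neg_emb: (batch, dim) student embeddings
    # teacher_margin: (batch,) pseudo labels from the cross-encoder
    student_margin = (q_emb * pos_emb).sum(-1) - (q_emb * neg_emb).sum(-1)
    return F.mse_loss(student_margin, teacher_margin)

# toy shapes only
q, p, n = torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768)
labels = torch.randn(4)
print(margin_mse_loss(q, p, n, labels))
```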

PS: The --retrievers are for negative mining. They can be any dense retrievers trained on the general domain (e.g. MS MARCO) and do not need to be strong for the target task/domain. Please refer to the paper for more details (cf. Table 7).
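To illustrate the negative-mining idea, here is a hedged sketch using sentence-transformers semantic search with a placeholder corpus and query; GPL performs this step for you via the retrievers argument:

```python
# Sketch of step 2 (negative mining), not GPL's internal code:
# retrieve top passages for a generated query; non-positive hits become hard negatives.
from sentence_transformers import SentenceTransformer, util

miner = SentenceTransformer("msmarco-distilbert-base-v3")

corpus = [  # placeholder passages
    "Travel expenses for business trips are usually tax-deductible.",
    "Credit card interest rates vary widely between issuers.",
    "Stock options give the holder the right to buy shares at a fixed price.",
]
query = "Are business trip costs tax deductible?"  # placeholder generated query

corpus_emb = miner.encode(corpus, convert_to_tensor=True)
query_emb = miner.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
# In GPL, retrieved passages other than the query's source passage are kept as hard negatives.
for hit in hits:
    print(hit["score"], corpus[hit["corpus_id"]])
```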

Customized data

One can also place customized data for any intermediate step under the path $path_to_generated_data, following the same naming convention. GPL will then skip those intermediate steps and use the provided data.

As a typical workflow, one might only have the (English) unlabeled corpus and want a good model that performs well on this corpus. To run GPL training under such a condition, one just needs these steps (a small helper sketch for step 1 follows the command below):

  1. Prepare your corpus in the same format as the data sample;
  2. Put your corpus.jsonl under a folder, e.g. named "generated", for data loading and data generation by GPL;
  3. Call gpl.train with the folder path as an input argument (other arguments work as usual):

```bash
python -m gpl.train \
    --path_to_generated_data "generated" \
    --output_dir "output" \
    --new_size -1 \
    --queries_per_passage -1
```
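A minimal sketch of step 1, assuming your raw documents are plain Python strings; the field names follow the BeIR corpus format, and the IDs/titles here are placeholders to adapt to your data:

```python
# Minimal sketch (not part of GPL): write a custom corpus in the BeIR corpus.jsonl format.
import json
import os

raw_documents = [  # placeholder documents
    "First document text ...",
    "Second document text ...",
]

os.makedirs("generated", exist_ok=True)
with open("generated/corpus.jsonl", "w") as f:
    for i, text in enumerate(raw_documents):
        record = {"_id": str(i), "title": "", "text": text, "metadata": {}}
        f.write(json.dumps(record) + "\n")
```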

Pre-trained checkpoints and generated data

Pre-trained checkpoints

We now release the pre-trained GPL models via Hugging Face: https://huggingface.co/GPL. There are currently five types of models:

  1. GPL/${dataset}-msmarco-distilbert-gpl: Model with training order of (1) MarginMSE on MSMARCO -> (2) GPL on ${dataset};
  2. GPL/${dataset}-tsdae-msmarco-distilbert-gpl: Model with training order of (1) TSDAE on ${dataset} -> (2) MarginMSE on MSMARCO -> (3) GPL on ${dataset};
  3. GPL/msmarco-distilbert-margin-mse: Model trained on MSMARCO with MarginMSE;
  4. GPL/${dataset}-tsdae-msmarco-distilbert-margin-mse: Model with training order of (1) TSDAE on ${dataset} -> (2) MarginMSE on MSMARCO;
  5. GPL/${dataset}-distilbert-tas-b-gpl-self_miner: Starting from the tas-b model, the models were trained with GPL on the target corpus ${dataset} with the base model itself as the negative miner (here noted as "self_miner").

Models 1 and 2 were actually trained on top of models 3 and 4, respectively. All GPL models were trained with the automatic setting of new_size and queries_per_passage (by setting them to -1). This automatic setting keeps the performance while being efficient. For more details, please refer to Section 4.1 of the paper.

Among these models, the GPL/${dataset}-distilbert-tas-b-gpl-self_miner ones work the best on the BeIR benchmark.
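As a usage sketch, assuming the released checkpoints load as standard Sentence-Transformers models and are scored with dot-product (as noted above); the concrete checkpoint name below just instantiates the ${dataset} naming pattern from item 1 for FiQA and is illustrative:

```python
# Sketch (assumption: GPL checkpoints load as Sentence-Transformers models and use dot-product scoring).
# The checkpoint name instantiates the GPL/${dataset}-msmarco-distilbert-gpl pattern for dataset=fiqa.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("GPL/fiqa-msmarco-distilbert-gpl")

query_emb = model.encode("How are dividends taxed?", convert_to_tensor=True)        # placeholder query
doc_emb = model.encode("Qualified dividends are taxed at capital-gains rates.",     # placeholder passage
                       convert_to_tensor=True)

print(util.dot_score(query_emb, doc_emb))  # GPL models are trained with MarginMSE, so score with dot-product
```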

For reproducing the results with the same package versions used in the experiments, please refer to the conda environment file, environment.yml.

Generated data

We now release the generated data used in the experiments of the GPL paper:

  1. The generated data for the main experiments on the 6 BeIR datasets: https://public.ukp.informatik.tu-darmstadt.de/kwang/gpl/generated-data/main/;
  2. The generated data for the experiments on the full 18 BeIR datasets: https://public.ukp.informatik.tu-darmstadt.de/kwang/gpl/generated-data/beir.

Please note that the 4 datasets of bioasq, robust04, trec-news and signal1m are only available after registration with the original official authorities. We only release the document IDs for these corpora with the file name corpus.doc_ids.txt. For more details, please refer to the BeIR repository.

Citation

If you use the code for evaluation, feel free to cite our publication GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval:

```bibtex
@article{wang2021gpl,
    title   = "GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval",
    author  = "Kexin Wang and Nandan Thakur and Nils Reimers and Iryna Gurevych",
    journal = "arXiv preprint arXiv:2112.07577",
    month   = "4",
    year    = "2021",
    url     = "https://arxiv.org/abs/2112.07577",
}
```

Contact person and main contributor: Kexin Wang, kexin.wang.2049@gmail.com

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Owner

  • Name: Ubiquitous Knowledge Processing Lab
  • Login: UKPLab
  • Kind: organization
  • Location: Darmstadt, Germany

GitHub Events

Total
  • Watch event: 13
  • Fork event: 1
Last Year
  • Watch event: 13
  • Fork event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 44
  • Total Committers: 2
  • Avg Commits per committer: 22.0
  • Development Distribution Score (DDS): 0.023
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
kwang2049 k****9@g****m 43
dpetrak d****k@g****m 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 32
  • Total pull requests: 10
  • Average time to close issues: 23 days
  • Average time to close pull requests: 11 minutes
  • Total issue authors: 26
  • Total pull request authors: 3
  • Average comments per issue: 2.44
  • Average comments per pull request: 0.1
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Matthieu-Tinycoaching (4)
  • kingafy (2)
  • alsbhn (2)
  • junefeld (2)
  • iamyihwa (1)
  • GuodongFan (1)
  • houssine2000 (1)
  • edgar2597 (1)
  • junebug-junie (1)
  • ymurong (1)
  • HHousen (1)
  • wduo (1)
  • SnoozingSimian (1)
  • ahadda5 (1)
  • adrienohana (1)
Pull Request Authors
  • kwang2049 (8)
  • cmacdonald (1)
  • kbraun-axio (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 193 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 14
  • Total maintainers: 1
pypi.org: gpl

GPL is an unsupervised domain adaptation method for training dense retrievers. It is based on query generation and pseudo labeling with powerful cross-encoders. To train a domain-adapted model, it needs only the unlabeled target corpus and can achieve significant improvement over zero-shot models.

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 193 Last month
Rankings
Stargazers count: 3.8%
Forks count: 6.8%
Average: 8.7%
Dependent packages count: 10.0%
Downloads: 11.1%
Dependent repos count: 11.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • beir *
  • easy-elasticsearch >=0.0.7
environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 5.1
  • blas 1.0
  • bzip2 1.0.8
  • ca-certificates 2022.07.19
  • certifi 2021.5.30
  • cudatoolkit 11.3.1
  • dataclasses 0.8
  • ffmpeg 4.3
  • freetype 2.11.0
  • gmp 6.2.1
  • gnutls 3.6.15
  • intel-openmp 2022.1.0
  • jpeg 9e
  • lame 3.100
  • lcms2 2.12
  • ld_impl_linux-64 2.38
  • lerc 3.0
  • libdeflate 1.8
  • libffi 3.3
  • libgcc-ng 11.2.0
  • libgomp 11.2.0
  • libiconv 1.16
  • libidn2 2.3.2
  • libpng 1.6.37
  • libstdcxx-ng 11.2.0
  • libtasn1 4.16.0
  • libtiff 4.4.0
  • libunistring 0.9.10
  • libuv 1.40.0
  • libwebp-base 1.2.2
  • lz4-c 1.9.3
  • mkl 2020.2
  • mkl-service 2.3.0
  • mkl_fft 1.3.0
  • mkl_random 1.1.1
  • ncurses 6.3
  • nettle 3.7.3
  • numpy-base 1.19.2
  • olefile 0.46
  • openh264 2.1.1
  • openjpeg 2.4.0
  • openssl 1.1.1q
  • pip 21.2.2
  • python 3.6.13
  • pytorch 1.10.2
  • pytorch-mutex 1.0
  • readline 8.1.2
  • setuptools 58.0.4
  • six 1.16.0
  • sqlite 3.39.2
  • tk 8.6.12
  • torchaudio 0.10.2
  • torchvision 0.11.3
  • typing_extensions 4.1.1
  • wheel 0.37.1
  • xz 5.2.5
  • zlib 1.2.12
  • zstd 1.5.2