https://github.com/google-research/t5x_retrieval

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: google-research
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Size: 836 KB
Statistics
  • Stars: 101
  • Watchers: 6
  • Forks: 10
  • Open Issues: 4
  • Releases: 0
Created about 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme Contributing License

README.md

T5X Retrieval

T5X Retrieval is a JAX implementation of T5 (Text-to-Text Transfer Transformer) optimized for retrieval applications. It is built on top of T5X, the JAX implementation of T5. It targets Natural Language Understanding researchers as well as application developers who want to use the latest T5-based Transformer models for search, retrieval, and ranking applications in the JAX framework rather than TensorFlow.

T5X Retrieval is an efficient training and evaluation framework that supports transformer-based neural retrieval and ranking models such as sentence encoders and dense retrieval models. It supports multi-pod large model training, large cross-batch negatives and the capability to initialize from any pre-trained model trained using T5X.
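The large cross-batch negatives mentioned above refer to a contrastive setup in which every other document in the batch serves as a negative for each query. A minimal NumPy sketch of that in-batch softmax loss is below; the function name and shapes are illustrative, not the actual T5X Retrieval API, which shards this computation across pods.

```python
import numpy as np

def in_batch_softmax_loss(query_emb, doc_emb):
    """Contrastive loss where each query's positive is its aligned document
    and every other document in the batch acts as a negative.

    query_emb, doc_emb: [batch, dim] L2-normalized embeddings.
    """
    logits = query_emb @ doc_emb.T  # [batch, batch] similarity matrix
    # Row-wise softmax cross-entropy; the diagonal entries are the positives.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss_aligned = in_batch_softmax_loss(q, q)        # positives on the diagonal
loss_shuffled = in_batch_softmax_loss(q, q[::-1]) # positives misaligned
```

Larger batches supply more in-batch negatives per query, which is why multi-pod training with large cross-batch negatives tends to help retrieval quality.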

This release open-sources the training and inference code, including references to TFDS for training data, the model training code (Python, JAX, and Flaxformer), pre-trained models, and basic inference example code. This end-to-end example code accompanies the SentenceT5 and Generalizable T5 Retrieval models and includes their implementation and performance on relevant benchmarks.

What's here

  • configs/*.gin - Model configurations.
  • tasks.py - Task definitions that generate the dataset.
  • feature_converters.py - Converters that transform task features from the dataset into model features.
  • models.py - High-level models, such as DualEncoderDecoderModel, that take the feature converter outputs as inputs.

For more details about the training pipeline and task definitions, check out T5X and SeqIO.
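To make the role of a feature converter concrete, here is an illustrative sketch of the transformation it performs: mapping a seq-to-seq style task example onto the paired-encoder features a dual encoder consumes. The field names are hypothetical stand-ins, not the actual `feature_converters.py` schema.

```python
def dual_encoder_features(example):
    """Illustrative converter: a retrieval task emits 'inputs' (query tokens)
    and 'targets' (document tokens); the model instead expects one token
    stream per encoder tower. Field names here are hypothetical.
    """
    return {
        'left_encoder_input_tokens': example['inputs'],
        'right_encoder_input_tokens': example['targets'],
    }

feats = dual_encoder_features({'inputs': [5, 6, 1], 'targets': [9, 2, 1]})
```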

Quickstart (Recommended)

T5X Retrieval supports the training and evaluation options provided by T5X. It can be run with XManager on Vertex AI, which is a platform for training that creates TPU instances and runs code on the TPUs.

We briefly summarize the steps to quickly start training and inference jobs. You can find more details in the T5X Quickstart.

  1. Create a GCP project. Create the bucket to store data and models.

  2. Follow the pre-requisites and directions to install XManager.

  3. [Optional] GCP projects come with 8 cores by default, which is enough to run one training experiment on a single TPU host. Request TPU quota as required if you want to run multi-host training or multiple runs in parallel.

  4. Install all dependencies such as T5X, Flaxformer, TFDS.

  5. Launch the xmanager script located at t5x/scripts/xm_launch.py.

As a running example, we use the BEIR MS Marco dataset.

```sh
# Export GOOGLE_CLOUD_BUCKET_NAME to a proper value.
export GOOGLE_CLOUD_BUCKET_NAME=...
export TFDS_DATA_DIR=gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x_retrieval/data
export MODEL_DIR=gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x_retrieval/$(date +%Y%m%d)

# Install dependencies.
git clone https://github.com/google-research/t5x /tmp/t5x
git clone https://github.com/google-research/t5x_retrieval /tmp/t5x_retrieval
git clone https://github.com/google/flaxformer /tmp/flaxformer
git clone https://github.com/google/aqt.git /tmp/aqt

cd /tmp/t5x/

python3 t5x/scripts/xm_launch.py \
  --pip_install="apache_beam[gcp]" \
  --model_dir=gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x/msmarco_ft_$(date +%Y%m%d) \
  --tfds_data_dir=gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x/data \
  --project_dirs=/tmp/t5x_retrieval/t5x_retrieval,/tmp/flaxformer/flaxformer,/tmp/aqt/aqt \
  --gin_file=t5x_retrieval/configs/models/de_t5_base.gin \
  --gin.INITIAL_CHECKPOINT_PATH=\"gs://t5-data/pretrained_models/t5x/t5_base/checkpoint_999900\" \
  --gin_file=t5x_retrieval/configs/runs/finetune.gin \
  --gin.TRAIN_STEPS=1009900 \
  --gin.utils.create_learning_rate_scheduler.step_offset=999900 \
  --gin.utils.create_learning_rate_scheduler.warmup_steps=1000 \
  --gin.utils.create_learning_rate_scheduler.decay_factor=0.00000125 \
  --gin.USE_CACHED_TASKS=False \
  --gin.models.DualEncoderModel.use_negatives=False \
  --gin.train.eval_period=500 \
  --gin.utils.SaveCheckpointConfig.keep=10 \
  --gin.utils.SaveCheckpointConfig.period=500 \
  --gin.train/DatasetConfig.batch_size=512 \
  --gin.MIXTURE_OR_TASK_NAME="'beir_msmarco_retrieval'" \
  --gin.MIXTURE_OR_TASK_MODULE="'t5x_retrieval.tasks'" \
  --gin.TASK_FEATURE_LENGTHS="{'inputs': 64, 'targets': 256}"
```

Notes:

  • Check gs://$GOOGLE_CLOUD_BUCKET_NAME/t5x/ for the output artifacts, which can be read by TensorBoard.
  • Add --pip_install="apache_beam[gcp]" to the script if you have not downloaded the dataset beforehand.
  • TRAIN_STEPS = step_offset + real_train_steps, where step_offset is the step count of the loaded checkpoint and real_train_steps is the number of steps the model will actually be trained for.
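The step arithmetic in the example launch command works out as follows:

```python
# The loaded T5-base checkpoint (checkpoint_999900) stops at this step count.
step_offset = 999_900
# Fine-tuning steps we actually want to run on MS Marco.
real_train_steps = 10_000
# The value passed as --gin.TRAIN_STEPS in the launch command above.
train_steps = step_offset + real_train_steps
```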

Models

Sentence encoders

SentenceT5 is a family of high-performing sentence encoders trained using T5X Retrieval. The SentenceT5 models encode text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language processing tasks.

SentenceT5 models are built on top of the Text-To-Text Transfer Transformer (T5). They are trained on a variety of data sources and initialized from pre-trained T5 models of different sizes, as described in [1]. The input is variable-length English text and the output is a 768-dimensional vector. Note that there is no hard input length limit for T5 (i.e., no 512-token limit as in BERT), but the models have been trained to produce good embeddings for approximately sentence-length text.
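A common way to use such 768-dimensional sentence vectors for semantic similarity is cosine similarity. The sketch below uses random vectors as stand-ins; in practice each vector would come from a SentenceT5 checkpoint (the `encode` call named in the comment is hypothetical, not an API this repository exposes).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for encoder outputs, e.g. emb_a = encode("A sentence.")
# (hypothetical call) would yield a 768-dimensional vector.
rng = np.random.default_rng(42)
emb_a = rng.normal(size=768)
emb_b = rng.normal(size=768)

sim_self = cosine_similarity(emb_a, emb_a)  # identical inputs -> 1.0
sim_pair = cosine_similarity(emb_a, emb_b)
```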

Metrics

  • We evaluate this model on the SentEval sentence representation benchmark.

    Transfer tasks | MR | CR | SUBJ | MPQA | SST | TREC | MRPC | Average
    :------------- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ------:
    ST5-Base | 85.8 | 92.1 | 94.6 | 90.9 | 91.8 | 96.4 | 75.2 | 89.5
    ST5-Large | 88.9 | 93.5 | 95.4 | 91.5 | 94.2 | 96.2 | 77.1 | 91.0
    ST5-3B | 89.9 | 94.1 | 95.9 | 91.6 | 94.8 | 96.2 | 77.9 | 91.5
    ST5-11B | 90.8 | 94.4 | 96.3 | 91.7 | 94.8 | 95.4 | 77.9 | 91.6


    STS tasks | STS12 | STS13 | STS14 | STS15 | STS16 | STSb | SICK-R | Average
    :-------- | ----: | ----: | ----: | ----: | ----: | ---: | -----: | ------:
    ST5-Base | 78.1 | 85.8 | 82.2 | 87.5 | 84.0 | 86.0 | 79.8 | 83.3
    ST5-Large | 79.1 | 87.3 | 83.2 | 88.3 | 84.4 | 86.7 | 79.8 | 84.1
    ST5-3B | 79.0 | 88.8 | 84.3 | 88.9 | 85.3 | 86.3 | 79.5 | 84.6
    ST5-11B | 80.1 | 88.8 | 84.7 | 88.9 | 85.2 | 86.8 | 80.4 | 85.0

More details about the evaluations can be found in the paper [1].

Dense retrieval models

The Generalizable T5 Retrieval models are dual encoders that encode two pieces of text into two dense vectors respectively [2]. This is typically used to encode a query and a document to compute their similarity for dense retrieval.

GTR models are built on top of T5 (i.e. the Text-To-Text Transfer Transformer). The GTR-Base model employs a 12-layer transformer architecture, which is the same as the T5 base model. The model is first initialized from the pre-trained T5 checkpoint. It is then further pre-trained with a set of community question-answer pairs we collected. Finally, the model is fine-tuned on the MS Marco dataset.

The two encoders are shared so the GTR model functions as a single text encoder. The input is variable-length English text and the output is a 768-dimensional vector.
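Since the shared GTR encoder maps both queries and documents into the same 768-dimensional space, retrieval reduces to scoring each document vector against the query vector and taking the top-k. A minimal NumPy sketch, with toy low-dimensional embeddings standing in for real encoder outputs:

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, k=3):
    """Rank documents by dot-product similarity to the query embedding and
    return the indices of the top-k. With a GTR-style shared encoder,
    query_emb and each row of doc_embs come from the same text encoder."""
    scores = doc_embs @ query_emb   # [num_docs] similarity scores
    return np.argsort(-scores)[:k]  # highest score first

# 4 toy one-hot "document embeddings" (dim 4) and a query leaning toward doc 1.
docs = np.eye(4)
query = np.array([0.1, 0.9, 0.2, 0.0])
top = retrieve_top_k(query, docs, k=2)
```

In a real system the document matrix would be precomputed offline and served from an approximate nearest-neighbor index rather than a dense matrix product.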

Metrics

We evaluate on the BEIR benchmark and report the Recall@100.

Dataset \ Model | GTR-Base | GTR-Large | GTR-XL | GTR-XXL
---------------- | -------: | --------: | -----: | ------:
MS MARCO | 0.898 | 0.908 | 0.911 | 0.916
Trec-Covid | 0.411 | 0.434 | 0.457 | 0.407
BioASQ | 0.441 | 0.490 | 0.483 | 0.483
NFCorpus | 0.275 | 0.298 | 0.318 | 0.300
NQ | 0.893 | 0.930 | 0.936 | 0.946
HotpotQA | 0.676 | 0.725 | 0.739 | 0.752
FiQA-2018 | 0.670 | 0.742 | 0.755 | 0.780
Signal-1M | 0.263 | 0.261 | 0.268 | 0.268
Trec-News | 0.475 | 0.525 | 0.512 | 0.544
Robust04 | 0.324 | 0.365 | 0.364 | 0.372
ArguAna | 0.974 | 0.978 | 0.980 | 0.983
Touché-2020 | 0.281 | 0.282 | 0.297 | 0.301
Quora | 0.996 | 0.996 | 0.997 | 0.997
DBPedia-entity | 0.418 | 0.480 | 0.480 | 0.494
SCIDOCS | 0.340 | 0.358 | 0.358 | 0.366
Fever | 0.923 | 0.941 | 0.944 | 0.947
Climate-Fever | 0.522 | 0.552 | 0.569 | 0.556
SciFact | 0.872 | 0.899 | 0.911 | 0.900
CQADupStack | 0.681 | 0.714 | 0.729 | 0.740
Avg | 0.596 | 0.625 | 0.632 | 0.634
Avg w/o MS MARCO | 0.580 | 0.609 | 0.616 | 0.619
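For reference, the Recall@100 metric reported above is the fraction of a query's relevant documents that appear among the top 100 retrieved results, averaged over queries. A minimal sketch of the per-query computation (toy IDs, not BEIR data):

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=100):
    """Fraction of relevant documents appearing in the top-k results.
    Benchmark numbers average this per-query value over the dataset."""
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & set(relevant_ids)) / len(relevant_ids)

# Toy example: 2 of the 3 relevant docs fall within the top-5 ranking.
r = recall_at_k(['d3', 'd7', 'd1', 'd9', 'd2'], {'d1', 'd2', 'd8'}, k=5)
```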

Released Model Checkpoints

We have released the following checkpoints for SentenceT5 and GTR pre-trained models:

References

[1] Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, Yinfei Yang. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. ACL 2022.

[2] Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Zhao, Yi Luan, Keith B. Hall, Ming-wei Chang, Yinfei Yang. Large Dual Encoders Are Generalizable Retrievers. December 2021.

This is not an officially supported Google product.

Owner

  • Name: Google Research
  • Login: google-research
  • Kind: organization
  • Location: Earth

GitHub Events

Total
  • Watch event: 3
  • Fork event: 1
Last Year
  • Watch event: 3
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 26
  • Total Committers: 1
  • Avg Commits per committer: 26.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
T5X Retrieval Team n****y@g****m 26
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 6
  • Total pull requests: 0
  • Average time to close issues: 4 months
  • Average time to close pull requests: N/A
  • Total issue authors: 5
  • Total pull request authors: 0
  • Average comments per issue: 0.67
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • littlewine (1)
  • emilysilcock (1)
  • 1024er (1)
  • Chroner2 (1)
  • lhbonifacio (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels