sgpt

SGPT: GPT Sentence Embeddings for Semantic Search

https://github.com/muennighoff/sgpt

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary

Keywords

gpt information-retrieval language-model large-language-models neural-search retrieval semantic-search sentence-embeddings sgpt text-embedding
Last synced: 6 months ago

Repository

SGPT: GPT Sentence Embeddings for Semantic Search

Basic Info
Statistics
  • Stars: 867
  • Watchers: 8
  • Forks: 54
  • Open Issues: 28
  • Releases: 0
Topics
gpt information-retrieval language-model large-language-models neural-search retrieval semantic-search sentence-embeddings sgpt text-embedding
Created about 4 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

SGPT: GPT Sentence Embeddings for Semantic Search

This repository contains code, results & pre-trained models for the paper SGPT: GPT Sentence Embeddings for Semantic Search.

**************************** Updates ****************************

  • 2024-02: We released GRIT & GritLM - These models unify SGPT Bi-Encoders, Cross-Encoders, symmetric, asymmetric, and regular GPT (i.e. generation) all in 1 single model at much better performance on all accounts. We recommend switching to these new models :)
  • 2022-09: SGPT Bi-Encoders are now easy to use with Sentence Transformers, see new scripts
  • 2022-08: Multilingual BLOOM SGPT models were released: Asymmetric, 7.1B parameters & Symmetric, 1.7B parameters. Feel free to open an issue if you need a different model.
  • 2022-06: OpenAI released the mechanism of their Search Endpoint that we compared to SGPT Cross-Encoders in the paper. Our methods are very similar. Feel free to test their prompt as seen in crossencoder/beir/openai_search_endpoint_functionality.py!
  • 2022-03: 5.8B Bi-Encoder models are now 4% & 1% better on USEB & BEIR, respectively. Paper & models on HF have been updated. This was done by using larger batch sizes with GradCache, see the paper for more info. If you have previously downloaded them, we recommend replacing them with the new version.
  • 2022-02: We released our paper. Check it out! :)

Overview

We present SGPT-BE and SGPT-CE for applying GPT models as Bi-Encoders or Cross-Encoders to symmetric or asymmetric search. SGPT-BE produces semantically meaningful sentence embeddings by contrastive fine-tuning of only bias tensors and position-weighted mean pooling. SGPT-CE uses log probabilities from GPT models without any fine-tuning. An illustration of the two methods can be found in other/sgpt_graphic.png.
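
For quick reference, the position-weighted mean pooling used by SGPT-BE boils down to a few lines of torch. The helper below is a minimal sketch mirroring the Huggingface examples further down (the function name is ours):

```python
import torch

def weighted_mean_pooling(last_hidden_state, attention_mask):
    # last_hidden_state: [bs, seq_len, hid_dim]; attention_mask: [bs, seq_len]
    # Positions get linearly increasing weights 1..seq_len, so late tokens
    # (which have attended to the whole sequence in a causal model) count more.
    weights = (
        torch.arange(1, last_hidden_state.shape[1] + 1, device=last_hidden_state.device)
        .float()
        .view(1, -1, 1)
        .expand(last_hidden_state.size())
    )
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    return torch.sum(last_hidden_state * mask * weights, dim=1) / torch.sum(mask * weights, dim=1)
```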

Feel free to open an issue should you have any questions~

Structure

```bash
.
├── biencoder  # Training & Inference of Bi-Encoders
│   ├── beir
│   │   ├── custommodels  # Directory providing BEIR compatibility for asymmetric models & models with special tokens
│   │   │   └── ...
│   │   ├── io_utils  # Exclusively used for beir_openai_embeddings_batched_parallel.py
│   │   │   └── ...
│   │   ├── parallelizer  # Exclusively used for beir_openai_embeddings_batched_parallel.py
│   │   │   └── ...
│   │   ├── beir_dense_retriever.py
│   │   ├── beir_openai_embeddings_batched_parallel.py
│   │   ├── requirements.txt
│   │   ├── *.bash  # Bash scripts to run multiple experiments
│   │   └── README.md
│   ├── nli_msmarco
│   │   ├── sentence-transformers  # An adapted version of sentence-transformers - Install this version for all biencoder experiments
│   │   │   └── ...
│   │   └── README.md
│   └── useb
│       ├── useb
│       │   └── ...
│       ├── *.bash  # Bash scripts to run multiple experiments
│       ├── useb_dense_retriever.py
│       └── README.md
├── crossencoder  # Inference of Cross-Encoders
│   └── beir
│       ├── *.ipynb  # Notebooks explained in the README
│       └── README.md
├── other
│   ├── sgpt_graphic.png
│   └── sgpt_utils.ipynb  # Code for creating the graphs in the paper & other
├── requirements.txt
└── README.md
```

Each data sub-directory provides its own README with an overview of its Structure, Downloads (Datasets, Models) & Commands used to produce the datasets, models & other things. Generally, you can find all models at https://huggingface.co/Muennighoff and json results in various datasets at https://www.kaggle.com/muennighoff/datasets. Model names are explained in their Huggingface READMEs. Dataset names are explained in the sub-folders of this repository.
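
If you prefer to enumerate the released checkpoints programmatically instead of browsing the Hub, a sketch like the following should work (using huggingface_hub; filtering on "SGPT" in the name is our own assumption about the naming scheme):

```python
from huggingface_hub import list_models

# List models under the Muennighoff namespace and keep those with "SGPT" in the name.
sgpt_models = sorted(m.id for m in list_models(author="Muennighoff") if "SGPT" in m.id)
print("\n".join(sgpt_models))
```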

Use SGPT with Huggingface

Below we provide python examples to use the pre-trained models for your own semantic search use case. We highly recommend replacing the model names with larger models, e.g. Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit for biencoder/symmetric.

Bi-Encoder

Symmetric Semantic Search BE

```python
import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine

# Get our models - The package will take care of downloading the models automatically
# For best performance: Muennighoff/SGPT-5.8B-weightedmean-nli-bitfit
tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)
model.eval()

# Tokenize input texts
texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]
batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    # Get hidden state of shape [bs, seq_len, hid_dim]
    last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

# Get weights of shape [bs, seq_len, hid_dim]
weights = (
    torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
    .unsqueeze(0)
    .unsqueeze(-1)
    .expand(last_hidden_state.size())
    .float().to(last_hidden_state.device)
)

# Get attn mask of shape [bs, seq_len, hid_dim]
input_mask_expanded = (
    batch_tokens["attention_mask"]
    .unsqueeze(-1)
    .expand(last_hidden_state.size())
    .float()
)

# Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

embeddings = sum_embeddings / sum_mask

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
cosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[3], cosine_sim_0_3))
```

Asymmetric Semantic Search BE

```python
import torch
from transformers import AutoModel, AutoTokenizer
from scipy.spatial.distance import cosine

# Get our models - The package will take care of downloading the models automatically
# For best performance: Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit
tokenizer = AutoTokenizer.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
model = AutoModel.from_pretrained("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")
# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)
model.eval()

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×10^14 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

SPECB_QUE_BOS = tokenizer.encode("[", add_special_tokens=False)[0]
SPECB_QUE_EOS = tokenizer.encode("]", add_special_tokens=False)[0]

SPECB_DOC_BOS = tokenizer.encode("{", add_special_tokens=False)[0]
SPECB_DOC_EOS = tokenizer.encode("}", add_special_tokens=False)[0]

def tokenize_with_specb(texts, is_query):
    # Tokenize without padding
    batch_tokens = tokenizer(texts, padding=False, truncation=True)
    # Add special brackets & pay attention to them
    for seq, att in zip(batch_tokens["input_ids"], batch_tokens["attention_mask"]):
        if is_query:
            seq.insert(0, SPECB_QUE_BOS)
            seq.append(SPECB_QUE_EOS)
        else:
            seq.insert(0, SPECB_DOC_BOS)
            seq.append(SPECB_DOC_EOS)
        att.insert(0, 1)
        att.append(1)
    # Add padding
    batch_tokens = tokenizer.pad(batch_tokens, padding=True, return_tensors="pt")
    return batch_tokens

def get_weightedmean_embedding(batch_tokens, model):
    # Get the embeddings
    with torch.no_grad():
        # Get hidden state of shape [bs, seq_len, hid_dim]
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

    # Get weights of shape [bs, seq_len, hid_dim]
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float().to(last_hidden_state.device)
    )

    # Get attn mask of shape [bs, seq_len, hid_dim]
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings

query_embeddings = get_weightedmean_embedding(tokenize_with_specb(queries, is_query=True), model)
doc_embeddings = get_weightedmean_embedding(tokenize_with_specb(docs, is_query=False), model)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[0][:20] + "...", cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[1][:20] + "...", cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[2][:20] + "...", cosine_sim_0_3))
```

Cross-Encoder

Asymmetric Semantic Search CE

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from scipy.spatial.distance import cosine

# Get models - The package will take care of downloading the models automatically
# For best performance: EleutherAI/gpt-j-6B
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
# Deactivate Dropout (There is no dropout in the above models so it makes no difference here but other SGPT models may have dropout)
model.eval()

prompt = 'Documents are searched to find matches with the same content.\nThe document "{}" is a good search result for "'

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×10^14 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

for query in queries:
    print(f"Query: {query}")
    for doc in docs:
        context = prompt.format(doc)

        context_enc = tokenizer.encode(context, add_special_tokens=False)
        continuation_enc = tokenizer.encode(query, add_special_tokens=False)
        # Slice off the last token, as we take its probability from the one before
        model_input = torch.tensor(context_enc + continuation_enc[:-1])
        continuation_len = len(continuation_enc)
        input_len, = model_input.shape

        # [seq_len] -> [seq_len, vocab]
        logprobs = torch.nn.functional.log_softmax(model(model_input)[0], dim=-1).cpu()
        # [seq_len, vocab] -> [continuation_len, vocab]
        logprobs = logprobs[input_len - continuation_len:]
        # Gather the log probabilities of the continuation tokens -> [continuation_len]
        logprobs = torch.gather(logprobs, 1, torch.tensor(continuation_enc).unsqueeze(-1)).squeeze(-1)
        score = torch.sum(logprobs)
        # The higher (closer to 0), the more similar
        print(f"Document: {doc[:20] + '...'} Score: {score}")
```

Symmetric Semantic Search CE

You can use the same code as in the above CE-Asym section but change the prompt. Feel free to share prompts that work well :)
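
For instance, a symmetric prompt could present both texts as interchangeable, such as duplicate questions. The prompt below is a hypothetical illustration to adapt, not one evaluated in the paper:

```python
# Hypothetical symmetric prompt - swap it in for the asymmetric prompt in the
# Cross-Encoder script above. Not evaluated in the paper; adapt as needed.
prompt = 'Questions are compared to find duplicates with the same meaning.\nThe question "{}" is a duplicate of "'
```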

Use SGPT with Sentence Transformers

Bi-Encoder ST

Symmetric Semantic Search BE ST

Symmetric models are now 100% compatible with the latest sentence-transformers via pip install git+https://github.com/UKPLab/sentence-transformers.git. You should get the same results as in the HuggingFace script above.

```python
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

model = SentenceTransformer("Muennighoff/SGPT-125M-weightedmean-nli-bitfit")
embeddings = model.encode(texts)

cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
cosine_sim_0_3 = 1 - cosine(embeddings[0], embeddings[3])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[3], cosine_sim_0_3))
```

Asymmetric Semantic Search BE ST

SGPT Sentence Transformers

Install: pip install --upgrade git+https://github.com/Muennighoff/sentence-transformers.git@sgpt_poolings_specb

Then use the below, which produces the exact same scores as the HuggingFace solution above.

```python
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×10^14 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

class SentenceTransformerSpecb(SentenceTransformer):
    # Requires:
    # pip install git+https://github.com/Muennighoff/sentence-transformers.git@sgpt_poolings_specb
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        tokens = ["[SOS]", "{SOS}"]
        self._first_module().tokenizer.add_tokens(tokens, special_tokens=True)
        self._first_module().auto_model.resize_token_embeddings(len(self._first_module().tokenizer))
        # Will be replaced with the rep tokens in the model ones
        # The problem is we don't know if a text is query or document when tokenizing in the Transformer.py module,
        # so we use the SOS tokens as an identifier if we have a query or document at hand & then replace them
        # If we would directly use the brackets here, they may become part of another token
        self._first_module().bos_spec_token_q = self._first_module().tokenizer.encode("[SOS]", add_special_tokens=False)[0]
        self._first_module().bos_spec_token_d = self._first_module().tokenizer.encode("{SOS}", add_special_tokens=False)[0]
        self._first_module().bos_spec_token_q_rep = self._first_module().tokenizer.encode("[", add_special_tokens=False)[0]
        self._first_module().eos_spec_token_q = self._first_module().tokenizer.encode("]", add_special_tokens=False)[0]
        self._first_module().bos_spec_token_d_rep = self._first_module().tokenizer.encode("{", add_special_tokens=False)[0]
        self._first_module().eos_spec_token_d = self._first_module().tokenizer.encode("}", add_special_tokens=False)[0]
        self._first_module().replace_bos = True

    def encode(self, sentences, **kwargs):
        is_query = kwargs.pop("is_query", True)
        if is_query:
            sentences = "[SOS]" + sentences if isinstance(sentences, str) else ["[SOS]" + sent for sent in sentences]
        else:
            sentences = "{SOS}" + sentences if isinstance(sentences, str) else ["{SOS}" + sent for sent in sentences]
        return super().encode(sentences, **kwargs)

model = SentenceTransformerSpecb("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(docs, is_query=False)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[0][:20] + "...", cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[1][:20] + "...", cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[2][:20] + "...", cosine_sim_0_3))
```

Original Sentence Transformers

If you want to use the Sentence Transformers at https://github.com/UKPLab/sentence-transformers, you can use the below. Make sure to use the latest version (pip install --upgrade git+https://github.com/UKPLab/sentence-transformers.git). Note that this will produce slightly worse scores than SGPT Sentence Transformers, as the special brackets may get intermingled with other tokens upon tokenization. On SciFact (BEIR) NDCG@10 of the below decreases to 0.566 from 0.569 for SGPT-125M-weightedmean-msmarco-specb-bitfit.

```python
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

queries = [
    "I'm searching for a planet not too far from Earth.",
]

docs = [
    "Neptune is the eighth and farthest-known Solar planet from the Sun. In the Solar System, it is the fourth-largest planet by diameter, the third-most-massive planet, and the densest giant planet. It is 17 times the mass of Earth, slightly more massive than its near-twin Uranus.",
    "TRAPPIST-1d, also designated as 2MASS J23062928-0502285 d, is a small exoplanet (about 30% the mass of the earth), which orbits on the inner edge of the habitable zone of the ultracool dwarf star TRAPPIST-1 approximately 40 light-years (12.1 parsecs, or nearly 3.7336×10^14 km) away from Earth in the constellation of Aquarius.",
    "A harsh desert world orbiting twin suns in the galaxy’s Outer Rim, Tatooine is a lawless place ruled by Hutt gangsters. Many settlers scratch out a living on moisture farms, while spaceport cities such as Mos Eisley and Mos Espa serve as home base for smugglers, criminals, and other rogues.",
]

class SentenceTransformerSpecb(SentenceTransformer):
    def encode(self, sentences, **kwargs):
        is_query = kwargs.pop("is_query", True)
        if is_query:
            sentences = "[" + sentences + "]" if isinstance(sentences, str) else ["[" + sent + "]" for sent in sentences]
        else:
            sentences = "{" + sentences + "}" if isinstance(sentences, str) else ["{" + sent + "}" for sent in sentences]
        return super().encode(sentences, **kwargs)

model = SentenceTransformerSpecb("Muennighoff/SGPT-125M-weightedmean-msmarco-specb-bitfit")

query_embeddings = model.encode(queries, is_query=True)
doc_embeddings = model.encode(docs, is_query=False)

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(query_embeddings[0], doc_embeddings[0])
cosine_sim_0_2 = 1 - cosine(query_embeddings[0], doc_embeddings[1])
cosine_sim_0_3 = 1 - cosine(query_embeddings[0], doc_embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[0][:20] + "...", cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[1][:20] + "...", cosine_sim_0_2))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (queries[0], docs[2][:20] + "...", cosine_sim_0_3))
```

Acknowledgements

We thank Constantin Eichenberg and Samuel Weinbach for insightful discussions and valuable feedback throughout the project. We thank Robert Baldock, Marco Bellagente and Koen Oostermeijer for reading drafts of the paper. This work has been supported by OpenAI under the academic access program. This work would not have been possible without:

  • UKPLab: SBERT, BEIR, USEB
  • Eleuther AI Models
  • Huggingface Transformers

Citation

Feel free to cite our paper if SGPT is helpful to you :)

```bibtex
@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}
```

Owner

  • Name: Niklas Muennighoff
  • Login: Muennighoff
  • Kind: user
  • Location: Beijing
  • Company: PKU

Citation (CITATION.bib)

@article{muennighoff2022sgpt,
  title={SGPT: GPT Sentence Embeddings for Semantic Search},
  author={Muennighoff, Niklas},
  journal={arXiv preprint arXiv:2202.08904},
  year={2022}
}

GitHub Events

Total
  • Watch event: 27
  • Fork event: 3
Last Year
  • Watch event: 27
  • Fork event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 58
  • Total Committers: 3
  • Avg Commits per committer: 19.333
  • Development Distribution Score (DDS): 0.069
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Muennighoff n****f@g****m 54
Akshaj Jain a****n@g****m 2
Oracle Public Cloud User o****c@r****m 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 83
  • Total pull requests: 10
  • Average time to close issues: 25 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 36
  • Total pull request authors: 3
  • Average comments per issue: 3.28
  • Average comments per pull request: 0.8
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • rajarajanvakil (2)
  • Kartali-Mohamed (2)
  • guotong1988 (2)
  • aksj98 (2)
  • regstuff (2)
  • asenasen123 (2)
  • ashokrajab (1)
  • ttjjlw (1)
  • shafkat-07 (1)
  • wing7171 (1)
  • ennioferreirab (1)
  • hongshanli23 (1)
  • cm2435 (1)
  • rut00 (1)
  • KnutJaegersberg (1)
Pull Request Authors
  • Muennighoff (4)
  • aksj98 (1)
  • TrellixVulnTeam (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

biencoder/beir/parallelizer/requirements.txt pypi
  • more-itertools ==8.8.0
  • tqdm ==4.61.0
biencoder/beir/requirements.txt pypi
  • beir ==0.2.3
  • more-itertools ==8.8.0
  • retry ==0.9.2
  • tqdm ==4.61.0
biencoder/nli_msmarco/sentence-transformers/requirements.txt pypi
  • huggingface-hub *
  • nltk *
  • numpy *
  • scikit-learn *
  • scipy *
  • sentencepiece *
  • tokenizers >=0.10.3
  • torch >=1.6.0
  • torchvision *
  • tqdm *
  • transformers >=4.6.0,<5.0.0
biencoder/nli_msmarco/sentence-transformers/setup.py pypi
  • huggingface-hub *
  • nltk *
  • numpy *
  • scikit-learn *
  • scipy *
  • sentencepiece *
  • tokenizers >=0.10.3
  • torch >=1.6.0
  • torchvision *
  • tqdm *
  • transformers >=4.6.0,<5.0.0
biencoder/useb/useb/setup.py pypi
  • pytrec_eval *
  • sentence-transformers >=1.2.0
requirements.txt pypi
  • beir ==0.2.3
  • openai ==0.11.4
  • pytorch ==1.10.1
biencoder/nli_msmarco/sentence-transformers/sentence_transformers/losses/GradCache/setup.py pypi