flashrank

Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cross-encoders and more. Created by Prithivi Da, open for PRs & Collaborations.

https://github.com/prithivirajdamodaran/flashrank

Keywords

cross-encoder full-text-search hybrid-search lexical-search rag ranking reranking retrieval-augmented-generation semantic-search vector-database vector-search

Last synced: 6 months ago

Repository


Basic Info

Host: GitHub
Owner: PrithivirajDamodaran
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 2.47 MB

Statistics

Stars: 851
Watchers: 6
Forks: 62
Open Issues: 7
Releases: 3

Topics

cross-encoder full-text-search hybrid-search lexical-search rag ranking reranking retrieval-augmented-generation semantic-search vector-database vector-search

Created about 2 years ago · Last pushed 8 months ago

Metadata Files

Readme Funding License Citation

README.md

[![Downloads](https://static.pepy.tech/badge/flashrank)](https://pepy.tech/project/flashrank) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)]() [![license]( https://img.shields.io/badge/License-Apache-blue.svg)](https://opensource.org/licenses/Apache2.0) [![package]( https://img.shields.io/badge/Package-PYPI-blue.svg)](https://pypi.org/project/FlashRank/)[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11093524.svg)](https://doi.org/10.5281/zenodo.11093524)

Re-rank your search results with SoTA Pairwise or Listwise rerankers before feeding into your LLMs

Ultra-lite & Super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA LLMs and cross-encoders, with gratitude to all the model owners.

Supports:

- Pairwise / Pointwise rerankers. (Cross encoder based, i.e. Max tokens = 512)
- Listwise LLM based rerankers. (LLM based, i.e. Max tokens = 8192)

See below for full list of supported models.

⚡ Ultra-lite:
- No Torch or Transformers needed. Runs on CPU.
- Boasts the tiniest reranking model in the world, ~4MB.
⏱️ Super-fast:
- Rerank speed is a function of # of tokens in passages, query + model depth (layers)
- To give an idea, Time taken by the example (in code) using the default model is below.
- Detailed benchmarking, TBD
💸 $ conscious:
- Lowest $ per invocation: Serverless deployments like Lambda are charged by memory & time per invocation*
- Smaller package size = shorter cold start times, quicker re-deployments for Serverless.
🎯 Based on SoTA Cross-encoders and other models:
- "How good are Zero-shot rerankers?" - look at the reference section.

| Model Name | Description | Size | Notes |
|------------|-------------|------|-------|
| ms-marco-TinyBERT-L-2-v2 | Default model | ~4MB | Model card |
| ms-marco-MiniLM-L-12-v2 | Best Cross-encoder reranker | ~34MB | Model card |
| rank-T5-flan | Best non cross-encoder reranker | ~110MB | Model card |
| ms-marco-MultiBERT-L-12 | Multi-lingual, supports 100+ languages | ~150MB | Supported languages |
| ce-esci-MiniLM-L12-v2 | Fine-tuned on Amazon ESCI dataset | - | Model card |
| rank_zephyr_7b_v1_full | 4-bit-quantised GGUF | ~4GB | Model card |
| miniReranker_arabic_v1 | Only dedicated Arabic Reranker | - | Model card |

Models in roadmap:
- InRanker
Why are sleeker models preferred? Reranking is the final leg of larger retrieval pipelines, so the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, models with a really small footprint that need no specialised hardware and yet offer competitive performance are chosen. Feel free to raise issues to add support for new models as you see fit.
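To see how much overhead a reranking step adds to a user-facing request, one option is to time it directly. The sketch below is only an illustration: `time_calls` and `fake_rerank` are hypothetical names, and `fake_rerank` is a stand-in that should be swapped for a real `ranker.rerank(...)` call.

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], runs: int = 5) -> List[float]:
    """Call fn `runs` times and return each call's duration in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def fake_rerank() -> list:
    # Hypothetical stand-in for ranker.rerank(request); swap in the real call.
    return sorted(range(1000), reverse=True)


timings = time_calls(fake_rerank, runs=5)
print(f"median: {statistics.median(timings):.3f} ms, max: {max(timings):.3f} ms")
```

Reporting the median alongside the maximum helps separate a slow first call (model loading, cold caches) from steady-state latency.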

Installation:

If you need lightweight pairwise rerankers [default]

`pip install flashrank`

If you need LLM based listwise rerankers

`pip install flashrank[listwise]`

Making ranking faster:

max_length should be large enough to accommodate your longest passage. In other words, if your longest passage (100 tokens) plus query (16 tokens) comes to an estimated 116 tokens, then setting max_length = 128 is good enough, including room for reserved tokens like [CLS] and [SEP]. Use libraries like OpenAI's tiktoken to estimate token density if performance per token is critical for you. Nonchalantly giving a longer max_length like 512 for smaller passage sizes will negatively affect response time.
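The arithmetic above can be sketched as a small helper. This is an illustration, not part of flashrank: `suggest_max_length` is a hypothetical name, the three reserved tokens and the rounding to a multiple of 32 are assumptions, and the token counts are expected to come from whatever tokenizer estimate you use.

```python
def suggest_max_length(
    query_tokens: int,
    longest_passage_tokens: int,
    reserved_tokens: int = 3,  # assumption: [CLS] plus two [SEP] tokens
    multiple: int = 32,        # assumption: round up to a tidy multiple
    cap: int = 512,            # cross-encoder limit quoted in this README
) -> int:
    """Return a max_length that fits the longest query + passage pair."""
    needed = query_tokens + longest_passage_tokens + reserved_tokens
    rounded = -(-needed // multiple) * multiple  # ceiling to the next multiple
    return min(rounded, cap)


# Example from the text: 16-token query, 100-token longest passage.
print(suggest_max_length(16, 100))  # 128
```

The result can then be passed as `Ranker(max_length=...)`.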

Getting started:

```python
from flashrank import Ranker, RerankRequest

# Pick ONE of the rankers below; the alternatives are commented out.

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

# or: Small (~34MB), slightly slower & best performance (ranking precision).
# ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

# or: Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out of domain data.
# ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

# or: Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for english)
# ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

# or: listwise LLM reranker; adjust max_length based on your passage length.
# ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024)
```

```python
# Metadata is optional, Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
    {
        "id": 1,
        "text": "Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
        "meta": {"additional": "info1"}
    },
    {
        "id": 2,
        "text": "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
        "meta": {"additional": "info2"}
    },
    {
        "id": 3,
        "text": "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
        "meta": {"additional": "info3"}
    },
    {
        "id": 4,
        "text": "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
        "meta": {"additional": "info4"}
    },
    {
        "id": 5,
        "text": "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
        "meta": {"additional": "info5"}
    }
]

rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)
```

```python
# Reranked output from default reranker
[
    {
        "id": 4,
        "text": "Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
        "meta": {"additional": "info4"},
        "score": 0.016847236
    },
    {
        "id": 5,
        "text": "vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
        "meta": {"additional": "info5"},
        "score": 0.011563735
    },
    {
        "id": 3,
        "text": "There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
        "meta": {"additional": "info3"},
        "score": 0.00081340264
    },
    {
        "id": 1,
        "text": "Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
        "meta": {"additional": "info1"},
        "score": 0.00063596206
    },
    {
        "id": 2,
        "text": "LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
        "meta": {"additional": "info2"},
        "score": 0.00024851
    }
]
```
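Since the stated goal is to rerank results before feeding them to an LLM, a common next step is to keep only the best few passages. The sketch below assumes the result shape shown in the output above (dicts with `id`, `text` and `score`); `build_context`, `top_k` and `min_score` are hypothetical names, and the texts are shortened for brevity.

```python
from typing import Dict, List


def build_context(results: List[Dict], top_k: int = 3, min_score: float = 0.0) -> str:
    """Join the text of the best-scoring passages into one context string."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    kept = [r for r in ranked[:top_k] if r["score"] >= min_score]
    return "\n\n".join(f"[{r['id']}] {r['text']}" for r in kept)


# Abbreviated copy of the reranked output (same ids and scores, texts shortened).
results = [
    {"id": 4, "text": "Medusa ...", "score": 0.016847236},
    {"id": 5, "text": "vLLM ...", "score": 0.011563735},
    {"id": 3, "text": "Llama 2 tips ...", "score": 0.00081340264},
    {"id": 1, "text": "Lookahead decoding ...", "score": 0.00063596206},
    {"id": 2, "text": "vllm paper ...", "score": 0.00024851},
]

context = build_context(results, top_k=3, min_score=0.001)
print(context)
```

With these settings only passages 4 and 5 survive, because passage 3 falls below the score threshold. A suitable threshold depends on the model, since score scales are not comparable across rerankers.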

You can use it with any search & retrieval pipeline:

- Lexical Search (regular DBs that support full-text search or an inverted index)
- Semantic Search / RAG use cases (VectorDBs)
- Hybrid Search
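For the hybrid case, candidates from several retrievers need to be merged into one passage list before reranking. The following is a minimal sketch under stated assumptions: the hit fields `doc_id`, `content` and `source` and the helper `to_passages` are hypothetical, and only the output shape (`id`, `text`, `meta`) follows the passage format used in this README.

```python
from typing import Dict, Iterable, List


def to_passages(*hit_lists: Iterable[Dict]) -> List[Dict]:
    """Merge hits from several retrievers into passage dicts,
    keeping the first occurrence of each document id."""
    seen = set()
    passages = []
    for hits in hit_lists:
        for hit in hits:
            if hit["doc_id"] in seen:
                continue
            seen.add(hit["doc_id"])
            passages.append({
                "id": hit["doc_id"],
                "text": hit["content"],
                "meta": {"source": hit["source"]},
            })
    return passages


# Hypothetical hits from a lexical retriever and a vector retriever.
lexical_hits = [
    {"doc_id": 1, "content": "Lookahead decoding ...", "source": "lexical"},
    {"doc_id": 5, "content": "vLLM is a fast library ...", "source": "lexical"},
]
vector_hits = [
    {"doc_id": 5, "content": "vLLM is a fast library ...", "source": "vector"},
    {"doc_id": 4, "content": "Medusa ...", "source": "vector"},
]

passages = to_passages(lexical_hits, vector_hits)
print([p["id"] for p in passages])  # [1, 5, 4]
```

The merged list can then be passed as `RerankRequest(query=..., passages=passages)`, so the reranker produces a single ordering across both retrievers.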

Deployment patterns

How to use it in an AWS Lambda function?

In AWS or other serverless environments the entire VM is read-only, so you might have to create your own custom directory. You can do so in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). Pass it during init with the cache_dir parameter.

`ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")`
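A handler built around this might look like the sketch below. Several details are assumptions rather than anything this README prescribes: the event is taken to carry a JSON string under `body` (as in an API Gateway proxy-style integration), `FLASHRANK_CACHE_DIR` is a hypothetical environment variable, and `get_ranker` and `handler` are illustrative names. Only `Ranker(model_name=..., cache_dir=...)`, `RerankRequest(query=..., passages=...)` and `ranker.rerank(...)` come from the examples in this README.

```python
import json
import os

# Hypothetical env var; falls back to the directory used in this README.
CACHE_DIR = os.environ.get("FLASHRANK_CACHE_DIR", "/opt")

_ranker = None  # reused across warm invocations of the same container


def get_ranker():
    """Create the ranker once and reuse it on later (warm) calls."""
    global _ranker
    if _ranker is None:
        # Imported lazily in this sketch; a module-level import works too.
        from flashrank import Ranker
        _ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir=CACHE_DIR)
    return _ranker


def handler(event, context):
    """Rerank the passages in the request body and return them as JSON."""
    from flashrank import RerankRequest

    body = json.loads(event["body"])  # assumed shape: {"query": ..., "passages": [...]}
    request = RerankRequest(query=body["query"], passages=body["passages"])
    results = get_ranker().rerank(request)
    # default=float converts score values that are not JSON-native, if any.
    return {"statusCode": 200, "body": json.dumps(results, default=float)}
```

Keeping the ranker in a module-level variable means the model is loaded once per container, so only cold starts pay the loading cost.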

References:

In-domain and Zeroshot performance of Cross Encoders fine-tuned on MS-MARCO

In-domain and Zeroshot performance of RankT5 fine-tuned on MS-MARCO

How to Cite?

To cite this repository in your work, please click the "cite this repository" link on the right side (below the repo description and tags).

Papers citing flashrank

[IMPORTANT UPDATE]

~~A clone library called *SwiftRank is pointing to our model buckets, we are working on a interim solution to avoid this stealing*. Thank you for patience and understanding.~~

This issue is resolved; the models are now on HF. Please upgrade to continue: `pip install -U flashrank`. Thank you for your patience and understanding.

Owner

Name: Prithivida
Login: PrithivirajDamodaran
Kind: user
Location: Bangkok

Website: https://www.youtube.com/@prithivida
Repositories: 16
Profile: https://github.com/PrithivirajDamodaran

Dense, Sparse and Hybrid Embeddings for LLMs, Multimodal Modelling & Data Engineering. Check out my YouTube series on V+L (linked below).

Citation (CITATION.cff)

cff-version: 1.2.0
message: Please cite it as below.
title: FlashRank, Lightest and Fastest 2nd Stage Reranker for search pipelines.
doi: 10.5281/zenodo.10426927
date-released: 23-Dec-2023

authors:
  - family-names: Damodaran
    given-names: Prithiviraj
    affiliation: Independent Researcher

version: 1.0.0
url: https://github.com/PrithivirajDamodaran/FlashRank

GitHub Events

Total

Create event: 1
Issues event: 4
Release event: 1
Watch event: 200
Issue comment event: 12
Push event: 1
Pull request review event: 3
Pull request event: 3
Fork event: 13

Last Year

Create event: 1
Issues event: 4
Release event: 1
Watch event: 200
Issue comment event: 12
Push event: 1
Pull request review event: 3
Pull request event: 3
Fork event: 13

Committers

Last synced: 9 months ago

All Time

Total Commits: 130
Total Committers: 7
Avg Commits per committer: 18.571
Development Distribution Score (DDS): 0.346

Past Year

Commits: 31
Committers: 5
Avg Commits per committer: 6.2
Development Distribution Score (DDS): 0.226

Top Committers

Name	Email	Commits
Prithivi Da	p**a@P**l	85
Prithivi Da	P****a	24
Prithivida	d**j@g**m	17
Prabhkaran Singh	p**u@g**m	1
Gustavo Pinto	g**o@z**r	1
Agamdeep Singh	6****0	1
Ivan Vlasov	i**v@r**m	1

Committer Domains (Top 20 + Academic)

raftds.com: 1
zup.com.br: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 28
Total pull requests: 13
Average time to close issues: 17 days
Average time to close pull requests: 27 days
Total issue authors: 26
Total pull request authors: 10
Average comments per issue: 1.82
Average comments per pull request: 1.85
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 4
Pull requests: 4
Average time to close issues: 2 days
Average time to close pull requests: 12 days
Issue authors: 4
Pull request authors: 3
Average comments per issue: 0.75
Average comments per pull request: 1.25
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

andy1xx8 (2)
M1ha-Shvn (2)
chrispy-snps (1)
anudit (1)
albertopasqualetto (1)
vipinap98 (1)
JarbasAl (1)
lsorber (1)
sudhanshu746 (1)
alexandruvesa (1)
kingsene19 (1)
maruthiprithivi (1)
congdaoduy298 (1)
khanzzirfan (1)
pmontu (1)

Pull Request Authors

nmohr192 (6)
IvVlasov (2)
kaizer-rb (2)
prabhkaran (2)
sudhanshu746 (2)
gustavopintozup (2)
gnought (2)
srimouli04 (2)
jnash10 (2)
synacktraa (1)

Top Labels

Issue Labels

out-of-scope (1)
invalid (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 86,518 last-month

Total dependent packages: 1
Total dependent repositories: 0
Total versions: 27
Total maintainers: 1

pypi.org: flashrank

Ultra lite & Super fast SoTA cross-encoder based re-ranking for your search & retrieval pipelines.

Homepage: https://github.com/PrithivirajDamodaran/FlashRank
Documentation: https://flashrank.readthedocs.io/
License: Apache 2.0
Latest release: 0.2.10
published about 1 year ago

Versions: 27
Dependent Packages: 1
Dependent Repositories: 0
Downloads: 86,518 Last month

Rankings

Dependent packages count: 10.1%

Average: 38.7%

Dependent repos count: 67.2%

Maintainers (1)

prithivida

Last synced: 6 months ago

flashrank

Science Score: 67.0%
