flashrank

Lite & Super-fast re-ranking for your search & retrieval pipelines. Supports SoTA Listwise and Pairwise reranking based on LLMs and cross-encoders and more. Created by Prithivi Da, open for PRs & Collaborations.

https://github.com/prithivirajdamodaran/flashrank

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

cross-encoder full-text-search hybrid-search lexical-search rag ranking reranking retrieval-augmented-generation semantic-search vector-database vector-search
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: PrithivirajDamodaran
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.47 MB
Statistics
  • Stars: 851
  • Watchers: 6
  • Forks: 62
  • Open Issues: 7
  • Releases: 3
Created about 2 years ago · Last pushed 8 months ago
Metadata Files
Readme Funding License Citation

README.md

[![Downloads](https://static.pepy.tech/badge/flashrank)](https://pepy.tech/project/flashrank) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)]() [![license]( https://img.shields.io/badge/License-Apache-blue.svg)](https://opensource.org/licenses/Apache2.0) [![package]( https://img.shields.io/badge/Package-PYPI-blue.svg)](https://pypi.org/project/FlashRank/)[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11093524.svg)](https://doi.org/10.5281/zenodo.11093524)

Re-rank your search results with SoTA Pairwise or Listwise rerankers before feeding into your LLMs

Ultra-lite & Super-fast Python library to add re-ranking to your existing search & retrieval pipelines. It is based on SoTA LLMs and cross-encoders, with gratitude to all the model owners.

Supports:

  • Pairwise / Pointwise rerankers. (Cross encoder based, i.e. Max tokens = 512)
  • Listwise LLM based rerankers. (LLM based, i.e. Max tokens = 8192)
  • See below for full list of supported models.

Table of Contents

  1. Features
  2. Installation
  3. Making ranking faster
  4. Getting started
  5. Deployment patterns
  6. How to Cite?
  7. Papers citing flashrank

Features

  1. Ultra-lite:

    • No Torch or Transformers needed. Runs on CPU.
    • Boasts the tiniest reranking model in the world, ~4MB.
  2. ⏱️ Super-fast:

    • Rerank speed is a function of # of tokens in passages, query + model depth (layers)
    • To give an idea, Time taken by the example (in code) using the default model is below.
    • Detailed benchmarking, TBD
  3. 💸 $ conscious:

    • Lowest $ per invocation: Serverless deployments like Lambda are charged by memory & time per invocation*
    • Smaller package size = shorter cold start times, quicker re-deployments for Serverless.
  4. 🎯 Based on SoTA Cross-encoders and other models:

    • "How good are Zero-shot rerankers?" - look at the References section.
      <!-- - Below is the list of models supported as of now.
      • ms-marco-TinyBERT-L-2-v2 (default) Model card
      • ms-marco-MiniLM-L-12-v2 Model card
      • rank-T5-flan (Best non cross-encoder reranker) Model card
      • ms-marco-MultiBERT-L-12 (Multi-lingual, supports 100+ languages)
      • ce-esci-MiniLM-L12-v2 FT on Amazon ESCI dataset (This is interesting because most models are FT on MSFT MARCO Bing queries) Model card
      • rank_zephyr_7b_v1_full (4-bit-quantised GGUF) Model card (Offers very competitive performance, with large context window and relatively faster for a 4GB model).
        • Important note: Our current integration of rank_zephyr supports a max of 20 passages in one pass. The sliding window logic support is yet to be added.
      • miniReranker_arabic_v1 Model card -->

| Model Name | Description | Size | Notes |
|------------|-------------|------|-------|
| ms-marco-TinyBERT-L-2-v2 | Default model | ~4MB | Model card |
| ms-marco-MiniLM-L-12-v2 | Best Cross-encoder reranker | ~34MB | Model card |
| rank-T5-flan | Best non cross-encoder reranker | ~110MB | Model card |
| ms-marco-MultiBERT-L-12 | Multi-lingual, supports 100+ languages | ~150MB | Supported languages |
| ce-esci-MiniLM-L12-v2 | Fine-tuned on Amazon ESCI dataset | - | Model card |
| rank_zephyr_7b_v1_full | 4-bit-quantised GGUF | ~4GB | Model card |
| miniReranker_arabic_v1 | Only dedicated Arabic Reranker | - | Model card |

  • Models in roadmap:
    • InRanker
  • Why are sleeker models preferred? Reranking is the final leg of larger retrieval pipelines, so the idea is to avoid any extra overhead, especially in user-facing scenarios. To that end, models with a really small footprint that don't need any specialised hardware and yet offer competitive performance are chosen. Feel free to raise issues to add support for new models as you see fit.

Installation:

If you need lightweight pairwise rerankers [default]

`pip install flashrank`

If you need LLM based listwise rerankers

`pip install flashrank[listwise]`

Making ranking faster:

The max_length value should be large enough to accommodate your longest passage. In other words, if your longest passage (100 tokens) plus query (16 tokens) comes to an estimated 116 tokens, then setting max_length = 128 is good enough, including room for reserved tokens like [CLS] and [SEP]. Use a library like OpenAI's tiktoken to estimate token density if performance per token is critical for you. Nonchalantly giving a longer max_length like 512 for smaller passage sizes will negatively affect response time.
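The sizing rule above can be sketched as a small helper. This is only an illustration: `suggest_max_length` is a hypothetical name, not part of flashrank; the three reserved tokens assume a BERT-style `[CLS] query [SEP] passage [SEP]` layout; and rounding up to a power of two is just one convenient convention. Token counts are taken as inputs, so produce them with whichever tokenizer matches your model.

```python
from typing import Sequence


def suggest_max_length(
    query_tokens: int,
    passage_token_counts: Sequence[int],
    reserved_tokens: int = 3,  # assumed: [CLS] plus two [SEP] markers
    cap: int = 512,            # cross-encoder limit quoted in this README
) -> int:
    """Return a max_length that fits the longest query + passage pair."""
    if not passage_token_counts:
        raise ValueError("need at least one passage token count")
    needed = query_tokens + max(passage_token_counts) + reserved_tokens
    length = 1
    while length < needed:  # round up to the next power of two
        length *= 2
    return min(length, cap)


# The example from the text: 100-token longest passage + 16-token query.
print(suggest_max_length(16, [40, 72, 100]))  # 128
```

The returned value could then be passed as `Ranker(max_length=...)`.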

Getting started:

```python
from flashrank import Ranker, RerankRequest

# Nano (~4MB), blazing fast model & competitive performance (ranking precision).
ranker = Ranker(max_length=128)

# or

# Small (~34MB), slightly slower & best performance (ranking precision).
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")

# or

# Medium (~110MB), slower model with best zeroshot performance (ranking precision) on out of domain data.
ranker = Ranker(model_name="rank-T5-flan", cache_dir="/opt")

# or

# Medium (~150MB), slower model with competitive performance (ranking precision) for 100+ languages (don't use for english)
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir="/opt")

# or

ranker = Ranker(model_name="rank_zephyr_7b_v1_full", max_length=1024) # adjust max_length based on your passage length
```

```python
# Metadata is optional, Id can be your DB ids from your retrieval stage or simple numeric indices.
query = "How to speedup LLMs?"
passages = [
   {
      "id":1,
      "text":"Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
      "meta": {"additional": "info1"}
   },
   {
      "id":2,
      "text":"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
      "meta": {"additional": "info2"}
   },
   {
      "id":3,
      "text":"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
      "meta": {"additional": "info3"}
   },
   {
      "id":4,
      "text":"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
      "meta": {"additional": "info4"}
   },
   {
      "id":5,
      "text":"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
      "meta": {"additional": "info5"}
   }
]

rerankrequest = RerankRequest(query=query, passages=passages)
results = ranker.rerank(rerankrequest)
print(results)
```

```python
# Reranked output from default reranker
[
   {
      "id":4,
      "text":"Ever want to make your LLM inference go brrrrr but got stuck at implementing speculative decoding and finding the suitable draft model? No more pain! Thrilled to unveil Medusa, a simple framework that removes the annoying draft model while getting 2x speedup.",
      "meta":{
         "additional":"info4"
      },
      "score":0.016847236
   },
   {
      "id":5,
      "text":"vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is fast with: State-of-the-art serving throughput Efficient management of attention key and value memory with PagedAttention Continuous batching of incoming requests Optimized CUDA kernels",
      "meta":{
         "additional":"info5"
      },
      "score":0.011563735
   },
   {
      "id":3,
      "text":"There are many ways to increase LLM inference throughput (tokens/second) and decrease memory footprint, sometimes at the same time. Here are a few methods I’ve found effective when working with Llama 2. These methods are all well-integrated with Hugging Face. This list is far from exhaustive; some of these techniques can be used in combination with each other and there are plenty of others to try. - Bettertransformer (Optimum Library): Simply call model.to_bettertransformer() on your Hugging Face model for a modest improvement in tokens per second. - Fp4 Mixed-Precision (Bitsandbytes): Requires minimal configuration and dramatically reduces the model's memory footprint. - AutoGPTQ: Time-consuming but leads to a much smaller model and faster inference. The quantization is a one-time cost that pays off in the long run.",
      "meta":{
         "additional":"info3"
      },
      "score":0.00081340264
   },
   {
      "id":1,
      "text":"Introduce lookahead decoding: - a parallel decoding algo to accelerate LLM inference - w/o the need for a draft model or a data store - linearly decreases # decoding steps relative to log(FLOPs) used per decoding step.",
      "meta":{
         "additional":"info1"
      },
      "score":0.00063596206
   },
   {
      "id":2,
      "text":"LLM inference efficiency will be one of the most crucial topics for both industry and academia, simply because the more efficient you are, the more $$$ you will save. vllm project is a must-read for this direction, and now they have just released the paper",
      "meta":{
         "additional":"info2"
      },
      "score":0.00024851
   }
]
```
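Once reranked, the results are usually trimmed before being handed to an LLM. A minimal sketch, assuming the result shape printed above (a list of dicts carrying `id`, `text` and `score`); `build_context`, `top_k` and `min_score` are hypothetical names for illustration, not flashrank API, and the threshold used is arbitrary.

```python
from typing import Any, Dict, List


def build_context(results: List[Dict[str, Any]], top_k: int = 3, min_score: float = 0.0) -> str:
    """Join the text of the best passages into one prompt-ready string."""
    # Sort defensively; float() also normalises non-builtin numeric score types.
    ranked = sorted(results, key=lambda r: float(r["score"]), reverse=True)
    kept = [r for r in ranked[:top_k] if float(r["score"]) >= min_score]
    return "\n\n".join(f"[{r['id']}] {r['text']}" for r in kept)


# Shortened stand-ins for the reranked output shown above (texts abbreviated).
sample = [
    {"id": 4, "text": "Medusa removes the draft model.", "score": 0.016847236},
    {"id": 5, "text": "vLLM is a fast library for LLM inference.", "score": 0.011563735},
    {"id": 3, "text": "Ways to increase LLM inference throughput.", "score": 0.00081340264},
    {"id": 1, "text": "Introduce lookahead decoding.", "score": 0.00063596206},
]
context = build_context(sample, top_k=3, min_score=0.001)
print(context)
```

With these sample scores only passages 4 and 5 clear the cut-off, so the third slot stays empty rather than being filled by a weak match.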

You can use it with any search & retrieval pipeline:

  1. Lexical Search (Regular DBs that support full-text search or an inverted index)

  2. Semantic Search / RAG use cases (VectorDBs)

  3. Hybrid Search
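Whichever retrieval stage you use, its hits have to be turned into the `passages` list shown in Getting started. The sketch below merges lexical and vector hits for the hybrid case and drops duplicates. The hit shape (`doc_id`, `body`) and the function name `to_passages` are assumptions for illustration; adapt them to whatever your database returns.

```python
from typing import Any, Dict, Iterable, List


def to_passages(*hit_lists: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge hits from several retrievers into flashrank-style passages."""
    passages: List[Dict[str, Any]] = []
    seen = set()
    for source_index, hits in enumerate(hit_lists):
        for hit in hits:
            if hit["doc_id"] in seen:  # same document found by two retrievers
                continue
            seen.add(hit["doc_id"])
            passages.append({
                "id": hit["doc_id"],
                "text": hit["body"],
                "meta": {"retriever": source_index},  # optional metadata
            })
    return passages


lexical_hits = [{"doc_id": "a", "body": "vLLM serving throughput"},
                {"doc_id": "b", "body": "speculative decoding"}]
vector_hits = [{"doc_id": "b", "body": "speculative decoding"},
               {"doc_id": "c", "body": "4-bit quantization"}]

passages = to_passages(lexical_hits, vector_hits)
print([p["id"] for p in passages])  # ['a', 'b', 'c']
```

The merged list can then be passed to `RerankRequest(query=query, passages=passages)`, leaving the final ordering to the reranker.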


Deployment patterns

How to use it in an AWS Lambda function?

In AWS Lambda or other serverless environments the entire VM is read-only, so you might have to create your own custom directory. You can do so in your Dockerfile and use it for loading the models (and eventually as a cache between warm calls). Set it during init with the cache_dir parameter:

`ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")`
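A minimal sketch of how this might look inside a handler, keeping the ranker in a module-level variable so that warm invocations reuse it. The event shape (`query`, `passages`) and the helper names are assumptions for illustration; the `factory` argument exists only so the caching logic can be exercised without loading a model.

```python
from typing import Any, Callable, Dict, Optional

_ranker: Optional[Any] = None  # lives as long as the execution environment stays warm


def _load_ranker() -> Any:
    # Imported lazily so this module loads even where flashrank is absent.
    from flashrank import Ranker
    return Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")


def get_ranker(factory: Callable[[], Any] = _load_ranker) -> Any:
    """Create the ranker on the first (cold) call and reuse it afterwards."""
    global _ranker
    if _ranker is None:
        _ranker = factory()
    return _ranker


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Assumed event shape: {"query": str, "passages": [{"id": ..., "text": ...}]}
    from flashrank import RerankRequest
    request = RerankRequest(query=event["query"], passages=event["passages"])
    results = get_ranker().rerank(request)
    # Cast scores to plain floats so the response is JSON-serialisable.
    return {"results": [{**r, "score": float(r["score"])} for r in results]}
```

Loading the model outside the per-request path means only the cold start pays the model-load cost.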

References:

  1. In-domain and Zeroshot performance of Cross Encoders fine-tuned on MS-MARCO

  2. In-domain and Zeroshot performance of RankT5 fine-tuned on MS-MARCO

How to Cite?

To cite this repository in your work, please click the "Cite this repository" link on the right side (below the repo description and tags).

Papers citing flashrank

[IMPORTANT UPDATE]

~~A clone library called *SwiftRank is pointing to our model buckets; we are working on an interim solution to avoid this stealing*. Thank you for your patience and understanding.~~

This issue is resolved; the models are on HF now. Please upgrade to continue: pip install -U flashrank. Thank you for your patience and understanding.

Owner

  • Name: Prithivida
  • Login: PrithivirajDamodaran
  • Kind: user
  • Location: Bangkok

Dense, Sparse and Hybrid Embeddings for LLMs, Multimodal Modelling & Data Engineering. Check out my YouTube series on V+L (linked below).

Citation (CITATION.cff)

cff-version: 1.2.0
message: Please cite it as below.
title: FlashRank, Lightest and Fastest 2nd Stage Reranker for search pipelines.
doi: 10.5281/zenodo.10426927
date-released: 23-Dec-2023

authors:
  - family-names: Damodaran
    given-names: Prithiviraj
    affiliation: Independent Researcher

version: 1.0.0
url: https://github.com/PrithivirajDamodaran/FlashRank

GitHub Events

Total
  • Create event: 1
  • Issues event: 4
  • Release event: 1
  • Watch event: 200
  • Issue comment event: 12
  • Push event: 1
  • Pull request review event: 3
  • Pull request event: 3
  • Fork event: 13
Last Year
  • Create event: 1
  • Issues event: 4
  • Release event: 1
  • Watch event: 200
  • Issue comment event: 12
  • Push event: 1
  • Pull request review event: 3
  • Pull request event: 3
  • Fork event: 13

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 130
  • Total Committers: 7
  • Avg Commits per committer: 18.571
  • Development Distribution Score (DDS): 0.346
Past Year
  • Commits: 31
  • Committers: 5
  • Avg Commits per committer: 6.2
  • Development Distribution Score (DDS): 0.226
Top Committers
Name Email Commits
Prithivi Da p****a@P****l 85
Prithivi Da P****a 24
Prithivida d****j@g****m 17
Prabhkaran Singh p****u@g****m 1
Gustavo Pinto g****o@z****r 1
Agamdeep Singh 6****0 1
Ivan Vlasov i****v@r****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 28
  • Total pull requests: 13
  • Average time to close issues: 17 days
  • Average time to close pull requests: 27 days
  • Total issue authors: 26
  • Total pull request authors: 10
  • Average comments per issue: 1.82
  • Average comments per pull request: 1.85
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 4
  • Average time to close issues: 2 days
  • Average time to close pull requests: 12 days
  • Issue authors: 4
  • Pull request authors: 3
  • Average comments per issue: 0.75
  • Average comments per pull request: 1.25
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • andy1xx8 (2)
  • M1ha-Shvn (2)
  • chrispy-snps (1)
  • anudit (1)
  • albertopasqualetto (1)
  • vipinap98 (1)
  • JarbasAl (1)
  • lsorber (1)
  • sudhanshu746 (1)
  • alexandruvesa (1)
  • kingsene19 (1)
  • maruthiprithivi (1)
  • congdaoduy298 (1)
  • khanzzirfan (1)
  • pmontu (1)
Pull Request Authors
  • nmohr192 (6)
  • IvVlasov (2)
  • kaizer-rb (2)
  • prabhkaran (2)
  • sudhanshu746 (2)
  • gustavopintozup (2)
  • gnought (2)
  • srimouli04 (2)
  • jnash10 (2)
  • synacktraa (1)
Top Labels
Issue Labels
out-of-scope (1) invalid (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 86,518 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 0
  • Total versions: 27
  • Total maintainers: 1
pypi.org: flashrank

Ultra lite & Super fast SoTA cross-encoder based re-ranking for your search & retrieval pipelines.

  • Versions: 27
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 86,518 Last month
Rankings
Dependent packages count: 10.1%
Average: 38.7%
Dependent repos count: 67.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • numpy *
  • onnxruntime *
  • requests *
  • tokenizers *
  • tqdm *