retriv

A Python Search Engine for Humans 🥸

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary

Keywords

bm25 dense-retrieval hybrid-retrieval information-retrieval numba search search-engine search-engine-optimization semantic-search sparse-retrieval tf-idf

Last synced: 11 months ago

Repository


Basic Info

Host: GitHub
Owner: AmenRa
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 372 KB

Statistics

Stars: 227
Watchers: 10
Forks: 28
Open Issues: 22
Releases: 1

Topics

bm25 dense-retrieval hybrid-retrieval information-retrieval numba search search-engine search-engine-optimization semantic-search sparse-retrieval tf-idf

Created over 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme Changelog License Citation

🔥 News

[August 23, 2023] retriv 0.2.2 is out!
This release adds experimental support for multi-field documents and filters. Please refer to the Advanced Retriever documentation.
[February 18, 2023] retriv 0.2.0 is out!
This release adds support for Dense and Hybrid Retrieval. Dense Retrieval leverages the semantic similarity of the queries' and documents' vector representations, which can be computed directly by retriv or imported from other sources. Hybrid Retrieval mixes the results of traditional retrieval, informally called Sparse Retrieval, with those of Dense Retrieval to further improve retrieval effectiveness. As the library was almost completely rewritten, indices built with previous versions are no longer supported.

⚡️ Introduction

retriv is a user-friendly and efficient search engine implemented in Python supporting Sparse (traditional search with BM25, TF-IDF), Dense (semantic search) and Hybrid retrieval (a mix of Sparse and Dense Retrieval). It allows you to build a search engine in a single line of code.

retriv is built upon Numba for high-speed vector operations and automatic parallelization, PyTorch and Transformers for easy access and usage of Transformer-based Language Models, and Faiss for approximate nearest neighbor search. In addition, it provides automatic tuning functionalities to allow you to tune its internal components with minimal intervention.
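To make the idea of dense retrieval concrete, here is a minimal sketch of exact nearest-neighbor search by cosine similarity, the operation that approximate nearest-neighbor libraries such as Faiss speed up on large collections. It is not retriv's implementation: the function names and the three-dimensional vectors are made up for illustration, and real embeddings would come from a Transformer-based encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, similarity) pairs closest to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical embeddings, invented for this example only.
doc_vecs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.7, 0.7, 0.1],
    "doc3": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.6, 0.0]

top = nearest(query_vec, doc_vecs)
print(top)
```

This brute-force scan compares the query against every document, which is fine for a handful of vectors; an approximate index trades a little accuracy for much lower latency on large collections.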

✨ Main Features

Retrievers

Sparse Retriever: standard searcher based on lexical matching. retriv implements BM25 as its main retrieval model. TF-IDF is also supported for educational purposes. The sparse retriever comes armed with multiple stemmers, tokenizers, and stop-word lists, for multiple languages. Click here to learn more.
Dense Retriever: a dense retriever is a retrieval model that performs semantic search. Click here to learn more.
Hybrid Retriever: a hybrid retriever is a retrieval model built on top of a sparse and a dense retriever. Click here to learn more.
Advanced Retriever: an advanced sparse retriever supporting filters. This is an experimental feature. Click here to learn more.
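As a rough illustration of the lexical matching behind the Sparse Retriever, the sketch below scores documents with a common BM25 variant in plain Python. It is a teaching sketch under stated assumptions, not retriv's code: the tokenizer only lowercases and splits on whitespace (no stemming or stop-word removal), the IDF formula and the defaults `k1=1.2` and `b=0.75` are assumptions, and `bm25_rank` is a hypothetical helper name.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; no stemming or stop-word removal.
    return text.lower().split()

def bm25_rank(query, collection, k1=1.2, b=0.75):
    """Rank documents for a query with a common BM25 variant."""
    docs = {doc["id"]: tokenize(doc["text"]) for doc in collection}
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(terms) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        if score > 0:
            scores[doc_id] = score
    # Best-scoring documents first; documents with no matching term are omitted.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

collection = [
    {"id": "doc1", "text": "Generals gathered in their masses"},
    {"id": "doc2", "text": "Just like witches at black masses"},
    {"id": "doc3", "text": "Evil minds that plot destruction"},
    {"id": "doc4", "text": "Sorcerer of death's construction"},
]

ranking = bm25_rank("witches masses", collection)
print(ranking)
```

On this four-document collection the sketch ranks doc2 ahead of doc1 and drops the two documents that share no term with the query.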

Unified Search Interface

All the supported retrievers share the same search interface:
- search: standard search functionality, what you expect from a search engine.
- msearch: computes the results for multiple queries at once. It leverages automatic parallelization whenever possible.
- bsearch: similar to msearch, but automatically generates batches of queries to evaluate and allows dynamic writing of the search results to disk in JSONL format. bsearch is handy for computing results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of Neural Models for Information Retrieval.
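The batching idea behind bsearch can be sketched with the standard library alone: process queries in fixed-size batches and append each batch's results to a JSONL file, so only one batch is held in memory at a time. This is not retriv's implementation. The query format (dicts with `id` and `text`), the `batch_search_to_jsonl` helper, and the `toy_search` stand-in are all assumptions made for illustration.

```python
import json
import os
import tempfile

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def batch_search_to_jsonl(queries, search_fn, path, batch_size=2):
    """Search batch by batch, writing one JSON object per query to path."""
    with open(path, "w", encoding="utf-8") as out:
        for batch in batched(queries, batch_size):
            # Only the current batch's results are held in memory.
            batch_results = [(q["id"], search_fn(q["text"])) for q in batch]
            for query_id, results in batch_results:
                out.write(json.dumps({"id": query_id, "results": results}) + "\n")

def toy_search(text):
    # Hypothetical stand-in for a retriever: one fake document, fake score.
    return {"doc1": float(len(text.split()))}

queries = [{"id": f"q{i}", "text": "witches masses"} for i in range(5)]
path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
batch_search_to_jsonl(queries, toy_search, path)

with open(path, encoding="utf-8") as f:
    lines = [json.loads(line) for line in f]
print(len(lines), lines[0])
```

Because each line is an independent JSON object, the output file can later be streamed line by line, for example when sampling negatives for training.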

AutoTune

retriv automatically tunes the Faiss configuration for approximate nearest neighbor search by leveraging AutoFaiss to guarantee a 10 ms response time based on your available hardware. Moreover, it offers an automatic tuning functionality for BM25's parameters, which requires minimal user intervention. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one. Finally, it can automatically balance the importance of the lexical and semantic relevance scores computed by the Hybrid Retriever to maximize retrieval effectiveness.
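The idea of balancing lexical and semantic scores can be illustrated with a small sketch. It assumes min-max normalization, a weighted sum, and a plain grid search over the weight scored by reciprocal rank; retriv's own merging and tuning procedure may differ, and every name and score below is hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def fuse(sparse, dense, weight):
    """Rank documents by a weighted sum; weight is the sparse share."""
    sparse, dense = min_max(sparse), min_max(dense)
    fused = {
        doc_id: weight * sparse.get(doc_id, 0.0) + (1 - weight) * dense.get(doc_id, 0.0)
        for doc_id in set(sparse) | set(dense)
    }
    return sorted(fused, key=fused.get, reverse=True)

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def tune_weight(sparse, dense, relevant, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search: keep the first weight with the best reciprocal rank."""
    return max(candidates, key=lambda w: reciprocal_rank(fuse(sparse, dense, w), relevant))

# Made-up scores for one query whose relevant document is doc2.
sparse = {"doc1": 2.0, "doc2": 1.8, "doc3": 0.2}
dense = {"doc3": 0.9, "doc2": 0.85, "doc1": 0.1}

best = tune_weight(sparse, dense, relevant={"doc2"})
print(best, fuse(sparse, dense, best))
```

In this toy case neither retriever alone ranks doc2 first, while a mixed weight does. A real tuning run would average a metric over many judged queries rather than a single one.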

📚 Documentation

🔌 Requirements

python>=3.8

💾 Installation

```bash
pip install retriv
```

💡 Minimal Working Example

```python
# Note: SearchEngine is an alias for the SparseRetriever
from retriv import SearchEngine

collection = [
  {"id": "doc1", "text": "Generals gathered in their masses"},
  {"id": "doc2", "text": "Just like witches at black masses"},
  {"id": "doc3", "text": "Evil minds that plot destruction"},
  {"id": "doc4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index").index(collection)

se.search("witches masses")
```

Output:

```json
[
  {
    "id": "doc2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]
```

🎁 Feature Requests

Would you like to see other features implemented? Please open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please drop me an e-mail.

📄 License

retriv is open-source software licensed under the MIT license.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Bassani"
  given-names: "Elias"
  orcid: "https://orcid.org/0000-0001-7922-2578"
title: "retriv: A Python Search Engine for the Common Man"
version: 0.2.1
doi: 10.5281/zenodo.7978820
date-released: 2023-05-28
url: "https://github.com/AmenRa/retriv"

GitHub Events

Total

Issues event: 3
Watch event: 42
Issue comment event: 5
Pull request event: 4
Fork event: 5

Last Year

Issues event: 3
Watch event: 42
Issue comment event: 5
Pull request event: 4
Fork event: 5

Committers

Last synced: over 3 years ago

All Time

Total Commits: 19
Total Committers: 1
Avg Commits per committer: 19.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Elias Bassani	e**n@g**m	19

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 40
Total pull requests: 7
Average time to close issues: about 1 month
Average time to close pull requests: about 7 hours
Total issue authors: 26
Total pull request authors: 6
Average comments per issue: 2.25
Average comments per pull request: 0.29
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 6
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: about 2 hours
Issue authors: 5
Pull request authors: 2
Average comments per issue: 0.33
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

celsofranssa (10)
hockyy (2)
Ch41r05 (2)
sXperfect (1)
skykiseki (1)
amallia (1)
cnndabbler (1)
tingliu2018 (1)
AdamJSoftware (1)
regstuff (1)
alex2awesome (1)
martiansideofthemoon (1)
Alex-S-H-P (1)
jacobvsdanniel (1)
msharara1998 (1)

Pull Request Authors

mabounassif (4)
WojciechKusa (2)
juliuslipp (2)
luoyangen (2)
martiansideofthemoon (1)
alex2awesome (1)

Top Labels

Issue Labels

enhancement (10) question (1) bug (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 438 last-month

Total dependent packages: 1
Total dependent repositories: 1
Total versions: 10
Total maintainers: 1

pypi.org: retriv

retriv: A Python Search Engine for Humans.

Homepage: https://github.com/AmenRa/retriv
Documentation: https://retriv.readthedocs.io/
License: MIT License
Latest release: 0.2.3
published almost 3 years ago

Versions: 10
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 438 Last month

Rankings

Dependent packages count: 4.7%

Stargazers count: 6.3%

Downloads: 8.2%

Forks count: 9.8%

Average: 10.1%

Dependent repos count: 21.7%

Maintainers (1)

AmenRa

Last synced: 12 months ago
