Science Score: 54.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.6%) to scientific vocabulary
Keywords
Repository
Flexible classic and NeurAl Retrieval Toolkit
Basic Info
Statistics
- Stars: 220
- Watchers: 11
- Forks: 35
- Open Issues: 7
- Releases: 5
Topics
Metadata Files
README.md
FlexNeuART (flex-noo-art)
Flexible classic and NeurAl Retrieval Toolkit, or FlexNeuART for short (intended pronunciation: flex-noo-art),
is a substantially reworked knn4qa package. An overview can be found in our EMNLP OSS workshop paper:
Flexible retrieval with NMSLIB and FlexNeuART, 2020. Leonid Boytsov, Eric Nyberg.
In Aug-Dec 2020, we used this framework to generate the best traditional and/or neural runs in the MS MARCO Document ranking task. In fact, our best traditional (non-neural) run slightly outperformed a couple of neural submissions. Please see our write-up for details: Boytsov, Leonid. "Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard." arXiv preprint arXiv:2012.08020 (2020).
In 2021, after being outstripped by a number of participants, we again advanced to a good position with the help of newly implemented models for ranking long documents. Please see our write-up for details: Boytsov, L., Lin, T., Gao, F., Zhao, Y., Huang, J., & Nyberg, E. (2022). Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding. As of this writing (October 2022), we have competitive submissions on both MS MARCO leaderboards.
Code corresponding to Neural Model 1 is not included as this may be subject to a third party patent. This model (together with its non-contextualized variant) is described and evaluated in our ECIR 2021 paper: Boytsov, Leonid, and Zico Kolter. "Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits." ECIR 2021.
In terms of pure effectiveness on long documents, other models (CEDR & PARADE) seem to perform equally well (or somewhat better). They are available in our codebase. We are not aware of any patents inhibiting the use of the traditional (non-neural) Model 1.
Objectives
Develop & maintain a (relatively) lightweight modular middleware useful primarily for:
- Research
- Education
- Evaluation & leaderboarding
Main features
- Dense, sparse, or dense-sparse retrieval using Lucene and NMSLIB (dense embeddings can be created using any Sentence BERT model).
- Multi-field multi-level forward indices (+parent-child field relations) that can store parsed and "raw" text input as well as sparse and dense vectors.
- Forward indices can be created in append-only mode, which requires much less RAM.
- Pluggable generic rankers (via a server)
- SOTA neural (PARADE, BERT FirstP/MaxP/Sum, Longformer, ColBERT (re-ranking), dot-product Sentence BERT models) and non-neural models (multi-field BM25, IBM Model 1).
- Multi-GPU training and inference with out-of-the-box support for ensembling
- Basic experimentation framework (+LETOR)
- Python API to use retrievers and rankers as well as to access indexed data.
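Among the non-neural models listed above, BM25 is the classic lexical baseline. As a rough illustration of what such a scorer computes, here is a minimal single-field BM25 implementation in plain Python; this is a textbook sketch, not FlexNeuART's (Lucene-backed, multi-field) implementation.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    with classic BM25 (illustrative sketch only)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A multi-field variant, as supported by the toolkit, would compute such scores per field (e.g., title and body) and combine them with learned or hand-tuned weights.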
Documentation
- Usage notebooks covering installation & most functionality (including experimentation and Python API demo)
- Legacy notebooks for MS MARCO and Yahoo Answers
- Former life (as a knn4qa package), including acknowledgements and publications
We support a number of neural BERT-based ranking models as well as strong traditional ranking models including IBM Model 1 (description of non-neural rankers to follow).
The framework supports data in a generic JSONL format. We provide conversion (and in some cases download) scripts for the following collections:
- Configurable dataset processing of standard datasets provided by ir-datasets
- MS MARCO v1 and v2 (documents and passages)
- Wikipedia DPR (Natural Questions, Trivia QA, SQuAD)
- Yahoo Answers
- Cranfield (a small toy collection)
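JSONL simply means one JSON object per line, which makes large collections easy to stream and convert. A minimal sketch of writing and reading such records follows; the field names used here (`DOCNO`, `text`) are an assumption for illustration, not necessarily FlexNeuART's exact schema.

```python
import json

# Hypothetical document records; field names are illustrative only.
records = [
    {"DOCNO": "doc0", "text": "flexible neural retrieval toolkit"},
    {"DOCNO": "doc1", "text": "classic bm25 ranking"},
]

# Serialize: one compact JSON object per line.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Parse back: each line is an independent JSON document,
# so a converter can process the file line by line.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```

Because each line parses independently, conversion scripts can process arbitrarily large collections without loading them fully into memory.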
Acknowledgements
For neural network training, FlexNeuART incorporates a substantially reworked variant of CEDR (MacAvaney et al., 2019).
Owner
- Name: Open Advancement of Question Answering Systems
- Login: oaqa
- Kind: organization
- Repositories: 45
- Profile: https://github.com/oaqa
Citation (CITATION.cff)
@inproceedings{boytsov2020flexible,
title={Flexible retrieval with NMSLIB and FlexNeuART},
author={Boytsov, Leonid and Nyberg, Eric},
booktitle={Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)},
pages={32--43},
year={2020}
}
GitHub Events
Total
- Release event: 1
- Watch event: 4
- Issue comment event: 1
- Push event: 10
- Pull request event: 3
- Create event: 1
Last Year
- Release event: 1
- Watch event: 4
- Issue comment event: 1
- Push event: 10
- Pull request event: 3
- Create event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- nkatyal (2)
- pbraslavski (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- org.apache.commons:commons-math3 3.5
- junit:junit 4.13.1 test
- args4j:args4j 2.32 compile
- com.google.code.gson:gson 2.8.0 compile
- com.cedarsoftware:java-util 1.12.0
- com.googlecode.matrix-toolkits-java:mtj 1.0.2
- commons-cli:commons-cli 1.2
- commons-configuration:commons-configuration 1.6
- commons-io:commons-io 2.7
- javax.annotation:javax.annotation-api 1.3.2
- junit:junit 4.13.1
- net.openhft:koloboke-api-jdk6-7 0.6.7
- net.openhft:koloboke-impl-jdk6-7 0.6.7
- org.apache.ant:ant 1.10.11
- org.apache.commons:commons-math3 3.2
- org.apache.httpcomponents:httpclient 4.5.13
- org.apache.lucene:lucene-analyzers-common 8.6.0
- org.apache.lucene:lucene-codecs 8.6.0
- org.apache.lucene:lucene-core 8.6.0
- org.apache.lucene:lucene-queryparser 8.6.0
- org.apache.thrift:libthrift 0.12.0
- org.htmlparser:htmlparser 2.1
- org.json:json 20160810
- org.mapdb:mapdb 3.0.7
- org.mongodb:bson 4.2.3
- org.slf4j:slf4j-api 1.7.10
- org.slf4j:slf4j-simple 1.7.10
- umass:RankLib 2.14.fixed
- beautifulsoup4 *
- bson *
- ir_datasets *
- jupyter *
- krovetzstemmer *
- lxml *
- numpy *
- pandas *
- protobuf ==3.20
- pyjnius *
- pytools *
- sentence-transformers *
- sentencepiece *
- spacy ==2.2.3
- thrift ==0.13.0
- torch *
- torchtext *
- tqdm *
- transformers *
- typing-extensions *
- ujson *
- urllib3 *
- fasttext ==0.9.1