flexneuart

Flexible classic and NeurAl Retrieval Toolkit

https://github.com/oaqa/flexneuart

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

ibm-model information-retrieval neural-networks question-answering
Last synced: 6 months ago

Repository

Flexible classic and NeurAl Retrieval Toolkit

Basic Info
  • Host: GitHub
  • Owner: oaqa
  • License: apache-2.0
  • Language: Java
  • Default Branch: master
  • Homepage:
  • Size: 36.4 MB
Statistics
  • Stars: 220
  • Watchers: 11
  • Forks: 35
  • Open Issues: 7
  • Releases: 5
Topics
ibm-model information-retrieval neural-networks question-answering
Created over 9 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

PyPI version · Downloads · Join the chat at https://gitter.im/oaqa/FlexNeuART

FlexNeuART (flex-noo-art)

Flexible classic and NeurAl Retrieval Toolkit, or FlexNeuART for short (intended pronunciation: flex-noo-art), is a substantially reworked knn4qa package. An overview can be found in our EMNLP OSS workshop paper: Flexible retrieval with NMSLIB and FlexNeuART, 2020. Leonid Boytsov, Eric Nyberg.

In Aug-Dec 2020, we used this framework to generate the best traditional and/or neural runs in the MS MARCO Document Ranking task. In fact, our best traditional (non-neural) run slightly outperformed a couple of neural submissions. Please see our write-up for details: Boytsov, Leonid. "Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard." arXiv preprint arXiv:2012.08020 (2020).

In 2021, after being outstripped by a number of participants, we again advanced to a good position with the help of newly implemented models for ranking long documents. Please see our write-up for details: Boytsov, L., Lin, T., Gao, F., Zhao, Y., Huang, J., & Nyberg, E. (2022). Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding. As of this writing (October 2022), we have competitive submissions on both MS MARCO leaderboards.

Code corresponding to Neural Model 1 is not included as this may be subject to a third party patent. This model (together with its non-contextualized variant) is described and evaluated in our ECIR 2021 paper: Boytsov, Leonid, and Zico Kolter. "Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits." ECIR 2021.

In terms of pure effectiveness on long documents, other models (CEDR & PARADE) seem to perform equally well (or somewhat better). They are available in our codebase. We are not aware of any patents inhibiting the use of the traditional (non-neural) Model 1.

Objectives

Develop & maintain a (relatively) light-weight modular middleware useful primarily for:
  • Research
  • Education
  • Evaluation & leaderboarding

Main features

  • Dense, sparse, or dense-sparse retrieval using Lucene and NMSLIB (dense embeddings can be created using any Sentence BERT model).
  • Multi-field multi-level forward indices (+parent-child field relations) that can store parsed and "raw" text input as well as sparse and dense vectors.
  • Forward indices can be created in append-only mode, which requires much less RAM.
  • Pluggable generic rankers (via a server)
  • SOTA neural models (PARADE, BERT FirstP/MaxP/Sum, Longformer, ColBERT (re-ranking), dot-product Sentence BERT models) and non-neural models (multi-field BM25, IBM Model 1).
  • Multi-GPU training and inference with out-of-the box support for ensembling
  • Basic experimentation framework (+LETOR)
  • Python API to use retrievers and rankers as well as to access indexed data.
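To make the non-neural side of the feature list concrete: BM25 is the classic probabilistic ranking function mentioned above. The sketch below is a minimal single-field BM25 scorer written from the standard formula, not FlexNeuART's actual implementation; the function name and parameter defaults (k1=1.2, b=0.75) are common conventions, not taken from this codebase.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query using the classic BM25 formula."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0 or term not in doc_freq:
            continue
        # Smoothed IDF (log1p form keeps the weight non-negative)
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # Saturating TF component with document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```

A multi-field variant, as supported by the toolkit, would compute such a score per field and combine the field scores with learned or hand-tuned weights.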

Documentation

We support a number of neural BERT-based ranking models as well as strong traditional ranking models, including IBM Model 1 (a description of the non-neural rankers is to follow).
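IBM Model 1, referenced above, is a lexical translation model: it learns the probability t(q|d) that a document word d "translates" into a query word q, via expectation-maximization over aligned text pairs. The following is a textbook EM sketch of that training loop for illustration only; it is not the toolkit's implementation, and the function name is invented here.

```python
from collections import defaultdict

def train_model1(pairs, iters=10):
    """EM training of IBM Model 1 probabilities t[(target_word, source_word)].

    `pairs` is a list of (source_words, target_words) tuples, e.g. documents
    paired with the queries/questions they answer.
    """
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Uniform initialization over the target vocabulary
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iters):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per source word
        # E-step: distribute each target word's mass over candidate source words
        for src, tgt in pairs:
            for tw in tgt:
                z = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    c = t[(tw, sw)] / z
                    count[(tw, sw)] += c
                    total[sw] += c
        # M-step: renormalize counts into probabilities
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t
```

On the classic toy corpus of two sentence pairs, the model quickly concentrates probability mass on the consistent word pairings.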

The framework supports data in a generic JSONL format. We provide conversion (and in some cases download) scripts for the following collections:
  • Configurable dataset processing of standard datasets provided by ir-datasets
  • MS MARCO v1 and v2 (documents and passages)
  • Wikipedia DPR (Natural Questions, Trivia QA, SQuAD)
  • Yahoo Answers
  • Cranfield (a small toy collection)
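JSONL simply means one JSON object per line, which makes collections easy to stream and convert. The sketch below shows the general pattern; the field names ("id", "title", "text") are illustrative placeholders, not FlexNeuART's actual schema.

```python
import json

# Illustrative document records; field names are placeholders.
docs = [
    {"id": "doc1", "title": "BM25", "text": "a classic ranking function"},
    {"id": "doc2", "title": "ColBERT", "text": "a late-interaction neural ranker"},
]

# Write: one JSON object per line.
with open("collection.jsonl", "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")

# Read: parse each line independently (no need to load the whole file).
with open("collection.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```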

Acknowledgements

For neural network training, FlexNeuART incorporates a substantially reworked variant of CEDR (MacAvaney et al., 2019).

Owner

  • Name: Open Advancement of Question Answering Systems
  • Login: oaqa
  • Kind: organization

Citation (CITATION.cff)

@inproceedings{boytsov2020flexible,
  title={Flexible retrieval with NMSLIB and FlexNeuART},
  author={Boytsov, Leonid and Nyberg, Eric},
  booktitle={Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)},
  pages={32--43},
  year={2020}
}

GitHub Events

Total
  • Release event: 1
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 10
  • Pull request event: 3
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 10
  • Pull request event: 3
  • Create event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • nkatyal (2)
  • pbraslavski (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

java/lemur-code-r2792-RankLib-trunk/pom.xml maven
  • org.apache.commons:commons-math3 3.5
  • junit:junit 4.13.1 test
java/pom.xml maven
  • args4j:args4j 2.32 compile
  • com.google.code.gson:gson 2.8.0 compile
  • com.cedarsoftware:java-util 1.12.0
  • com.googlecode.matrix-toolkits-java:mtj 1.0.2
  • commons-cli:commons-cli 1.2
  • commons-configuration:commons-configuration 1.6
  • commons-io:commons-io 2.7
  • javax.annotation:javax.annotation-api 1.3.2
  • junit:junit 4.13.1
  • net.openhft:koloboke-api-jdk6-7 0.6.7
  • net.openhft:koloboke-impl-jdk6-7 0.6.7
  • org.apache.ant:ant 1.10.11
  • org.apache.commons:commons-math3 3.2
  • org.apache.httpcomponents:httpclient 4.5.13
  • org.apache.lucene:lucene-analyzers-common 8.6.0
  • org.apache.lucene:lucene-codecs 8.6.0
  • org.apache.lucene:lucene-core 8.6.0
  • org.apache.lucene:lucene-queryparser 8.6.0
  • org.apache.thrift:libthrift 0.12.0
  • org.htmlparser:htmlparser 2.1
  • org.json:json 20160810
  • org.mapdb:mapdb 3.0.7
  • org.mongodb:bson 4.2.3
  • org.slf4j:slf4j-api 1.7.10
  • org.slf4j:slf4j-simple 1.7.10
  • umass:RankLib 2.14.fixed
requirements.txt pypi
  • beautifulsoup4 *
  • bson *
  • ir_datasets *
  • jupyter *
  • krovetzstemmer *
  • lxml *
  • numpy *
  • pandas *
  • protobuf ==3.20
  • pyjnius *
  • pytools *
  • sentence-transformers *
  • sentencepiece *
  • spacy ==2.2.3
  • thrift ==0.13.0
  • torch *
  • torchtext *
  • tqdm *
  • transformers *
  • typing-extensions *
  • ujson *
  • urllib3 *
scripts/models/ndrm/requirements_add.txt pypi
  • fasttext ==0.9.1