flexneuart

Flexible classic and NeurAl Retrieval Toolkit

https://github.com/oaqa/flexneuart

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

ibm-model information-retrieval neural-networks question-answering
Last synced: 6 months ago

Repository

Flexible classic and NeurAl Retrieval Toolkit

Basic Info
  • Host: GitHub
  • Owner: oaqa
  • License: apache-2.0
  • Language: Java
  • Default Branch: master
  • Homepage:
  • Size: 36.4 MB
Statistics
  • Stars: 220
  • Watchers: 11
  • Forks: 35
  • Open Issues: 7
  • Releases: 5
Topics
ibm-model information-retrieval neural-networks question-answering
Created over 9 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

PyPI version · Downloads · Join the chat at https://gitter.im/oaqa/FlexNeuART

FlexNeuART (flex-noo-art)

Flexible classic and NeurAl Retrieval Toolkit, or FlexNeuART for short (intended pronunciation: flex-noo-art), is a substantially reworked knn4qa package. An overview can be found in our EMNLP OSS workshop paper: Flexible retrieval with NMSLIB and FlexNeuART, 2020. Leonid Boytsov, Eric Nyberg.

In Aug-Dec 2020, we used this framework to generate the best traditional and/or neural runs in the MS MARCO Document Ranking task. In fact, our best traditional (non-neural) run slightly outperformed a couple of neural submissions. Please see our write-up for details: Boytsov, Leonid. "Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard." arXiv preprint arXiv:2012.08020 (2020).

In 2021, after being outstripped by a number of participants, we again advanced to a good position with the help of newly implemented models for ranking long documents. Please see our write-up for details: Boytsov, L., Lin, T., Gao, F., Zhao, Y., Huang, J., & Nyberg, E. (2022). Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding. As of this writing (October 2022), we have competitive submissions on both MS MARCO leaderboards.

Code corresponding to Neural Model 1 is not included as this may be subject to a third party patent. This model (together with its non-contextualized variant) is described and evaluated in our ECIR 2021 paper: Boytsov, Leonid, and Zico Kolter. "Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits." ECIR 2021.

In terms of pure effectiveness on long documents, other models (CEDR & PARADE) seem to perform equally well (or somewhat better). They are available in our codebase. We are not aware of any patents inhibiting the use of the traditional (non-neural) Model 1.

Objectives

Develop & maintain a (relatively) light-weight modular middleware useful primarily for:
  • Research
  • Education
  • Evaluation & leaderboarding

Main features

  • Dense, sparse, or dense-sparse retrieval using Lucene and NMSLIB (dense embeddings can be created using any Sentence BERT model).
  • Multi-field multi-level forward indices (+parent-child field relations) that can store parsed and "raw" text input as well as sparse and dense vectors.
  • Forward indices can be created in append-only mode, which requires much less RAM.
  • Pluggable generic rankers (via a server)
  • SOTA neural models (PARADE, BERT FirstP/MaxP/Sum, Longformer, ColBERT (re-ranking), dot-product Sentence BERT models) and non-neural models (multi-field BM25, IBM Model 1).
  • Multi-GPU training and inference with out-of-the box support for ensembling
  • Basic experimentation framework (+LETOR)
  • Python API to use retrievers and rankers as well as to access indexed data.
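To make the non-neural side of the feature list concrete: BM25 is the classic probabilistic ranking function mentioned above. The sketch below is a minimal single-field BM25 scorer written from the standard formula, not FlexNeuART's actual implementation; the function name and parameter defaults (k1=1.2, b=0.75) are common conventions, not taken from this codebase.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query using the classic BM25 formula."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0 or term not in doc_freq:
            continue
        # Smoothed IDF (log1p form keeps the weight non-negative)
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # Saturating TF component with document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```

A multi-field variant, as supported by the toolkit, would compute such a score per field and combine the field scores with learned or hand-tuned weights.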

Documentation

We support a number of neural BERT-based ranking models as well as strong traditional ranking models, including IBM Model 1 (a description of the non-neural rankers is to follow).
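IBM Model 1, referenced above, is a lexical translation model: it learns the probability t(q|d) that a document word d "translates" into a query word q, via expectation-maximization over aligned text pairs. The following is a textbook EM sketch of that training loop for illustration only; it is not the toolkit's implementation, and the function name is invented here.

```python
from collections import defaultdict

def train_model1(pairs, iters=10):
    """EM training of IBM Model 1 probabilities t[(target_word, source_word)].

    `pairs` is a list of (source_words, target_words) tuples, e.g. documents
    paired with the queries/questions they answer.
    """
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Uniform initialization over the target vocabulary
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iters):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per source word
        # E-step: distribute each target word's mass over candidate source words
        for src, tgt in pairs:
            for tw in tgt:
                z = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    c = t[(tw, sw)] / z
                    count[(tw, sw)] += c
                    total[sw] += c
        # M-step: renormalize counts into probabilities
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t
```

On the classic toy corpus of two sentence pairs, the model quickly concentrates probability mass on the consistent word pairings.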

The framework supports data in a generic JSONL format. We provide conversion (and in some cases download) scripts for the following collections:
  • Configurable dataset processing of standard datasets provided by ir-datasets
  • MS MARCO v1 and v2 (documents and passages)
  • Wikipedia DPR (Natural Questions, Trivia QA, SQuAD)
  • Yahoo Answers
  • Cranfield (a small toy collection)
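JSONL simply means one JSON object per line, which makes collections easy to stream and convert. The sketch below shows the general pattern; the field names ("id", "title", "text") are illustrative placeholders, not FlexNeuART's actual schema.

```python
import json

# Illustrative document records; field names are placeholders.
docs = [
    {"id": "doc1", "title": "BM25", "text": "a classic ranking function"},
    {"id": "doc2", "title": "ColBERT", "text": "a late-interaction neural ranker"},
]

# Write: one JSON object per line.
with open("collection.jsonl", "w", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")

# Read: parse each line independently (no need to load the whole file).
with open("collection.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```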

Acknowledgements

For neural network training, FlexNeuART incorporates a substantially reworked variant of CEDR (MacAvaney et al., 2019).

Owner

  • Name: Open Advancement of Question Answering Systems
  • Login: oaqa
  • Kind: organization

Citation (CITATION.cff)

@inproceedings{boytsov2020flexible,
  title={Flexible retrieval with NMSLIB and FlexNeuART},
  author={Boytsov, Leonid and Nyberg, Eric},
  booktitle={Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)},
  pages={32--43},
  year={2020}
}

GitHub Events

Total
  • Release event: 1
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 10
  • Pull request event: 3
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 10
  • Pull request event: 3
  • Create event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • nkatyal (2)
  • pbraslavski (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

java/lemur-code-r2792-RankLib-trunk/pom.xml maven
  • org.apache.commons:commons-math3 3.5
  • junit:junit 4.13.1 test
java/pom.xml maven
  • args4j:args4j 2.32 compile
  • com.google.code.gson:gson 2.8.0 compile
  • com.cedarsoftware:java-util 1.12.0
  • com.googlecode.matrix-toolkits-java:mtj 1.0.2
  • commons-cli:commons-cli 1.2
  • commons-configuration:commons-configuration 1.6
  • commons-io:commons-io 2.7
  • javax.annotation:javax.annotation-api 1.3.2
  • junit:junit 4.13.1
  • net.openhft:koloboke-api-jdk6-7 0.6.7
  • net.openhft:koloboke-impl-jdk6-7 0.6.7
  • org.apache.ant:ant 1.10.11
  • org.apache.commons:commons-math3 3.2
  • org.apache.httpcomponents:httpclient 4.5.13
  • org.apache.lucene:lucene-analyzers-common 8.6.0
  • org.apache.lucene:lucene-codecs 8.6.0
  • org.apache.lucene:lucene-core 8.6.0
  • org.apache.lucene:lucene-queryparser 8.6.0
  • org.apache.thrift:libthrift 0.12.0
  • org.htmlparser:htmlparser 2.1
  • org.json:json 20160810
  • org.mapdb:mapdb 3.0.7
  • org.mongodb:bson 4.2.3
  • org.slf4j:slf4j-api 1.7.10
  • org.slf4j:slf4j-simple 1.7.10
  • umass:RankLib 2.14.fixed
requirements.txt pypi
  • beautifulsoup4 *
  • bson *
  • ir_datasets *
  • jupyter *
  • krovetzstemmer *
  • lxml *
  • numpy *
  • pandas *
  • protobuf ==3.20
  • pyjnius *
  • pytools *
  • sentence-transformers *
  • sentencepiece *
  • spacy ==2.2.3
  • thrift ==0.13.0
  • torch *
  • torchtext *
  • tqdm *
  • transformers *
  • typing-extensions *
  • ujson *
  • urllib3 *
scripts/models/ndrm/requirements_add.txt pypi
  • fasttext ==0.9.1