fragment

Refined RAG pipeline for multi-hop QA using FAISS + Cohere (FRAGment)

https://github.com/roshanerukulla/fragment

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Refined RAG pipeline for multi-hop QA using FAISS + Cohere (FRAGment)

Basic Info
  • Host: GitHub
  • Owner: Roshanerukulla
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 0 Bytes
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

FRAGment: A Refined FlashRAG Pipeline for Multi-hop Question Answering

FRAGment is an enhanced version of the FlashRAG pipeline for multi-hop QA. It improves document retrieval, reranking, and answer generation through semantic techniques and modular upgrades, and is built with FAISS, Cohere, and a subset of HotpotQA.


Features

  • FAISS-based dense retrieval using Sentence-BERT
  • Cohere Reranker to sort documents by semantic relevance
  • Cohere Generator (command-r-plus) for high-quality answer generation
  • Improved chunking to remove noise and short irrelevant segments
  • Evaluation support with EM, F1, Precision, and Recall metrics
  • Gradio UI with conversational memory for debugging and testing
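The first stage of the pipeline above is exact dense retrieval. As a minimal sketch (with toy vectors standing in for Sentence-BERT embeddings, and a hypothetical `dense_retrieve` helper), this is the inner-product search that a FAISS `IndexFlatIP` performs; the second stage would then pass the retrieved texts to Cohere's rerank endpoint, which is omitted here since it requires an API key.

```python
import numpy as np

def dense_retrieve(query_vec, doc_vecs, k=3):
    """Exact inner-product search over embeddings.

    With unit-norm rows this is cosine similarity, which is the
    core computation behind a FAISS IndexFlatIP lookup.
    """
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:k]   # indices of the k highest scores
    return top, scores[top]

# Toy 2-D "document embeddings"; real ones come from Sentence-BERT.
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.6, 0.8],
                 [0.8, 0.6]])
query = np.array([1.0, 0.0])

top, scores = dense_retrieve(query, docs, k=2)
print(top)  # top-2 document indices by similarity to the query
```

In the full pipeline the reranker then reorders these candidates by semantic relevance to the question before generation.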

Setup Instructions

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/FRAGment.git
cd FRAGment

2. Create & activate virtual environment

python -m venv frag_env
source frag_env/bin/activate  # Linux/macOS
frag_env\Scripts\activate     # Windows

3. Install dependencies

pip install -r requirements.txt


🧾 Data: HotpotQA v1.1

FRAGment uses a subset of the HotpotQA dataset for training and evaluation.

Steps to Download & Prepare:

  1. Download the official HotpotQA dataset:

wget https://rajpurkar.github.io/files/hotpotqa/hotpot_train_v1.1.json

  2. Create the data directory and move the file:

mkdir hotpot_data
mv hotpot_train_v1.1.json hotpot_data/

  3. Chunk the dataset (use --limit for faster testing):

python scripts/chunk_doc_corpus.py --limit 15000

  4. Build the FAISS index:

python scripts/build_faiss_index.py
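The chunking step above flattens HotpotQA's nested context into clean text chunks and drops short, noisy segments (the README's "improved chunking"). A hedged sketch of that idea, using HotpotQA's actual `[title, [sentence, ...]]` context layout but a hypothetical `chunk_contexts` helper and threshold:

```python
MIN_CHARS = 40  # illustrative cutoff for discarding short, noisy segments

def chunk_contexts(example, min_chars=MIN_CHARS):
    """Turn one HotpotQA example's context into filtered text chunks.

    HotpotQA stores context as a list of [title, list-of-sentences]
    pairs; each pair is joined into one chunk, and chunks shorter
    than min_chars are dropped as likely noise.
    """
    chunks = []
    for title, sentences in example["context"]:
        text = " ".join(s.strip() for s in sentences)
        if len(text) >= min_chars:
            chunks.append({"title": title, "text": text})
    return chunks

example = {"context": [
    ["Paris", ["Paris is the capital of France.",
               " It hosted the 1900 and 1924 Summer Olympics."]],
    ["Stub", ["Too short."]],
]}
print(chunk_contexts(example))  # keeps "Paris", drops the stub
```

The surviving chunks are what get embedded and written into the FAISS index in step 4.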


Evaluation

Run evaluation script after generating answers:

python scripts/eval_results.py
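The EM and F1 metrics reported by the evaluation script follow the standard SQuAD-style definitions: answers are normalized (lowercased, articles and punctuation stripped) before comparison. A self-contained sketch of those two metrics (the normalization details in the actual script may differ):

```python
import re
import string
from collections import Counter

def normalize(s):
    """SQuAD-style normalization: lowercase, drop articles and punctuation."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = "".join(ch for ch in s if ch not in string.punctuation)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(f1_score("tower in Paris", "Eiffel tower"))       # 0.4
```

Precision and recall here are the per-answer token-level quantities; the script aggregates them over the evaluation set.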



🎮 Run the Web App

python -m scripts.web_app


Acknowledgements

  • FlashRAG base: https://github.com/thunlp/FlashRAG
  • HotpotQA dataset: https://hotpotqa.github.io
  • FAISS: https://github.com/facebookresearch/faiss
  • Cohere: https://cohere.com

Owner

  • Login: Roshanerukulla
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
date-released: 2024-05
message: "If you use this software, please cite it as below."
authors:
- family-names: "Jin"
  given-names: "Jiajie"
- family-names: "Zhu"
  given-names: "Yutao"
- family-names: "Yang"
  given-names: "Xinyu"
- family-names: "Zhang"
  given-names: "Chenghao"
- family-names: "Dou"
  given-names: "Zhicheng"
- family-names: "Wen"
  given-names: "Ji-Rong"
title: "FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research"
url: "https://arxiv.org/abs/2405.13576"
preferred-citation:
  type: article
  authors:
    - family-names: "Jin"
      given-names: "Jiajie"
    - family-names: "Zhu"
      given-names: "Yutao"
    - family-names: "Yang"
      given-names: "Xinyu"
    - family-names: "Zhang"
      given-names: "Chenghao"
    - family-names: "Dou"
      given-names: "Zhicheng"
    - family-names: "Wen"
      given-names: "Ji-Rong"
  title: "FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research"
  journal: "CoRR"
  volume: "abs/2405.13576"
  year: 2024
  url: "https://arxiv.org/abs/2405.13576"
  eprinttype: "arXiv"
  eprint: "2405.13576"

GitHub Events

Total
  • Push event: 4
  • Create event: 3
Last Year
  • Push event: 4
  • Create event: 3

Dependencies

requirements.txt pypi
  • GitPython ==3.1.44
  • Jinja2 ==3.1.3
  • MarkupSafe ==2.1.5
  • PyStemmer ==2.2.0.3
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • accelerate ==1.5.2
  • aiofiles ==23.2.1
  • aiohappyeyeballs ==2.6.1
  • aiohttp ==3.11.14
  • aiosignal ==1.3.2
  • altair ==5.5.0
  • annotated-types ==0.7.0
  • anyio ==4.9.0
  • attrs ==25.3.0
  • base58 ==2.1.1
  • blinker ==1.9.0
  • blis ==1.2.0
  • bm25s ==0.2.0
  • cachetools ==5.5.2
  • catalogue ==2.0.10
  • certifi ==2025.1.31
  • charset-normalizer ==3.4.1
  • chonkie ==0.5.1
  • click ==8.1.8
  • cloudpathlib ==0.21.0
  • cohere ==5.14.0
  • colorama ==0.4.6
  • confection ==0.1.5
  • cymem ==2.0.11
  • datasets ==3.5.0
  • dill ==0.3.8
  • distro ==1.9.0
  • dotenv ==0.9.9
  • faiss-cpu ==1.10.0
  • fastapi ==0.115.12
  • fastavro ==1.10.0
  • ffmpy ==0.5.0
  • filelock ==3.13.1
  • frozenlist ==1.5.0
  • fschat ==0.2.36
  • fsspec ==2024.6.1
  • gitdb ==4.0.12
  • gradio ==5.23.2
  • gradio_client ==1.8.0
  • groovy ==0.1.2
  • h11 ==0.14.0
  • httpcore ==1.0.7
  • httpx ==0.28.1
  • httpx-sse ==0.4.0
  • huggingface-hub ==0.30.1
  • idna ==3.10
  • ijson ==3.3.0
  • jieba ==0.42.1
  • jiter ==0.9.0
  • joblib ==1.4.2
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • langcodes ==3.5.0
  • langid ==1.1.6
  • language_data ==1.3.0
  • latex2mathml ==3.77.0
  • llvmlite ==0.44.0
  • marisa-trie ==1.2.1
  • markdown-it-py ==3.0.0
  • markdown2 ==2.5.3
  • mdurl ==0.1.2
  • mpmath ==1.3.0
  • multidict ==6.2.0
  • multiprocess ==0.70.16
  • murmurhash ==1.0.12
  • narwhals ==1.33.0
  • networkx ==3.3
  • nh3 ==0.2.21
  • nltk ==3.9.1
  • numba ==0.61.0
  • numpy ==1.26.4
  • openai ==1.70.0
  • orjson ==3.10.16
  • packaging ==24.2
  • pandas ==2.2.3
  • peft ==0.15.1
  • pillow ==11.0.0
  • preshed ==3.0.9
  • prompt_toolkit ==3.0.50
  • propcache ==0.3.1
  • protobuf ==5.29.4
  • psutil ==7.0.0
  • pyarrow ==19.0.1
  • pydantic ==2.11.1
  • pydantic_core ==2.33.0
  • pydeck ==0.9.1
  • pydub ==0.25.1
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.1.0
  • python-multipart ==0.0.20
  • pytz ==2025.2
  • rank-bm25 ==0.2.2
  • referencing ==0.36.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • rich ==14.0.0
  • rouge ==1.0.1
  • rouge-chinese ==1.0.3
  • rpds-py ==0.24.0
  • ruff ==0.11.2
  • safehttpx ==0.1.6
  • safetensors ==0.5.3
  • scikit-learn ==1.6.1
  • scipy ==1.15.2
  • semantic-version ==2.10.0
  • sentence-transformers ==4.0.2
  • shellingham ==1.5.4
  • shortuuid ==1.0.13
  • six ==1.17.0
  • smart-open ==7.1.0
  • smmap ==5.0.2
  • sniffio ==1.3.1
  • spacy ==3.8.4
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • srsly ==2.5.1
  • starlette ==0.46.1
  • streamlit ==1.44.0
  • svgwrite ==1.4.3
  • sympy ==1.13.1
  • tenacity ==9.0.0
  • thinc ==8.3.4
  • threadpoolctl ==3.6.0
  • tiktoken ==0.9.0
  • tokenizers ==0.21.1
  • toml ==0.10.2
  • tomlkit ==0.13.2
  • torch ==2.2.2
  • torchaudio ==2.2.2
  • torchvision ==0.17.2
  • tornado ==6.4.2
  • tqdm ==4.67.1
  • transformers ==4.50.3
  • typer ==0.15.2
  • types-requests ==2.32.0.20250328
  • typing-inspection ==0.4.0
  • typing_extensions ==4.9.0
  • tzdata ==2025.2
  • urllib3 ==2.3.0
  • uvicorn ==0.34.0
  • wasabi ==1.1.3
  • watchdog ==6.0.0
  • wavedrom ==2.0.3.post3
  • wcwidth ==0.2.13
  • weasel ==0.4.1
  • websockets ==15.0.1
  • wrapt ==1.17.2
  • xxhash ==3.5.0
  • yarl ==1.18.3