fragment

Refined RAG pipeline for multi-hop QA using FAISS + Cohere (FRAGment)

https://github.com/roshanerukulla/fragment

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Refined RAG pipeline for multi-hop QA using FAISS + Cohere (FRAGment)

Basic Info
  • Host: GitHub
  • Owner: Roshanerukulla
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 0 Bytes
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

FRAGment: A Refined FlashRAG Pipeline for Multi-hop Question Answering

FRAGment is an enhanced version of the FlashRAG pipeline for multi-hop QA. It improves document retrieval, reranking, and answer generation through semantic techniques and modular upgrades, and is built with FAISS, Cohere, and a subset of HotpotQA.


Features

  • FAISS-based dense retrieval using Sentence-BERT
  • Cohere Reranker to sort documents by semantic relevance
  • Cohere Generator (command-r-plus) for high-quality answer generation
  • Improved chunking to remove noise and short irrelevant segments
  • Evaluation support with EM, F1, Precision, and Recall metrics
  • Gradio UI with conversational memory for debugging and testing
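The first stage of the pipeline above is exact dense retrieval. As a minimal sketch (with toy vectors standing in for Sentence-BERT embeddings, and a hypothetical `dense_retrieve` helper), this is the inner-product search that a FAISS `IndexFlatIP` performs; the second stage would then pass the retrieved texts to Cohere's rerank endpoint, which is omitted here since it requires an API key.

```python
import numpy as np

def dense_retrieve(query_vec, doc_vecs, k=3):
    """Exact inner-product search over embeddings.

    With unit-norm rows this is cosine similarity, which is the
    core computation behind a FAISS IndexFlatIP lookup.
    """
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:k]   # indices of the k highest scores
    return top, scores[top]

# Toy 2-D "document embeddings"; real ones come from Sentence-BERT.
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.6, 0.8],
                 [0.8, 0.6]])
query = np.array([1.0, 0.0])

top, scores = dense_retrieve(query, docs, k=2)
print(top)  # top-2 document indices by similarity to the query
```

In the full pipeline the reranker then reorders these candidates by semantic relevance to the question before generation.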

Setup Instructions

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/FRAGment.git
cd FRAGment

2. Create & activate virtual environment

python -m venv frag_env
source frag_env/bin/activate  # Linux/macOS
frag_env\Scripts\activate     # Windows

3. Install dependencies

pip install -r requirements.txt


🧾 Data: HotpotQA v1.1

FRAGment uses a subset of the HotpotQA dataset for training and evaluation.

Steps to Download & Prepare:

  1. Download the official HotpotQA dataset:

wget https://rajpurkar.github.io/files/hotpotqa/hotpot_train_v1.1.json

  2. Create the data directory and move the file:

mkdir hotpot_data
mv hotpot_train_v1.1.json hotpot_data/

  3. Chunk the dataset (use --limit for faster testing):

python scripts/chunk_doc_corpus.py --limit 15000

  4. Build the FAISS index:

python scripts/build_faiss_index.py
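The chunking step above flattens HotpotQA's nested context into clean text chunks and drops short, noisy segments (the README's "improved chunking"). A hedged sketch of that idea, using HotpotQA's actual `[title, [sentence, ...]]` context layout but a hypothetical `chunk_contexts` helper and threshold:

```python
MIN_CHARS = 40  # illustrative cutoff for discarding short, noisy segments

def chunk_contexts(example, min_chars=MIN_CHARS):
    """Turn one HotpotQA example's context into filtered text chunks.

    HotpotQA stores context as a list of [title, list-of-sentences]
    pairs; each pair is joined into one chunk, and chunks shorter
    than min_chars are dropped as likely noise.
    """
    chunks = []
    for title, sentences in example["context"]:
        text = " ".join(s.strip() for s in sentences)
        if len(text) >= min_chars:
            chunks.append({"title": title, "text": text})
    return chunks

example = {"context": [
    ["Paris", ["Paris is the capital of France.",
               " It hosted the 1900 and 1924 Summer Olympics."]],
    ["Stub", ["Too short."]],
]}
print(chunk_contexts(example))  # keeps "Paris", drops the stub
```

The surviving chunks are what get embedded and written into the FAISS index in step 4.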


Evaluation

Run evaluation script after generating answers:

python scripts/eval_results.py
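The EM and F1 metrics reported by the evaluation script follow the standard SQuAD-style definitions: answers are normalized (lowercased, articles and punctuation stripped) before comparison. A self-contained sketch of those two metrics (the normalization details in the actual script may differ):

```python
import re
import string
from collections import Counter

def normalize(s):
    """SQuAD-style normalization: lowercase, drop articles and punctuation."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = "".join(ch for ch in s if ch not in string.punctuation)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-overlap F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(f1_score("tower in Paris", "Eiffel tower"))       # 0.4
```

Precision and recall here are the per-answer token-level quantities; the script aggregates them over the evaluation set.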



🎮 Run the Web App

python -m scripts.web_app


Acknowledgements

  • FlashRAG base: https://github.com/thunlp/FlashRAG
  • HotpotQA dataset: https://hotpotqa.github.io
  • FAISS: https://github.com/facebookresearch/faiss
  • Cohere: https://cohere.com

Owner

  • Login: Roshanerukulla
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
date-released: 2024-05
message: "If you use this software, please cite it as below."
authors:
- family-names: "Jin"
  given-names: "Jiajie"
- family-names: "Zhu"
  given-names: "Yutao"
- family-names: "Yang"
  given-names: "Xinyu"
- family-names: "Zhang"
  given-names: "Chenghao"
- family-names: "Dou"
  given-names: "Zhicheng"
- family-names: "Wen"
  given-names: "Ji-Rong"
title: "FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research"
url: "https://arxiv.org/abs/2405.13576"
preferred-citation:
  type: article
  authors:
    - family-names: "Jin"
      given-names: "Jiajie"
    - family-names: "Zhu"
      given-names: "Yutao"
    - family-names: "Yang"
      given-names: "Xinyu"
    - family-names: "Zhang"
      given-names: "Chenghao"
    - family-names: "Dou"
      given-names: "Zhicheng"
    - family-names: "Wen"
      given-names: "Ji-Rong"
  title: "FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research"
  journal: "CoRR"
  volume: "abs/2405.13576"
  year: 2024
  url: "https://arxiv.org/abs/2405.13576"
  eprinttype: "arXiv"
  eprint: "2405.13576"

GitHub Events

Total
  • Push event: 4
  • Create event: 3
Last Year
  • Push event: 4
  • Create event: 3

Dependencies

requirements.txt pypi
  • GitPython ==3.1.44
  • Jinja2 ==3.1.3
  • MarkupSafe ==2.1.5
  • PyStemmer ==2.2.0.3
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • accelerate ==1.5.2
  • aiofiles ==23.2.1
  • aiohappyeyeballs ==2.6.1
  • aiohttp ==3.11.14
  • aiosignal ==1.3.2
  • altair ==5.5.0
  • annotated-types ==0.7.0
  • anyio ==4.9.0
  • attrs ==25.3.0
  • base58 ==2.1.1
  • blinker ==1.9.0
  • blis ==1.2.0
  • bm25s ==0.2.0
  • cachetools ==5.5.2
  • catalogue ==2.0.10
  • certifi ==2025.1.31
  • charset-normalizer ==3.4.1
  • chonkie ==0.5.1
  • click ==8.1.8
  • cloudpathlib ==0.21.0
  • cohere ==5.14.0
  • colorama ==0.4.6
  • confection ==0.1.5
  • cymem ==2.0.11
  • datasets ==3.5.0
  • dill ==0.3.8
  • distro ==1.9.0
  • dotenv ==0.9.9
  • faiss-cpu ==1.10.0
  • fastapi ==0.115.12
  • fastavro ==1.10.0
  • ffmpy ==0.5.0
  • filelock ==3.13.1
  • frozenlist ==1.5.0
  • fschat ==0.2.36
  • fsspec ==2024.6.1
  • gitdb ==4.0.12
  • gradio ==5.23.2
  • gradio_client ==1.8.0
  • groovy ==0.1.2
  • h11 ==0.14.0
  • httpcore ==1.0.7
  • httpx ==0.28.1
  • httpx-sse ==0.4.0
  • huggingface-hub ==0.30.1
  • idna ==3.10
  • ijson ==3.3.0
  • jieba ==0.42.1
  • jiter ==0.9.0
  • joblib ==1.4.2
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • langcodes ==3.5.0
  • langid ==1.1.6
  • language_data ==1.3.0
  • latex2mathml ==3.77.0
  • llvmlite ==0.44.0
  • marisa-trie ==1.2.1
  • markdown-it-py ==3.0.0
  • markdown2 ==2.5.3
  • mdurl ==0.1.2
  • mpmath ==1.3.0
  • multidict ==6.2.0
  • multiprocess ==0.70.16
  • murmurhash ==1.0.12
  • narwhals ==1.33.0
  • networkx ==3.3
  • nh3 ==0.2.21
  • nltk ==3.9.1
  • numba ==0.61.0
  • numpy ==1.26.4
  • openai ==1.70.0
  • orjson ==3.10.16
  • packaging ==24.2
  • pandas ==2.2.3
  • peft ==0.15.1
  • pillow ==11.0.0
  • preshed ==3.0.9
  • prompt_toolkit ==3.0.50
  • propcache ==0.3.1
  • protobuf ==5.29.4
  • psutil ==7.0.0
  • pyarrow ==19.0.1
  • pydantic ==2.11.1
  • pydantic_core ==2.33.0
  • pydeck ==0.9.1
  • pydub ==0.25.1
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.1.0
  • python-multipart ==0.0.20
  • pytz ==2025.2
  • rank-bm25 ==0.2.2
  • referencing ==0.36.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • rich ==14.0.0
  • rouge ==1.0.1
  • rouge-chinese ==1.0.3
  • rpds-py ==0.24.0
  • ruff ==0.11.2
  • safehttpx ==0.1.6
  • safetensors ==0.5.3
  • scikit-learn ==1.6.1
  • scipy ==1.15.2
  • semantic-version ==2.10.0
  • sentence-transformers ==4.0.2
  • shellingham ==1.5.4
  • shortuuid ==1.0.13
  • six ==1.17.0
  • smart-open ==7.1.0
  • smmap ==5.0.2
  • sniffio ==1.3.1
  • spacy ==3.8.4
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • srsly ==2.5.1
  • starlette ==0.46.1
  • streamlit ==1.44.0
  • svgwrite ==1.4.3
  • sympy ==1.13.1
  • tenacity ==9.0.0
  • thinc ==8.3.4
  • threadpoolctl ==3.6.0
  • tiktoken ==0.9.0
  • tokenizers ==0.21.1
  • toml ==0.10.2
  • tomlkit ==0.13.2
  • torch ==2.2.2
  • torchaudio ==2.2.2
  • torchvision ==0.17.2
  • tornado ==6.4.2
  • tqdm ==4.67.1
  • transformers ==4.50.3
  • typer ==0.15.2
  • types-requests ==2.32.0.20250328
  • typing-inspection ==0.4.0
  • typing_extensions ==4.9.0
  • tzdata ==2025.2
  • urllib3 ==2.3.0
  • uvicorn ==0.34.0
  • wasabi ==1.1.3
  • watchdog ==6.0.0
  • wavedrom ==2.0.3.post3
  • wcwidth ==0.2.13
  • weasel ==0.4.1
  • websockets ==15.0.1
  • wrapt ==1.17.2
  • xxhash ==3.5.0
  • yarl ==1.18.3