fragment
Refined RAG pipeline for multi-hop QA using FAISS + Cohere (FRAGment)
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary
Repository
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
FRAGment: A Refined FlashRAG Pipeline for Multi-hop Question Answering
FRAGment is an enhanced version of the FlashRAG pipeline for multi-hop question answering. It upgrades three stages of the pipeline: dense retrieval (FAISS over Sentence-BERT embeddings), reranking (Cohere Reranker), and answer generation (Cohere command-r-plus). It uses a subset of HotpotQA as its data.
Features
- FAISS-based dense retrieval using Sentence-BERT
- Cohere Reranker to sort documents by semantic relevance
- Cohere Generator (command-r-plus) for high-quality answer generation
- Improved chunking to remove noise and short irrelevant segments
- Evaluation support with EM, F1, Precision, and Recall metrics
- Gradio UI with conversational memory for debugging and testing
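The features above describe a three-stage flow: retrieve candidates, rerank them, then generate an answer. The sketch below shows that flow in plain Python. It is not FRAGment's code: `retrieve`, `rerank`, and `answer` are hypothetical names, the brute-force inner-product search stands in for a FAISS index, the word-overlap scorer stands in for the Cohere Reranker, and `generate` is a placeholder callable where a call to command-r-plus would go.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[int]:
    """Return indices of the k highest-scoring documents (brute force, no index)."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words that also occur in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def rerank(
    query: str,
    docs: List[str],
    scorer: Callable[[str, str], float] = overlap_score,
    top_n: int = 2,
) -> List[str]:
    """Reorder retrieved documents by the scorer and keep the best top_n."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:top_n]


def answer(
    query: str,
    query_vec: Vector,
    docs: List[str],
    doc_vecs: List[Vector],
    generate: Callable[[str], str],
    k: int = 3,
    top_n: int = 2,
) -> str:
    """Retrieve k candidates, rerank to top_n, and pass the prompt to the generator."""
    candidates = [docs[i] for i in retrieve(query_vec, doc_vecs, k)]
    context = rerank(query, candidates, top_n=top_n)
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return generate(prompt)


# Toy corpus with hand-made 2-d "embeddings" (illustrative values only).
docs = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2]]
query = "What is the capital of France?"
query_vec = [1.0, 0.0]

top = retrieve(query_vec, doc_vecs, k=2)
reranked = rerank(query, [docs[i] for i in top], top_n=1)
```

Keeping each stage behind a plain callable means any one of them can be swapped, for example replacing the toy scorer with a hosted reranker, without touching the other two.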
Setup Instructions
1. Clone the repository
git clone https://github.com/YOUR_USERNAME/FRAGment.git
cd FRAGment
2. Create and activate a virtual environment
python -m venv frag_env
source frag_env/bin/activate  # Linux/macOS
frag_env\Scripts\activate  # Windows
3. Install dependencies
pip install -r requirements.txt
🧾 Data: HotpotQA v1.1
FRAGment uses a subset of the HotpotQA dataset for training and evaluation.
Steps to Download & Prepare:
- Download the official HotpotQA dataset:
wget https://rajpurkar.github.io/files/hotpotqa/hotpot_train_v1.1.json
- Create the data directory and move the file into it:
mkdir hotpotdata
mv hotpot_train_v1.1.json hotpotdata/
- Chunk the dataset (pass --limit for a smaller, faster test run):
python scripts/chunkdoccorpus.py --limit 15000
- Build FAISS index:
python scripts/buildfaissindex.py
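The chunking step can be pictured with a short sketch. It assumes the standard HotpotQA record layout, in which each example carries a `context` list of `[title, sentences]` pairs. The function name `chunk_examples`, the `min_words` threshold, the deduplication, and the output fields are all hypothetical and are not taken from the repository's chunking script. The `limit` argument plays the role of the `--limit` flag, and the word-count filter shows one possible way to drop short, noisy segments.

```python
from typing import Dict, Iterable, Iterator, Optional


def chunk_examples(
    examples: Iterable[dict],
    limit: Optional[int] = None,
    min_words: int = 5,
) -> Iterator[Dict[str, str]]:
    """Yield one passage per context paragraph, skipping short or repeated ones."""
    seen = set()
    for n, example in enumerate(examples):
        if limit is not None and n >= limit:
            break  # process only the first `limit` examples
        for title, sentences in example["context"]:
            # Join the paragraph's sentences into a single passage.
            text = " ".join(s.strip() for s in sentences).strip()
            if len(text.split()) < min_words:
                continue  # too short to be a useful retrieval unit
            key = (title, text)
            if key in seen:
                continue  # the same paragraph can appear in several examples
            seen.add(key)
            yield {"title": title, "text": text}


# Two hand-written records in the HotpotQA layout (illustrative data only).
sample = [
    {
        "context": [
            ["Paris", ["Paris is the capital of France. ", "It lies on the Seine."]],
            ["Stub", ["See also."]],
        ]
    },
    {
        "context": [
            ["Paris", ["Paris is the capital of France. ", "It lies on the Seine."]],
            ["Berlin", ["Berlin is the capital and largest city of Germany."]],
        ]
    },
]

chunks = list(chunk_examples(sample))
```

With the real file, the list loaded by `json.load` would be passed in place of `sample`, and the resulting passages would be the texts that get embedded and indexed.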
Evaluation
Run the evaluation script after generating answers:
python scripts/eval_results.py
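For orientation, here is a minimal sketch of how EM, precision, recall, and F1 are commonly computed for extractive QA, assuming the usual SQuAD/HotpotQA-style answer normalization (lowercasing, stripping punctuation and articles). The function names are illustrative, and the repository's evaluation script may compute these metrics differently.

```python
import re
import string
from collections import Counter
from typing import Tuple

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def precision_recall_f1(prediction: str, gold: str) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 between prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


em = exact_match("the Eiffel Tower.", "Eiffel Tower")
p, r, f1 = precision_recall_f1("The Eiffel Tower in Paris", "Eiffel Tower")
```

In the second call the prediction contains both gold tokens plus two extra ones, so recall is perfect while precision is halved. This is why F1 is usually reported alongside EM for long generated answers.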
🎮 Run the Web App
python -m scripts.web_app
Acknowledgements
- FlashRAG base: https://github.com/thunlp/FlashRAG
- HotpotQA dataset: https://hotpotqa.github.io
- FAISS: https://github.com/facebookresearch/faiss
- Cohere: https://cohere.com
Owner
- Login: Roshanerukulla
- Kind: user
- Repositories: 1
- Profile: https://github.com/Roshanerukulla
Citation (CITATION.cff)
cff-version: 1.2.0
date-released: 2024-05
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Jin"
    given-names: "Jiajie"
  - family-names: "Zhu"
    given-names: "Yutao"
  - family-names: "Yang"
    given-names: "Xinyu"
  - family-names: "Zhang"
    given-names: "Chenghao"
  - family-names: "Dou"
    given-names: "Zhicheng"
  - family-names: "Wen"
    given-names: "Ji-Rong"
title: "FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research"
url: "https://arxiv.org/abs/2405.13576"
preferred-citation:
  type: article
  authors:
    - family-names: "Jin"
      given-names: "Jiajie"
    - family-names: "Zhu"
      given-names: "Yutao"
    - family-names: "Yang"
      given-names: "Xinyu"
    - family-names: "Zhang"
      given-names: "Chenghao"
    - family-names: "Dou"
      given-names: "Zhicheng"
    - family-names: "Wen"
      given-names: "Ji-Rong"
  title: "FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research"
  journal: "CoRR"
  volume: "abs/2405.13576"
  year: 2024
  url: "https://arxiv.org/abs/2405.13576"
  eprinttype: "arXiv"
  eprint: "2405.13576"
GitHub Events
Total
- Push event: 4
- Create event: 3
Last Year
- Push event: 4
- Create event: 3
Dependencies
- GitPython ==3.1.44
- Jinja2 ==3.1.3
- MarkupSafe ==2.1.5
- PyStemmer ==2.2.0.3
- PyYAML ==6.0.2
- Pygments ==2.19.1
- accelerate ==1.5.2
- aiofiles ==23.2.1
- aiohappyeyeballs ==2.6.1
- aiohttp ==3.11.14
- aiosignal ==1.3.2
- altair ==5.5.0
- annotated-types ==0.7.0
- anyio ==4.9.0
- attrs ==25.3.0
- base58 ==2.1.1
- blinker ==1.9.0
- blis ==1.2.0
- bm25s ==0.2.0
- cachetools ==5.5.2
- catalogue ==2.0.10
- certifi ==2025.1.31
- charset-normalizer ==3.4.1
- chonkie ==0.5.1
- click ==8.1.8
- cloudpathlib ==0.21.0
- cohere ==5.14.0
- colorama ==0.4.6
- confection ==0.1.5
- cymem ==2.0.11
- datasets ==3.5.0
- dill ==0.3.8
- distro ==1.9.0
- dotenv ==0.9.9
- faiss-cpu ==1.10.0
- fastapi ==0.115.12
- fastavro ==1.10.0
- ffmpy ==0.5.0
- filelock ==3.13.1
- frozenlist ==1.5.0
- fschat ==0.2.36
- fsspec ==2024.6.1
- gitdb ==4.0.12
- gradio ==5.23.2
- gradio_client ==1.8.0
- groovy ==0.1.2
- h11 ==0.14.0
- httpcore ==1.0.7
- httpx ==0.28.1
- httpx-sse ==0.4.0
- huggingface-hub ==0.30.1
- idna ==3.10
- ijson ==3.3.0
- jieba ==0.42.1
- jiter ==0.9.0
- joblib ==1.4.2
- jsonschema ==4.23.0
- jsonschema-specifications ==2024.10.1
- langcodes ==3.5.0
- langid ==1.1.6
- language_data ==1.3.0
- latex2mathml ==3.77.0
- llvmlite ==0.44.0
- marisa-trie ==1.2.1
- markdown-it-py ==3.0.0
- markdown2 ==2.5.3
- mdurl ==0.1.2
- mpmath ==1.3.0
- multidict ==6.2.0
- multiprocess ==0.70.16
- murmurhash ==1.0.12
- narwhals ==1.33.0
- networkx ==3.3
- nh3 ==0.2.21
- nltk ==3.9.1
- numba ==0.61.0
- numpy ==1.26.4
- openai ==1.70.0
- orjson ==3.10.16
- packaging ==24.2
- pandas ==2.2.3
- peft ==0.15.1
- pillow ==11.0.0
- preshed ==3.0.9
- prompt_toolkit ==3.0.50
- propcache ==0.3.1
- protobuf ==5.29.4
- psutil ==7.0.0
- pyarrow ==19.0.1
- pydantic ==2.11.1
- pydantic_core ==2.33.0
- pydeck ==0.9.1
- pydub ==0.25.1
- python-dateutil ==2.9.0.post0
- python-dotenv ==1.1.0
- python-multipart ==0.0.20
- pytz ==2025.2
- rank-bm25 ==0.2.2
- referencing ==0.36.2
- regex ==2024.11.6
- requests ==2.32.3
- rich ==14.0.0
- rouge ==1.0.1
- rouge-chinese ==1.0.3
- rpds-py ==0.24.0
- ruff ==0.11.2
- safehttpx ==0.1.6
- safetensors ==0.5.3
- scikit-learn ==1.6.1
- scipy ==1.15.2
- semantic-version ==2.10.0
- sentence-transformers ==4.0.2
- shellingham ==1.5.4
- shortuuid ==1.0.13
- six ==1.17.0
- smart-open ==7.1.0
- smmap ==5.0.2
- sniffio ==1.3.1
- spacy ==3.8.4
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- srsly ==2.5.1
- starlette ==0.46.1
- streamlit ==1.44.0
- svgwrite ==1.4.3
- sympy ==1.13.1
- tenacity ==9.0.0
- thinc ==8.3.4
- threadpoolctl ==3.6.0
- tiktoken ==0.9.0
- tokenizers ==0.21.1
- toml ==0.10.2
- tomlkit ==0.13.2
- torch ==2.2.2
- torchaudio ==2.2.2
- torchvision ==0.17.2
- tornado ==6.4.2
- tqdm ==4.67.1
- transformers ==4.50.3
- typer ==0.15.2
- types-requests ==2.32.0.20250328
- typing-inspection ==0.4.0
- typing_extensions ==4.9.0
- tzdata ==2025.2
- urllib3 ==2.3.0
- uvicorn ==0.34.0
- wasabi ==1.1.3
- watchdog ==6.0.0
- wavedrom ==2.0.3.post3
- wcwidth ==0.2.13
- weasel ==0.4.1
- websockets ==15.0.1
- wrapt ==1.17.2
- xxhash ==3.5.0
- yarl ==1.18.3