porag-bangla-rag
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.5%) to scientific vocabulary
Last synced: 7 months ago
Repository
Basic Info
- Host: GitHub
- Owner: hassanaliemon
- License: mit
- Language: Python
- Default Branch: main
- Size: 626 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
- Created: almost 2 years ago
- Last pushed: almost 2 years ago
Metadata Files
Readme
License
Citation
README.md
PoRAG (পরাগ), Bangla Retrieval-Augmented Generation (RAG) Pipeline

Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.
Use Cases
- Interact with your Bengali data in Bengali.
- Ask questions about your Bengali text and get answers.
How It Works
- LLM Framework: Transformers
- RAG Framework: Langchain
- Chunking: Recursive Character Split
- Vector Store: ChromaDB
- Data Ingestion: Currently supports text (.txt) files only due to the lack of reliable Bengali PDF parsing tools.
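The chunking step listed above can be sketched in plain Python. This is an illustrative approximation of recursive character splitting, not the project's actual code (which uses Langchain's splitter); the function name is hypothetical and chunk overlap is omitted for brevity:

```python
# Minimal sketch of "Recursive Character Split": try the coarsest separator
# first (paragraphs, then lines, then words) and recurse until every chunk
# fits within chunk_size. Illustrative only, not PoRAG's actual code.

def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Split on the coarsest available separator, keeping chunks <= chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for sep in separators:
        if sep in text:
            # prefer the last separator occurrence that still fits in chunk_size
            cut = text.rfind(sep, 0, chunk_size)
            if cut <= 0:
                cut = text.find(sep)
            left, right = text[:cut], text[cut + len(sep):]
            return (recursive_split(left, chunk_size, separators)
                    + recursive_split(right, chunk_size, separators))
    # no separator present at all: fall back to a hard character split
    return [text[:chunk_size]] + recursive_split(text[chunk_size:], chunk_size, separators)

sample = " ".join(["শব্দ"] * 40)          # a toy Bengali "corpus"
chunks = recursive_split(sample, chunk_size=50)
```

In the actual pipeline the equivalent knobs are the `chunk_size` and `chunk_overlap` flags described in the parameters table below.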
Configurability
- Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
- Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
- Hyperparameter Control: Adjust `max_new_tokens`, `top_p`, `top_k`, `temperature`, `chunk_size`, `chunk_overlap`, and `k`.
- Toggle Quantization Mode: Pass the `--quantization` flag to toggle between model variants, including LoRA and 4-bit quantization.
Installation
- Clone the Repository:

```bash
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```

- Install Required Libraries:

```bash
pip install -r requirements.txt
```

Click to view example `requirements.txt`:

```txt
langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
chromadb==0.5.0
accelerate==0.31.0
peft==0.11.1
transformers==4.40.1
bitsandbytes==0.41.3
sentence-transformers==3.0.1
rich==13.7.1
```

Running the Pipeline
- Prepare Your Bangla Text Corpus: Create a text file (e.g., `test.txt`) containing the Bengali text you want to use.
- Run the RAG Pipeline:

```bash
python main.py --text_path test.txt
```

- Interact with the System: Type your question and press Enter to get a response based on the retrieved information.
Example
```bash
আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।
```

(Translation: "Your question: Where is Rabindranath Tagore's birthplace?" / "Answer: Rabindranath Tagore's birthplace is the 'Thakurbari' at Jorasanko, Kolkata.")
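Under the hood, answering a question like the one above starts with a retrieval step: rank the stored chunk embeddings by similarity to the question embedding and keep the top `k`. Here is a minimal sketch with toy 3-dimensional vectors; the real pipeline delegates this to ChromaDB over 768-dimensional Sentence Transformers embeddings, and the `cosine` and `retrieve` helpers below are hypothetical names, not the project's API:

```python
import math

# Toy illustration of top-k retrieval by cosine similarity. The tiny 3-d
# vectors stand in for real 768-d sentence embeddings.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=4):
    """store: list of (chunk_text, embedding) pairs; returns k best chunks."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

store = [
    ("chunk about Tagore's birthplace", [0.9, 0.1, 0.0]),
    ("chunk about rivers",              [0.0, 1.0, 0.2]),
    ("chunk about Tagore's poetry",     [0.8, 0.3, 0.1]),
]
top = retrieve([1.0, 0.0, 0.0], store, k=2)
# -> ["chunk about Tagore's birthplace", "chunk about Tagore's poetry"]
```

The retrieved chunks are then placed into the LLM's prompt, which is what makes the generation "retrieval-augmented".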
Parameters Description
You can pass these arguments and adjust their values on each run.
| Flag Name | Type | Description | Instructions |
|---|---|---|---|
| `chat_model` | str | The ID of the chat model: either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide a local path. Default: `"hassanaliemon/bn_rag_llama3-8b"`. |
| `embed_model` | str | The ID of the embedding model: either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide a local path. Default: `"l3cube-pune/bengali-sentence-similarity-sbert"`. |
| `k` | int | The number of documents to retrieve. | Default: 4. |
| `top_k` | int | The top_k sampling parameter for the chat model. | Default: 2. |
| `top_p` | float | The top_p sampling parameter for the chat model. | Default: 0.6. |
| `temperature` | float | The sampling temperature for the chat model. | Default: 0.6. |
| `max_new_tokens` | int | The maximum number of new tokens to generate. | Default: 256. |
| `chunk_size` | int | The chunk size for text splitting. | Default: 500. |
| `chunk_overlap` | int | The chunk overlap for text splitting. | Default: 150. |
| `text_path` | str | The path to the input text (.txt) file. | Required. Provide the path to the text file you want to use. |
| `show_context` | bool | Whether to show the retrieved context. | Pass the `--show_context` flag to enable. |
| `quantization` | bool | Whether to enable 4-bit quantization. | Pass the `--quantization` flag to enable. |
| `hf_token` | str | Your Hugging Face API token. | Default: None. Provide your token if necessary. |
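As a sketch, the flag table above maps onto an `argparse` parser like the following. The defaults are taken from the table, but the actual parser in `main.py` may be organized differently:

```python
import argparse

# Hypothetical reconstruction of PoRAG's CLI from its documented flags;
# defaults mirror the parameters table, not the project's source.

def build_parser():
    p = argparse.ArgumentParser(description="Bangla RAG pipeline (PoRAG)")
    p.add_argument("--chat_model", type=str, default="hassanaliemon/bn_rag_llama3-8b")
    p.add_argument("--embed_model", type=str,
                   default="l3cube-pune/bengali-sentence-similarity-sbert")
    p.add_argument("--k", type=int, default=4)                # documents to retrieve
    p.add_argument("--top_k", type=int, default=2)
    p.add_argument("--top_p", type=float, default=0.6)
    p.add_argument("--temperature", type=float, default=0.6)
    p.add_argument("--max_new_tokens", type=int, default=256)
    p.add_argument("--chunk_size", type=int, default=500)
    p.add_argument("--chunk_overlap", type=int, default=150)
    p.add_argument("--text_path", type=str, required=True)    # the only required flag
    p.add_argument("--show_context", action="store_true")
    p.add_argument("--quantization", action="store_true")
    p.add_argument("--hf_token", type=str, default=None)
    return p

args = build_parser().parse_args(["--text_path", "test.txt", "--quantization"])
# args.chunk_size == 500, args.quantization is True
```

Boolean flags (`--show_context`, `--quantization`) use `action="store_true"`, so they are off unless passed, matching the "use the flag to enable" instructions.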
Key Milestones
- Default LLM: Trained a LLaMA-3 8B model, `hassanaliemon/bn_rag_llama3-8b`, for context-based QA.
- Embedding Model: Tested `sagorsarker/bangla-bert-base` and `csebuetnlp/banglabert`, and found `l3cube-pune/bengali-sentence-similarity-sbert` to be the most effective.
- Retrieval Pipeline: Implemented the Langchain retrieval pipeline and tested it with our fine-tuned LLM and embedding model.
- Ingestion System: Settled on text files after testing several PDF parsing solutions.
- Question Answering Chat Loop: Developed a multi-turn chat system for terminal testing.
- Generation Configuration Control: Attempted to use a generation config in the LLM pipeline.
- Model Testing: Tested with the following models (quantized and LoRA versions):
Limitations
- PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
- Quality of Answers: The quality of answers depends heavily on the quality of your chosen LLM, embedding model, and Bengali text corpus.
- Scarcity of Pre-trained Models: As of now, no high-fidelity Bengali LLMs pre-trained for QA tasks are available, which makes it difficult to achieve impressive RAG performance. Overall performance may vary depending on the model used.
Future Steps
- PDF Parsing: Develop a reliable Bengali-specific PDF parser.
- User Interface: Design a chat-like UI for easier interaction.
- Chat History Management: Implement a system for maintaining and accessing chat history.
Contribution and Feedback
We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
Disclaimer
This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.
Owner
- Login: hassanaliemon
- Kind: user
- Repositories: 2
- Profile: https://github.com/hassanaliemon
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
title: Porag (Bangla RAG pipeline)
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Abdullah
    given-names: Al Asif
    email: asif.inc.bd@gmail.com
    orcid: 'https://orcid.org/0009-0009-6166-9975'
  - given-names: Hasan
    family-names: Al Emon
    email: hassanzahin@gmail.com
identifiers:
  - type: url
    value: 'https://github.com/Bangla-RAG/PoRAG'
repository-code: 'https://github.com/Bangla-RAG/PoRAG'
url: 'https://github.com/Bangla-RAG'
abstract: >-
  Fully Configurable RAG Pipeline for Bengali Language RAG
  Applications. Supports both local and huggingface models,
  built with ChromaDB and Langchain.
keywords:
  - bengali_nlp
  - bangla_nlp
  - nlp
  - Bengaliai
  - rag
  - bangallm
  - langchain
  - chromadb
  - llama
license: MIT
```
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Dependencies
requirements.txt
pypi
- accelerate *
- bitsandbytes ==0.41.3
- chromadb *
- langchain *
- langchain-community *
- langchain-core *
- peft *
- rich *
- sentence-transformers *
- transformers *