porag
Fully Configurable RAG Pipeline for Bengali Language RAG Applications. Supports both Local and Huggingface Models, Built with Langchain.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.3%) to scientific vocabulary
Repository
Statistics
- Stars: 46
- Watchers: 2
- Forks: 13
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PoRAG (পরাগ), Bangla Retrieval-Augmented Generation (RAG) Pipeline

Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.
Use Cases
- Interact with your Bengali data in Bengali.
- Ask questions about your Bengali text and get answers.
How It Works
- LLM Framework: Transformers
- RAG Framework: Langchain
- Chunking: Recursive Character Split
- Vector Store: ChromaDB
- Data Ingestion: Currently supports text (.txt) files only due to the lack of reliable Bengali PDF parsing tools.
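The chunking step can be illustrated with a minimal sketch. It uses a fixed-size sliding window with overlap, which is a deliberate simplification of recursive character splitting (the real splitter also prefers to break on separators such as paragraph and line breaks). The function name `split_with_overlap` is hypothetical; the defaults mirror the `chunk_size` and `chunk_overlap` values this README documents.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a
    simplified stand-in for recursive character splitting, not the
    splitter the pipeline itself uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "ক" * 1200  # placeholder corpus of 1200 Bengali characters
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # prints [500, 500, 500]
```

Python slices strings by code point, so a window boundary can fall between a Bengali base character and its combining vowel sign. Splitting on separators, as a recursive splitter does, makes that less likely.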
Configurability
- Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
- Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
- Hyperparameter Control: Adjust `max_new_tokens`, `top_p`, `top_k`, `temperature`, `chunk_size`, `chunk_overlap`, and `k`.
- Toggle Quantization Mode: Pass the `--quantization` argument to switch between model variants, including LoRA and 4-bit quantized versions.
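To show how `temperature`, `top_k`, and `top_p` shape generation, here is a rough, library-free sketch of the usual filtering order: temperature scaling, then top-k, then top-p (nucleus) filtering. The function name and the toy logits are assumptions for illustration; actual library implementations differ in details such as tie handling.

```python
import math


def filter_next_token_probs(logits, temperature=0.6, top_k=2, top_p=0.6):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # Top-k: keep the k most probable tokens, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(tok, p / mass) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they form a distribution to sample from.
    mass = sum(p for _, p in kept)
    return {tok: p / mass for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.5, "d": -1.0}  # hypothetical scores
strict = filter_next_token_probs(toy_logits)  # README defaults
loose = filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.95)
print(sorted(strict), sorted(loose))
```

With the documented defaults (`top_k=2`, `top_p=0.6`, `temperature=0.6`), only the single most likely token survives in this toy case, so sampling is close to greedy decoding. Raising the values widens the candidate pool.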
Installation
- Install Python: Download and install Python from python.org.
- Clone the Repository:
```bash
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```
- Install Required Libraries:
```bash
pip install -r requirements.txt
```
Example `requirements.txt`:
```txt
langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
chromadb==0.5.0
accelerate==0.31.0
peft==0.11.1
transformers==4.40.1
bitsandbytes==0.41.3
sentence-transformers==3.0.1
rich==13.7.1
```
Running the Pipeline
- Prepare Your Bangla Text Corpus: Create a text file (e.g., `test.txt`) with the Bengali text you want to use.
- Run the RAG Pipeline:
```bash
python main.py --text_path test.txt
```
- Interact with the System: Type your question and press Enter to get a response based on the retrieved information.
Example
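আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়? (Your question: Where is Rabindranath Tagore's birthplace?)
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে। (Answer: Rabindranath Tagore's birthplace is the 'Thakurbari' at Jorasanko, Kolkata.)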
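Behind an answer like this, the retriever picks the `k` chunks whose embeddings are closest to the question's embedding. The sketch below shows that idea with cosine similarity over toy 3-dimensional vectors. The function names and vectors are hypothetical, real embeddings from the configured model would be 768-dimensional, and the vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, k=4):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for four text chunks (hypothetical values).
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [0.8, 0.6, 0.0]

top = retrieve_top_k(query_vector, chunk_vectors, k=2)
print(top)  # prints [1, 0]
```

The text of the selected chunks is then passed to the LLM as context for the answer. The `--show_context` flag prints that retrieved context.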
Parameters description
You can pass these arguments and adjust their values on each run.
| Flag Name | Type | Description | Instructions |
|---|---|---|---|
| `chat_model` | str | The ID of the chat model. It can be either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide the local path to the model. The default value is "hassanaliemon/bn_rag_llama3-8b". |
| `embed_model` | str | The ID of the embedding model. It can be either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide the local path to the model. The default value is "l3cube-pune/bengali-sentence-similarity-sbert". |
| `k` | int | The number of documents to retrieve. | The default value is 4. |
| `top_k` | int | The top_k parameter for the chat model. | The default value is 2. |
| `top_p` | float | The top_p parameter for the chat model. | The default value is 0.6. |
| `temperature` | float | The temperature parameter for the chat model. | The default value is 0.6. |
| `max_new_tokens` | int | The maximum number of new tokens to generate. | The default value is 256. |
| `chunk_size` | int | The chunk size for text splitting. | The default value is 500. |
| `chunk_overlap` | int | The chunk overlap for text splitting. | The default value is 150. |
| `text_path` | str | The path to the text (.txt) file. | This is a required field. Provide the path to the text file you want to use. |
| `show_context` | bool | Whether to show the retrieved context or not. | Use the `--show_context` flag to enable this feature. |
| `quantization` | bool | Whether to enable 4-bit quantization or not. | Use the `--quantization` flag to enable this feature. |
| `hf_token` | str | Your Hugging Face API token. | The default value is None. Provide your Hugging Face API token if necessary. |
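The table above maps naturally onto an `argparse` parser. The following is a reconstruction from the documented flags and defaults, not the repository's actual `main.py`; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags and defaults documented in the table."""
    parser = argparse.ArgumentParser(description="Bangla RAG pipeline (illustrative parser)")
    parser.add_argument("--chat_model", type=str, default="hassanaliemon/bn_rag_llama3-8b")
    parser.add_argument("--embed_model", type=str,
                        default="l3cube-pune/bengali-sentence-similarity-sbert")
    parser.add_argument("--k", type=int, default=4, help="documents to retrieve")
    parser.add_argument("--top_k", type=int, default=2)
    parser.add_argument("--top_p", type=float, default=0.6)
    parser.add_argument("--temperature", type=float, default=0.6)
    parser.add_argument("--max_new_tokens", type=int, default=256)
    parser.add_argument("--chunk_size", type=int, default=500)
    parser.add_argument("--chunk_overlap", type=int, default=150)
    parser.add_argument("--text_path", type=str, required=True, help="path to the .txt corpus")
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--show_context", action="store_true")
    parser.add_argument("--quantization", action="store_true")
    parser.add_argument("--hf_token", type=str, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--text_path", "test.txt", "--k", "3", "--show_context"])
print(args.k, args.chunk_size, args.show_context, args.quantization)  # prints 3 500 True False
```

On the command line, the same settings would be passed as `python main.py --text_path test.txt --k 3 --show_context`.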
Key Milestones
- Default LLM: Trained a LLaMA-3 8B model, `hassanaliemon/bn_rag_llama3-8b`, for context-based QA.
- Embedding Model: Tested `sagorsarker/bangla-bert-base` and `csebuetnlp/banglabert`, and found `l3cube-pune/bengali-sentence-similarity-sbert` to be the most effective.
- Retrieval Pipeline: Implemented a Langchain retrieval pipeline and tested it with our fine-tuned LLM and embedding model.
- Ingestion System: Settled on text files after testing several PDF parsing solutions.
- Question Answering Chat Loop: Developed a multi-turn chat system for terminal testing.
- Generation Configuration Control: Attempted to use generation config in the LLM pipeline.
- Model Testing: Tested with the following models (quantized and LoRA versions):
Limitations
- PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
- Quality of answers: Answer quality depends heavily on the quality of your chosen LLM, embedding model, and Bengali text corpus.
- Scarcity of Pre-trained models: As of now, no high-fidelity pre-trained Bengali LLMs are available for QA tasks, which makes it difficult to achieve impressive RAG performance. Overall performance may vary depending on the model used.
Future Steps
- PDF Parsing: Develop a reliable Bengali-specific PDF parser.
- User Interface: Design a chat-like UI for easier interaction.
- Chat History Management: Implement a system for maintaining and accessing chat history.
Contribution and Feedback
We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
Disclaimer
This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.
Owner
- Name: Bangla RAG
- Login: Bangla-RAG
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Bangla-RAG
We are developing PoRAG, a Bangla RAG Pipeline for easy interaction with Bengali text, providing accurate responses and customizable settings.
Citation (CITATION.cff)
cff-version: 1.2.0
title: Porag (Bangla RAG pipeline)
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Abdullah
    given-names: Al Asif
    email: asif.inc.bd@gmail.com
    orcid: 'https://orcid.org/0009-0009-6166-9975'
  - given-names: Hasan
    family-names: Al Emon
    email: hassanzahin@gmail.com
identifiers:
  - type: url
    value: 'https://github.com/Bangla-RAG/PoRAG'
repository-code: 'https://github.com/Bangla-RAG/PoRAG'
url: 'https://github.com/Bangla-RAG'
abstract: >-
  Fully Configurable RAG Pipeline for Bengali Language RAG
  Applications. Supports both local and huggingface models,
  built with ChromaDB and Langchain.
keywords:
  - bengali_nlp
  - bangla_nlp
  - nlp
  - Bengaliai
  - rag
  - bangallm
  - langchain
  - chromadb
  - llama
license: MIT
GitHub Events
Total
- Watch event: 11
- Fork event: 5
Last Year
- Watch event: 11
- Fork event: 5
Committers
Last synced: 6 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| himisir | 3****r | 17 |
| Abdullah Al Asif | 3****0 | 6 |
| hassanaliemon | h****n@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 5
- Total pull requests: 0
- Average time to close issues: about 20 hours
- Average time to close pull requests: N/A
- Total issue authors: 3
- Total pull request authors: 0
- Average comments per issue: 1.2
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- porimol (2)
- abuzahid (2)
- mdhasanai (1)
Dependencies
- accelerate *
- argparse *
- bitsandbytes *
- chromadb *
- langchain *
- langchain-community *
- peft *
- rich *
- sentence_transformers *
- transformers *