porag

Fully configurable RAG pipeline for Bengali-language RAG applications. Supports both local and Hugging Face models; built with LangChain.

https://github.com/bangla-rag/porag

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary

Keywords

ai bengali bengali-nlp chromadb langchain llama3 llm nlp rag transformers
Last synced: 6 months ago

Repository

Fully configurable RAG pipeline for Bengali-language RAG applications. Supports both local and Hugging Face models; built with LangChain.

Basic Info
  • Host: GitHub
  • Owner: Bangla-RAG
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 678 KB
Statistics
  • Stars: 46
  • Watchers: 2
  • Forks: 13
  • Open Issues: 0
  • Releases: 0
Topics
ai bengali bengali-nlp chromadb langchain llama3 llm nlp rag transformers
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

PoRAG (পরাগ), Bangla Retrieval-Augmented Generation (RAG) Pipeline


Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.

Use Cases

  • Interact with your Bengali data in Bengali.
  • Ask questions about your Bengali text and get answers.

How It Works

Configurability

  • Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
  • Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
  • Hyperparameter Control: Adjust max_new_tokens, top_p, top_k, temperature, chunk_size, chunk_overlap, and k.
  • Toggle Quantization Mode: Pass the --quantization flag to switch between model variants, including LoRA and 4-bit quantized versions (see the sketch after this list).
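
As a rough illustration of how these options could fit together, here is a minimal sketch that loads a chat model (from the Hugging Face Hub or a local path) with optional 4-bit quantization, plus a Sentence Transformers embedding model. This is not the project's actual code: the `BitsAndBytesConfig` setup and the hard-coded model IDs below are assumptions based on the flags and dependencies documented in this README.

```python
# Minimal sketch (not PoRAG's actual code): load a chat model with optional
# 4-bit quantization and a Sentence Transformers embedding model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer

chat_model_id = "hassanaliemon/bn_rag_llama3-8b"                   # or a local path
embed_model_id = "l3cube-pune/bengali-sentence-similarity-sbert"   # 768-dim embeddings
use_quantization = True                                            # mirrors --quantization

# Assumption: --quantization maps to a bitsandbytes 4-bit config like this one.
quant_config = (
    BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    if use_quantization
    else None
)

tokenizer = AutoTokenizer.from_pretrained(chat_model_id)
model = AutoModelForCausalLM.from_pretrained(
    chat_model_id,
    quantization_config=quant_config,
    device_map="auto",
)

embedder = SentenceTransformer(embed_model_id)
print(embedder.get_sentence_embedding_dimension())  # expected: 768
```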

Installation

  1. Install Python: Download and install Python from python.org.
  2. Clone the Repository:

```bash
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```

  3. Install Required Libraries:

```bash
pip install -r requirements.txt
```

Example `requirements.txt`:

```txt
langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
chromadb==0.5.0
accelerate==0.31.0
peft==0.11.1
transformers==4.40.1
bitsandbytes==0.41.3
sentence-transformers==3.0.1
rich==13.7.1
```

Running the Pipeline

  1. Prepare Your Bangla Text Corpus: Create a text file (e.g., test.txt) with the Bengali text you want to use.
  2. Run the RAG Pipeline:

```bash
python main.py --text_path test.txt
```

  3. Interact with the System: Type your question and press Enter to get a response based on the retrieved information.

Example

```
আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।
```

(In English: "Your question: Where was Rabindranath Tagore born?" — "Rabindranath Tagore was born at the 'Thakurbari' in Jorasanko, Kolkata.")

Parameter Descriptions

You can pass these arguments and adjust their values on each run; a sketch of the corresponding argument definitions follows the table.

| Flag Name | Type | Description | Instructions |
|-----------|------|-------------|--------------|
| `chat_model` | str | The ID of the chat model. It can be either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide the local path to the model. The default value is `hassanaliemon/bn_rag_llama3-8b`. |
| `embed_model` | str | The ID of the embedding model. It can be either a Hugging Face model ID or a local path to the model. | Use the model ID from the Hugging Face model card or provide the local path to the model. The default value is `l3cube-pune/bengali-sentence-similarity-sbert`. |
| `k` | int | The number of documents to retrieve. | The default value is 4. |
| `top_k` | int | The top_k parameter for the chat model. | The default value is 2. |
| `top_p` | float | The top_p parameter for the chat model. | The default value is 0.6. |
| `temperature` | float | The temperature parameter for the chat model. | The default value is 0.6. |
| `max_new_tokens` | int | The maximum number of new tokens to generate. | The default value is 256. |
| `chunk_size` | int | The chunk size for text splitting. | The default value is 500. |
| `chunk_overlap` | int | The chunk overlap for text splitting. | The default value is 150. |
| `text_path` | str | The path to the input text (.txt) file. | This is a required field. Provide the path to the text file you want to use. |
| `show_context` | bool | Whether to show the retrieved context or not. | Use the `--show_context` flag to enable this feature. |
| `quantization` | bool | Whether to enable quantization (4-bit) or not. | Use the `--quantization` flag to enable this feature. |
| `hf_token` | str | Your Hugging Face API token. | The default value is None. Provide your Hugging Face API token if necessary. |
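
The block below is a hypothetical argparse definition that mirrors the table above; it is not taken from the repository's main.py, and the actual script may declare these flags differently. Types and defaults follow the values documented in this README.

```python
# Hypothetical argparse definitions mirroring the documented flags.
import argparse

parser = argparse.ArgumentParser(description="PoRAG - Bangla RAG pipeline (sketch)")
parser.add_argument("--chat_model", type=str, default="hassanaliemon/bn_rag_llama3-8b")
parser.add_argument("--embed_model", type=str, default="l3cube-pune/bengali-sentence-similarity-sbert")
parser.add_argument("--k", type=int, default=4)
parser.add_argument("--top_k", type=int, default=2)
parser.add_argument("--top_p", type=float, default=0.6)
parser.add_argument("--temperature", type=float, default=0.6)
parser.add_argument("--max_new_tokens", type=int, default=256)
parser.add_argument("--chunk_size", type=int, default=500)
parser.add_argument("--chunk_overlap", type=int, default=150)
parser.add_argument("--text_path", type=str, required=True)
parser.add_argument("--show_context", action="store_true")
parser.add_argument("--quantization", action="store_true")
parser.add_argument("--hf_token", type=str, default=None)

args = parser.parse_args()
```

With definitions like these, a run that overrides a few defaults would look like `python main.py --text_path test.txt --quantization --temperature 0.7 --k 6`.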

Key Milestones

  • Default LLM: Trained a LLaMA-3 8B model (hassanaliemon/bn_rag_llama3-8b) for context-based QA.
  • Embedding Model: Tested sagorsarker/bangla-bert-base and csebuetnlp/banglabert, and found l3cube-pune/bengali-sentence-similarity-sbert to be the most effective.
  • Retrieval Pipeline: Implemented a LangChain retrieval pipeline and tested it with our fine-tuned LLM and embedding model (a simplified sketch follows this list).
  • Ingestion System: Settled on text files after testing several PDF parsing solutions.
  • Question Answering Chat Loop: Developed a multi-turn chat system for terminal testing.
  • Generation Configuration Control: Attempted to use generation config in the LLM pipeline.
  • Model Testing: Tested with the following models (quantized and LoRA versions):
    1. asif00/bangla-llama
    2. hassanaliemon/bn_rag_llama3-8b
    3. asif00/mistral-bangla
    4. KillerShoaib/llama-3-8b-bangla-4bit
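
To make the retrieval and ingestion milestones concrete, here is a simplified sketch of how a corpus could be chunked, embedded, and queried with LangChain and Chroma. It is illustrative only: it assumes the defaults listed above (chunk_size=500, chunk_overlap=150, k=4) and the packages in requirements.txt, and the repository's actual implementation may differ.

```python
# Illustrative sketch of ingestion + retrieval (not the repository's code).
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the Bengali corpus and split it into overlapping chunks.
docs = TextLoader("test.txt", encoding="utf-8").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# 2. Embed the chunks and index them in a Chroma vector store.
embeddings = HuggingFaceEmbeddings(model_name="l3cube-pune/bengali-sentence-similarity-sbert")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Retrieve the k most similar chunks for a question; these become the
#    context that the fine-tuned LLM uses to generate the answer.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
context_docs = retriever.invoke("রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?")
```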

Limitations

  • PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
  • Quality of answers: Answer quality depends heavily on the quality of your chosen LLM, embedding model, and Bengali text corpus.
  • Scarcity of pre-trained models: There is currently no high-fidelity Bengali LLM pre-trained for QA tasks, which makes it difficult to achieve impressive RAG performance. Overall performance may vary depending on the model used.

Future Steps

  • PDF Parsing: Develop a reliable Bengali-specific PDF parser.
  • User Interface: Design a chat-like UI for easier interaction.
  • Chat History Management: Implement a system for maintaining and accessing chat history.

Contribution and Feedback

We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.

Top Contributors

  • Abdullah Al Asif
  • Hasan Ali Emon

Disclaimer

This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.

References

  1. Transformers
  2. Langchain
  3. ChromaDB
  4. Sentence Transformers
  5. hassanaliemon/bn_rag_llama3-8b
  6. l3cube-pune/bengali-sentence-similarity-sbert
  7. sagorsarker/bangla-bert-base
  8. csebuetnlp/banglabert
  9. asif00/bangla-llama
  10. KillerShoaib/llama-3-8b-bangla-4bit
  11. asif00/mistral-bangla

Owner

  • Name: Bangla RAG
  • Login: Bangla-RAG
  • Kind: organization

We are developing PoRAG, a Bangla RAG Pipeline for easy interaction with Bengali text, providing accurate responses and customizable settings.

Citation (CITATION.cff)

cff-version: 1.2.0
title: Porag (Bangla RAG pipeline)
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Abdullah
    given-names: Al Asif
    email: asif.inc.bd@gmail.com
    orcid: 'https://orcid.org/0009-0009-6166-9975'
  - given-names: Hasan
    family-names: Al Emon
    email: hassanzahin@gmail.com
identifiers:
  - type: url
    value: 'https://github.com/Bangla-RAG/PoRAG'
repository-code: 'https://github.com/Bangla-RAG/PoRAG'
url: 'https://github.com/Bangla-RAG'
abstract: >-
  Fully Configurable RAG Pipeline for Bengali Language RAG
  Applications. Supports both local and huggingface models,
  built with ChromaDB and Langchain.
keywords:
  - bengali_nlp
  - bangla_nlp
  - nlp
  - Bengaliai
  - rag
  - bangallm
  - langchain
  - chromadb
  - llama
license: MIT

GitHub Events

Total
  • Watch event: 11
  • Fork event: 5
Last Year
  • Watch event: 11
  • Fork event: 5

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 24
  • Total Committers: 3
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.292
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • himisir (3****r): 17 commits
  • Abdullah Al Asif (3****0): 6 commits
  • hassanaliemon (h****n@g****m): 1 commit

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 0
  • Average time to close issues: about 20 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 3
  • Total pull request authors: 0
  • Average comments per issue: 1.2
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • porimol (2)
  • abuzahid (2)
  • mdhasanai (1)
Pull Request Authors
Top Labels
Issue Labels
  • bug (2)
  • documentation (1)
Pull Request Labels

Dependencies

requirements.txt pypi
  • accelerate *
  • argparse *
  • bitsandbytes *
  • chromadb *
  • langchain *
  • langchain-community *
  • peft *
  • rich *
  • sentence_transformers *
  • transformers *