https://github.com/armanasq/rag-pdf-streamlit

This repository contains a Streamlit app for retrieval-augmented generation on PDFs using HuggingFace models. Upload PDFs, ingest their content, and query information using natural language. The app supports GPU acceleration and multilingual queries with precise language detection. Ideal for efficient and accurate document analysis.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Armanasq
  • Language: Python
  • Default Branch: main
  • Size: 7.81 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme

README.md

PDF-Insight: Streamlit Retrieval-Augmented Generation (RAG) 🗞️

This repository contains a Streamlit application designed for performing retrieval-augmented generation on PDF documents using large language models and embeddings from HuggingFace. The application allows users to upload a PDF, ingest its content into a vector store, and query the document using natural language.

Features

  • PDF Upload and Display: Upload and view PDF files within the application.
  • Data Ingestion: Ingest PDF content into a persistent storage context for efficient querying.
  • Query Handling: Perform natural language queries on the ingested PDF content.
  • Language Detection: Detects the language of each query with langdetect and returns it alongside the answer.
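The ingest-then-query flow behind these features can be illustrated with a toy retriever. The sketch below is not the app's code: it substitutes word-count vectors and cosine similarity for the HuggingFace embeddings and the LlamaIndex vector store, and the names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k chunks ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Made-up text chunks standing in for passages extracted from a PDF.
chunks = [
    "The satellite uses a star tracker for attitude determination.",
    "Invoices are due within thirty days of receipt.",
    "The thermal subsystem keeps the battery above zero degrees.",
]
best = retrieve("When are invoices due?", chunks)
print(best[0])
```

In a RAG pipeline the retrieved chunk is then placed into the prompt sent to the language model. A real embedding model matches on meaning rather than shared words, which is why the app relies on one.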

Setup

Prerequisites

  • Python 3.8+
  • Streamlit
  • HuggingFace Transformers
  • Llama Index
  • python-dotenv
  • langdetect

Installation

  1. Clone the repository:

         git clone https://github.com/Armanasq/RAG-PDF-Streamlit.git
         cd RAG-PDF-Streamlit

  2. Install dependencies:

         pip install -r requirements.txt

  3. Set up environment variables: create a .env file in the root directory and add your HuggingFace API token:

         HUGGINGFACE_TOKEN=your_huggingface_api_token

Usage

  1. Run the Streamlit application:

         streamlit run streamlitRAG.py

  2. Upload a PDF:

    • Use the file uploader in the Streamlit interface to upload a PDF.
  3. Ingest PDF Content:

    • The content of the uploaded PDF will be ingested and stored for querying.
  4. Query the PDF Content:

    • Enter a query related to the PDF content in the text input box.
    • The application will return relevant information based on the content of the PDF.

Code Overview

Environment Setup

  • Load Environment Variables:

        import os
        from dotenv import load_dotenv

        load_dotenv()  # read variables from .env into the process environment
        hf_token = os.getenv("HUGGINGFACE_TOKEN")
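`os.getenv` returns `None` when the variable is absent, so a missing token would only surface later as an authentication error. A minimal sketch of a fail-fast check follows; the helper `require_token` is hypothetical and not part of the repository, and it takes the environment as a mapping so it can be tried without a real `.env` file.

```python
import os
from typing import Mapping


def require_token(env: Mapping[str, str], name: str = "HUGGINGFACE_TOKEN") -> str:
    """Return the token stored under `name`, or raise if it is missing or blank."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return token


# In the app this could be called as require_token(os.environ) after load_dotenv().
demo_env = {"HUGGINGFACE_TOKEN": "hf_example_token"}  # placeholder, not a real token
print(require_token(demo_env))
```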

Llama Index Configuration

  • Configure Llama Index to use GPU and HuggingFace models:

        Settings.llm = HuggingFaceInferenceAPI(
            model_name="mistralai/Mistral-7B-Instruct-v0.3",
            tokenizer_name="mistralai/Mistral-7B-Instruct-v0.3",
            context_window=5000,
            max_new_tokens=1024,
            generate_kwargs={"temperature": 0.1},
            device=0  # Use GPU 0, set to -1 for CPU
        )
        Settings.embed_model = HuggingFaceEmbedding(
            model_name="BAAI/bge-small-en-v1.5",
            device=0  # Use GPU 0, set to -1 for CPU
        )
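The two numeric settings above interact. Assuming the context window has to hold both the prompt and the generated tokens (the usual convention, though the README does not say how the backend enforces it), the space left for the prompt template, retrieved context, and question can be estimated as below. The helper `prompt_budget` is hypothetical.

```python
def prompt_budget(context_window: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the generated answer is reserved."""
    if max_new_tokens >= context_window:
        raise ValueError("max_new_tokens must be smaller than context_window")
    return context_window - max_new_tokens


# Values taken from the configuration above.
budget = prompt_budget(context_window=5000, max_new_tokens=1024)
print(budget)  # 3976
```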

Data Handling

  • PDF Display Function:

        import base64

        def display_pdf(file_path):
            # Embed the PDF in the page as a base64 data URI inside an iframe.
            with open(file_path, "rb") as f:
                base64_pdf = base64.b64encode(f.read()).decode('utf-8')
            pdf_display = f'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
            return pdf_display

  • Data Ingestion:

        def data_ingestion():
            documents = SimpleDirectoryReader(DATA_DIR).load_data()
            storage_context = StorageContext.from_defaults()
            index = VectorStoreIndex.from_documents(documents)
            index.storage_context.persist(persist_dir=PERSIST_DIR)
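The README calls a `process_file` helper from the UI but does not show it. Since `data_ingestion` reads from a directory, the uploaded bytes presumably have to be written to disk first. The sketch below is a guess at that step using only the standard library. `save_upload` and `pdf_iframe` are hypothetical names, and the temporary directory stands in for `DATA_DIR`.

```python
import base64
import os
import tempfile


def save_upload(data: bytes, filename: str, data_dir: str) -> str:
    """Write uploaded bytes into data_dir and return the saved path."""
    os.makedirs(data_dir, exist_ok=True)
    # Keep only the final path component so a crafted name cannot escape data_dir.
    safe_name = os.path.basename(filename) or "upload.pdf"
    path = os.path.join(data_dir, safe_name)
    with open(path, "wb") as f:
        f.write(data)
    return path


def pdf_iframe(data: bytes) -> str:
    """Build the same kind of base64 data-URI iframe as display_pdf, from bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return (
        f'<iframe src="data:application/pdf;base64,{encoded}" '
        'width="100%" height="600" type="application/pdf"></iframe>'
    )


with tempfile.TemporaryDirectory() as tmp:  # stands in for DATA_DIR
    saved = save_upload(b"%PDF-1.4 demo", "../report.pdf", tmp)
    print(os.path.basename(saved))  # report.pdf
```

Base64 encoding inflates the payload by about a third, so embedding very large PDFs this way may make the page slow to render.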

Query Handling

  • Handle Query Function:

    ```python
    def handle_query(query, lang):
        # Reload the persisted index from disk.
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)
        chat_text_qa_msgs = [
            (
                "user",
                """You are a Q&A assistant. Your main goal is to provide answers as accurately as possible, based on the instructions and context you have been given. If a question does not match the provided context or is outside the scope of the document, kindly advise the user to ask questions within the context of the document.
                Context:
                {context_str}
                Question:
                {query_str}
                """
            )
        ]
        text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)

        query_engine = index.as_query_engine(text_qa_template=text_qa_template)
        answer = query_engine.query(query)

        if hasattr(answer, 'response'):
            return answer.response, lang
        elif isinstance(answer, dict) and 'response' in answer:
            return answer['response'], lang
        else:
            return "Sorry, I couldn't find an answer.", lang
    ```
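The template's `{context_str}` and `{query_str}` placeholders are filled by the query engine with the retrieved text and the user's question. The sketch below imitates that substitution with plain `str.format` on a shortened template. It is a stand-in for illustration, not LlamaIndex's implementation, and `build_prompt` is a hypothetical name.

```python
QA_TEMPLATE = (
    "You are a Q&A assistant. Answer only from the context below.\n"
    "Context:\n{context_str}\n"
    "Question:\n{query_str}\n"
)


def build_prompt(chunks, query):
    """Join retrieved chunks and substitute them, with the query, into the template."""
    context = "\n---\n".join(chunks)
    return QA_TEMPLATE.format(context_str=context, query_str=query)


prompt = build_prompt(
    ["Invoices are due within thirty days."],  # made-up retrieved chunk
    "When are invoices due?",
)
print(prompt)
```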

Streamlit UI

  • Streamlit Interface:

    ```python
    st.title("(PDF) Information and Inference 🗞️")
    st.markdown("## Retrieval-Augmented Generation")
    st.markdown("Start chat ...🚀")

    uploaded_file = st.file_uploader("Upload your PDF file", type="pdf")
    query = st.text_input("Ask me anything about the content of the PDF:")
    chat_history = st.empty()

    if uploaded_file:
        pdf_display = process_file(uploaded_file)
        st.markdown(pdf_display, unsafe_allow_html=True)

    if query:
        lang = detect(query)
        response, lang = handle_query(query, lang)
        chat_history.text_area("Chat History", f"User: {query}\nAssistant: {response}", height=300)
    ```
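As written, the chat area shows only the most recent exchange, because each query overwrites the text area. One way to keep earlier turns is to accumulate them in a list and render the whole list. The sketch below shows that logic with a plain Python list, which a Streamlit app could hold in `st.session_state`. `add_turn` and `render_history` are hypothetical helpers, not part of the repository.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (user query, assistant response)


def add_turn(history: List[Turn], query: str, response: str) -> None:
    """Append one question/answer pair to the running history."""
    history.append((query, response))


def render_history(history: List[Turn]) -> str:
    """Format every turn in the same 'User / Assistant' layout the app uses."""
    return "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)


history: List[Turn] = []  # in Streamlit this list could live in st.session_state
add_turn(history, "What is the report about?", "It summarizes quarterly results.")
add_turn(history, "Who wrote it?", "The finance team.")
print(render_history(history))
```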

Contributing

Contributions are welcome. Please fork the repository and submit a pull request for any feature requests or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Owner

  • Name: Arman Asgharpoor
  • Login: Armanasq
  • Kind: user
  • Company: University of Tehran

Avionics Engineer M.Sc. Space Engineering AI / Deep Learning

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Dependencies

.history/requirements_20240614150134.txt pypi
.history/requirements_20240614150220.txt pypi
  • huggingface-hub ==0.17.2
  • langdetect ==1.0.9
  • llama-index ==0.1.4
  • python-dotenv ==1.0.0
  • streamlit ==1.33.0
requirements.txt pypi
  • huggingface-hub ==0.17.2
  • langdetect ==1.0.9
  • llama-index ==0.1.4
  • python-dotenv ==1.0.0
  • streamlit ==1.33.0