https://github.com/armanasq/rag-pdf-streamlit

This repository contains a Streamlit app for retrieval-augmented generation on PDFs using HuggingFace models. Upload PDFs, ingest their content, and query information using natural language. The app supports GPU acceleration and multilingual queries with precise language detection. Ideal for efficient and accurate document analysis.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (13.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Armanasq
Language: Python
Default Branch: main
Size: 7.81 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme

README.md

PDF-Insight: Streamlit Retrieval-Augmented Generation (RAG) 🗞️

This repository contains a Streamlit application designed for performing retrieval-augmented generation on PDF documents using large language models and embeddings from HuggingFace. The application allows users to upload a PDF, ingest its content into a vector store, and query the document using natural language.

Features

PDF Upload and Display: Upload and view PDF files within the application.
Data Ingestion: Ingest PDF content into a persistent storage context for efficient querying.
Query Handling: Perform natural language queries on the ingested PDF content.
Language Detection: Detect the language of each query with langdetect.

Setup

Prerequisites

Python 3.8+
Streamlit
HuggingFace Transformers
Llama Index
dotenv
langdetect

Installation

Clone the repository:

```bash
git clone https://github.com/Armanasq/RAG-PDF-Streamlit.git
cd RAG-PDF-Streamlit
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Set up environment variables: create a .env file in the root directory and add your HuggingFace API token:

```env
HUGGINGFACE_TOKEN=your_huggingface_api_token
```

Usage

Run the Streamlit application:

```bash
streamlit run streamlitRAG.py
```

Upload a PDF:
- Use the file uploader in the Streamlit interface to upload a PDF.
Ingest PDF Content:
- The content of the uploaded PDF will be ingested and stored for querying.
Query the PDF Content:
- Enter a query related to the PDF content in the text input box.
- The application will return relevant information based on the content of the PDF.

Code Overview

Environment Setup

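Load Environment Variables:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file into the process environment
hf_token = os.getenv("HUGGINGFACE_TOKEN")
```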
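
`os.getenv` returns `None` when a variable is unset, so a missing token would only surface later, when the first model request fails. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the repository.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Under this assumption, `hf_token = require_env("HUGGINGFACE_TOKEN")` could replace the `os.getenv` call once `load_dotenv()` has run.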

Llama Index Configuration

Configure Llama Index to use GPU and HuggingFace models:

```python
Settings.llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_window=5000,
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.1},
    device=0,  # Use GPU 0, set to -1 for CPU
)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device=0,  # Use GPU 0, set to -1 for CPU
)
```
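
The configuration hard-codes `device=0`, which assumes a GPU is present. One way to choose the value at runtime is sketched below, following the README's own convention (`0` for the first GPU, `-1` for CPU). The helper `resolve_device` is hypothetical, and it assumes PyTorch is the backend, which the README does not state. Whether each class accepts these values depends on the installed llama-index version.

```python
def resolve_device() -> int:
    """Return 0 if a CUDA GPU is usable, else -1 (the README's value for CPU)."""
    try:
        import torch  # optional dependency; assumed backend
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1
```

The result could then be passed as `device=resolve_device()` in place of the literal `0`.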

Data Handling

PDF Display Function:

```python
import base64

def display_pdf(file_path):
    # Embed the PDF as a base64 data URI inside an iframe
    with open(file_path, "rb") as f:
        base64_pdf = base64.b64encode(f.read()).decode('utf-8')
    pdf_display = f'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
    return pdf_display
```

Data Ingestion:

```python
def data_ingestion():
    documents = SimpleDirectoryReader(DATA_DIR).load_data()
    storage_context = StorageContext.from_defaults()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
```
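
`data_ingestion()` reads every file in `DATA_DIR`, so the uploaded PDF has to be written there first. The README does not show that step, so the following is only a sketch under assumptions: the helper name `save_upload` and the default directory `"data"` are hypothetical.

```python
from pathlib import Path


def save_upload(data: bytes, filename: str, data_dir: str = "data") -> Path:
    """Write uploaded bytes into data_dir and return the saved path."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a crafted name cannot escape data_dir
    target = target_dir / Path(filename).name
    target.write_bytes(data)
    return target
```

With a Streamlit upload, this could be called as `save_upload(uploaded_file.getvalue(), uploaded_file.name, DATA_DIR)` before running `data_ingestion()`.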

Query Handling

Handle Query Function:

```python
def handle_query(query, lang):
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
    chat_text_qa_msgs = [
        (
            "user",
            """You are a Q&A assistant. Your main goal is to provide answers as accurately as possible, based on the instructions and context you have been given. If a question does not match the provided context or is outside the scope of the document, kindly advise the user to ask questions within the context of the document.
            Context:
            {context_str}
            Question:
            {query_str}
            """
        )
    ]
    text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)

    query_engine = index.as_query_engine(text_qa_template=text_qa_template)
    answer = query_engine.query(query)

    if hasattr(answer, 'response'):
        return answer.response, lang
    elif isinstance(answer, dict) and 'response' in answer:
        return answer['response'], lang
    else:
        return "Sorry, I couldn't find an answer.", lang
```
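
The three-way check at the end of the query handler can be exercised on its own, without a model or an index. The sketch below isolates it in a hypothetical helper, `extract_response`, and feeds it stand-in objects rather than real llama-index results.

```python
from types import SimpleNamespace
from typing import Any

FALLBACK = "Sorry, I couldn't find an answer."


def extract_response(answer: Any) -> str:
    """Pull the answer text out of an object or dict; fall back to a fixed message."""
    if hasattr(answer, "response"):
        return answer.response
    if isinstance(answer, dict) and "response" in answer:
        return answer["response"]
    return FALLBACK


# Stand-ins for query-engine results (not real llama-index objects)
print(extract_response(SimpleNamespace(response="from an object")))
print(extract_response({"response": "from a dict"}))
print(extract_response(None))
```

Keeping this logic in a separate function makes the fallback path easy to test, since it rarely occurs with a live query engine.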

Streamlit UI

Streamlit Interface:

```python
st.title("(PDF) Information and Inference 🗞️")
st.markdown("## Retrieval-Augmented Generation")
st.markdown("Start chat ...🚀")

uploaded_file = st.file_uploader("Upload your PDF file", type="pdf")
query = st.text_input("Ask me anything about the content of the PDF:")
chat_history = st.empty()

if uploaded_file:
    pdf_display = process_file(uploaded_file)
    st.markdown(pdf_display, unsafe_allow_html=True)

if query:
    lang = detect(query)
    response, lang = handle_query(query, lang)
    chat_history.text_area("Chat History", f"User: {query}\nAssistant: {response}", height=300)
```
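
The UI calls `detect(query)` directly. langdetect raises `LangDetectException` when the text has no usable features (for example, digits only), and its results on short inputs can vary between runs unless `DetectorFactory.seed` is set. A hedged sketch of a wrapper that falls back to a default language follows; the name `detect_language`, the injectable `detector` parameter, and the default `"en"` are assumptions made for illustration.

```python
from typing import Callable, Optional


def detect_language(
    query: str,
    detector: Optional[Callable[[str], str]] = None,
    default: str = "en",
) -> str:
    """Return a language code for `query`, or `default` when detection fails."""
    text = query.strip()
    if not text:
        return default
    if detector is None:
        # Imported lazily so the helper can be exercised without langdetect installed
        from langdetect import detect as detector
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException for text with no usable features
        return default
```

In the interface code, `lang = detect_language(query)` would then take the place of `lang = detect(query)`.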

Contributing

Contributions are welcome. Please fork the repository and submit a pull request for any feature requests or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Owner

Name: Arman Asgharpoor
Login: Armanasq
Kind: user
Company: University of Tehran

Website: https://armanasq.github.io/
Twitter: Armannearu
Repositories: 1
Profile: https://github.com/Armanasq

Avionics Engineer M.Sc. Space Engineering AI / Deep Learning

GitHub Events

Total

Watch event: 1
Fork event: 1

Last Year

Watch event: 1
Fork event: 1

Dependencies

.history/requirements_20240614150134.txt pypi

.history/requirements_20240614150220.txt pypi

huggingface-hub ==0.17.2
langdetect ==1.0.9
llama-index ==0.1.4
python-dotenv ==1.0.0
streamlit ==1.33.0

requirements.txt pypi

huggingface-hub ==0.17.2
langdetect ==1.0.9
llama-index ==0.1.4
python-dotenv ==1.0.0
streamlit ==1.33.0
