https://github.com/armanasq/rag-pdf-streamlit
This repository contains a Streamlit app for retrieval-augmented generation on PDFs using HuggingFace models. Upload PDFs, ingest their content, and query information using natural language. The app supports GPU acceleration and multilingual queries with precise language detection. Ideal for efficient and accurate document analysis.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Armanasq
- Language: Python
- Default Branch: main
- Size: 7.81 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PDF-Insight: Streamlit Retrieval-Augmented Generation (RAG) 🗞️
This repository contains a Streamlit application designed for performing retrieval-augmented generation on PDF documents using large language models and embeddings from HuggingFace. The application allows users to upload a PDF, ingest its content into a vector store, and query the document using natural language.
Features
- PDF Upload and Display: Upload and view PDF files within the application.
- Data Ingestion: Ingest PDF content into a persistent storage context for efficient querying.
- Query Handling: Perform natural language queries on the ingested PDF content.
- Language Detection: Detects the language of each query with langdetect and returns it alongside the answer.
Setup
Prerequisites
- Python 3.8+
- Streamlit
- HuggingFace Transformers
- Llama Index
- dotenv
- langdetect
Installation
Clone the repository:
```bash
git clone https://github.com/Armanasq/RAG-PDF-Streamlit.git
cd RAG-PDF-Streamlit
```
Install dependencies:
```bash
pip install -r requirements.txt
```
Set up environment variables: Create a `.env` file in the root directory and add your HuggingFace API token:
```env
HUGGINGFACE_TOKEN=your_huggingface_api_token
```
Usage
Run the Streamlit application:
```bash
streamlit run streamlitRAG.py
```
Upload a PDF:
- Use the file uploader in the Streamlit interface to upload a PDF.
Ingest PDF Content:
- The content of the uploaded PDF will be ingested and stored for querying.
Query the PDF Content:
- Enter a query related to the PDF content in the text input box.
- The application will return relevant information based on the content of the PDF.
Code Overview
Environment Setup
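- Load Environment Variables:
```python
import os

from dotenv import load_dotenv

# Read variables from the .env file into the process environment
load_dotenv()
hf_token = os.getenv("HUGGINGFACE_TOKEN")
```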
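`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file would only surface later as an authentication error. The following is a minimal sketch of a fail-fast check, not part of the repository; the helper name `require_token` is hypothetical.

```python
import os


def require_token(var_name: str = "HUGGINGFACE_TOKEN") -> str:
    """Return the token stored in the given environment variable, or raise."""
    # Treat a missing variable and a whitespace-only value the same way
    token = os.getenv(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; add it to .env or export it before launching."
        )
    return token
```

Calling `require_token()` right after `load_dotenv()` would stop the app at startup with a clear message instead of failing on the first query.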
Llama Index Configuration
- Configure Llama Index to use GPU and HuggingFace models:
```python
Settings.llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mistral-7B-Instruct-v0.3",
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.3",
    context_window=5000,
    max_new_tokens=1024,
    generate_kwargs={"temperature": 0.1},
    device=0,  # Use GPU 0, set to -1 for CPU
)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device=0,  # Use GPU 0, set to -1 for CPU
)
```
Data Handling
PDF Display Function:
```python
import base64


def display_pdf(file_path):
    # Embed the PDF as a base64 data URI inside an iframe
    with open(file_path, "rb") as f:
        base64_pdf = base64.b64encode(f.read()).decode('utf-8')
    pdf_display = f'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
    return pdf_display
```
Data Ingestion:
```python
def data_ingestion():
    # Load every file in DATA_DIR, build a vector index, and persist it to disk
    documents = SimpleDirectoryReader(DATA_DIR).load_data()
    storage_context = StorageContext.from_defaults()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
```
Query Handling
Handle Query Function:
```python
def handle_query(query, lang):
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
    chat_text_qa_msgs = [
        (
            "user",
            """You are a Q&A assistant. Your main goal is to provide answers as accurately as possible, based on the instructions and context you have been given. If a question does not match the provided context or is outside the scope of the document, kindly advise the user to ask questions within the context of the document.
            Context:
            {context_str}
            Question:
            {query_str}
            """
        )
    ]
    text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)

    query_engine = index.as_query_engine(text_qa_template=text_qa_template)
    answer = query_engine.query(query)
    if hasattr(answer, 'response'):
        return answer.response, lang
    elif isinstance(answer, dict) and 'response' in answer:
        return answer['response'], lang
    else:
        return "Sorry, I couldn't find an answer.", lang
```
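To make the role of the `{context_str}` and `{query_str}` placeholders concrete, the sketch below fills a shortened version of the template with plain `str.format`. It only illustrates the idea and is not how LlamaIndex implements prompt assembly; `QA_TEMPLATE`, `build_prompt`, and the sample chunks are hypothetical.

```python
from typing import List

# Shortened stand-in for the Q&A prompt used in the query function
QA_TEMPLATE = (
    "You are a Q&A assistant. Answer only from the context below.\n"
    "Context:\n{context_str}\n"
    "Question:\n{query_str}\n"
)


def build_prompt(chunks: List[str], question: str) -> str:
    """Join retrieved text chunks and substitute them into the template."""
    context_str = "\n\n".join(chunks)
    return QA_TEMPLATE.format(context_str=context_str, query_str=question)


if __name__ == "__main__":
    print(build_prompt(["Chunk one.", "Chunk two."], "What does chunk two say?"))
```

In the application, the query engine retrieves the most relevant chunks from the persisted index and performs this substitution before sending the prompt to the model.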
Streamlit UI
Streamlit Interface:
```python
st.title("(PDF) Information and Inference 🗞️")
st.markdown("## Retrieval-Augmented Generation")
st.markdown("Start chat ...🚀")

uploaded_file = st.file_uploader("Upload your PDF file", type="pdf")
query = st.text_input("Ask me anything about the content of the PDF:")
chat_history = st.empty()

if uploaded_file:
    pdf_display = process_file(uploaded_file)
    st.markdown(pdf_display, unsafe_allow_html=True)

if query:
    lang = detect(query)
    response, lang = handle_query(query, lang)
    chat_history.text_area("Chat History", f"User: {query}\nAssistant: {response}", height=300)
```
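The interface calls `detect(query)` on whatever the user types. langdetect can raise an exception when the text has no usable features, for example input that is only digits or whitespace. A minimal sketch of a guarded wrapper follows; the name `safe_detect` and the `"en"` fallback are assumptions, and the detector is passed in as a parameter so the helper does not depend on any particular library.

```python
from typing import Callable


def safe_detect(query: str, detector: Callable[[str], str], default: str = "en") -> str:
    """Return the detected language code, or a default when detection fails."""
    text = query.strip()
    if not text:
        # Nothing to analyse, so skip the detector entirely
        return default
    try:
        return detector(text)
    except Exception:
        # Detection libraries may raise on feature-less input; fall back quietly
        return default
```

With langdetect installed, this could be used as `lang = safe_detect(query, detect)`. The langdetect documentation also notes that results can vary between runs unless `DetectorFactory.seed` is set, which may matter for short queries.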
Contributing
Contributions are welcome. Please fork the repository and submit a pull request for any feature requests or bug fixes.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Owner
- Name: Arman Asgharpoor
- Login: Armanasq
- Kind: user
- Company: University of Tehran
- Website: https://armanasq.github.io/
- Twitter: Armannearu
- Repositories: 1
- Profile: https://github.com/Armanasq
Avionics Engineer M.Sc. Space Engineering AI / Deep Learning
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1
Dependencies
- huggingface-hub ==0.17.2
- langdetect ==1.0.9
- llama-index ==0.1.4
- python-dotenv ==1.0.0
- streamlit ==1.33.0