rag-llm-with-pdf-xml
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: semanticClimate
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 79.1 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File
Citation:
Barbhuiya, S., Alwi, K. K., Kumari, R., S., A., Jawed, M., Simon, W., Yadav, G., & Murray-Rust, P. (2025). RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File (0.2). Zenodo. https://doi.org/10.5281/zenodo.16675979
Description:
This notebook demonstrates how to build a semantic question-answering system over scientific PDFs using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). It enables users to upload PDFs, extract content, embed it into a vector store, and query the document using natural language.
Key Features
- PDF Upload & Text Extraction: Extract raw text from research papers using PyMuPDF
- Text Chunking & Embeddings: Convert text into meaningful chunks and generate embeddings using models like sentence-transformers
- RAG Pipeline:
  - Store document chunks in a FAISS vector database
  - Retrieve top-matching chunks based on user queries
  - Generate context-aware answers with an LLM
- Natural Language Q&A: Ask questions like “What is the main finding?” or “What methods were used?” and get accurate answers drawn directly from the paper
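The chunk, embed, retrieve, and prompt steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the notebook's code. The helper names (`chunk_text`, `embed`, `retrieve`, `build_prompt`), the chunk sizes, and the prompt wording are all hypothetical. To stay runnable with only the standard library, a toy bag-of-words vector stands in for a sentence-transformers embedding, and a brute-force cosine ranking stands in for a FAISS index lookup.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `chunk_size` words."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Brute-force top-k search (assumption: stand-in for a vector index)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM (hypothetical wording)."""
    joined = "\n---\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Three hand-written passages play the role of chunks from an extracted PDF.
    passages = [
        "The methods used were spectroscopy and sampling.",
        "The main finding is that emissions rose.",
        "Funding was provided by a grant.",
    ]
    question = "What is the main finding?"
    print(build_prompt(question, retrieve(question, passages, k=1)))
```

In a pipeline like the one described, `embed` would be replaced by a sentence-transformers model and `retrieve` by a FAISS index query, with the prompt passed to an LLM. Overlapping chunks are used here so that a sentence falling on a chunk boundary still appears whole in at least one chunk.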
Reviewers & review process: <Add reviewers and review process link>
Software citation information: CITATION.cff
License: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ | License information: LICENSE
Owner
- Name: semanticClimate
- Login: semanticClimate
- Kind: organization
- Repositories: 3
- Profile: https://github.com/semanticClimate
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Barbhuiya"
    given-names: "Shabnam"
    orcid: "https://orcid.org/0009-0004-0729-7385"
  - family-names: "Alwi"
    given-names: "Kamran Khan"
    orcid: ""
  - family-names: "Kumari"
    given-names: "Renu"
    orcid: "https://orcid.org/0000-0002-9451-7814"
  - family-names: "S"
    given-names: "Anudev"
    orcid: "https://orcid.org/0009-0006-5487-4741"
  - family-names: "Jawed"
    given-names: "Moobashara"
    orcid: "https://orcid.org/0009-0009-7488-4834"
  - family-names: "Simon"
    given-names: "Worthington"
    orcid: "https://orcid.org/0000-0002-8579-9717"
  - family-names: "Yadav"
    given-names: "Gitanjali"
    orcid: "https://orcid.org/0000-0001-6591-9964"
  - family-names: "Murray-Rust"
    given-names: "Peter"
    orcid: "https://orcid.org/0000-0003-3386-3972"
title: "RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File"
version: 0.0.1
doi: 10.5281/zenodo.
date-released: 2025-08-01
url: "https://github.com/semanticClimate/"
GitHub Events
Total
- Push event: 2
- Create event: 2
Last Year
- Push event: 2
- Create event: 2