Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: semanticClimate
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 79.1 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File

Citation:

Barbhuiya, S., Alwi, K. K., Kumari, R., S., A., Jawed, M., Simon, W., Yadav, G., & Murray-Rust, P. (2025). RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File (0.2). Zenodo. https://doi.org/10.5281/zenodo.16675979

Description:

This notebook demonstrates how to build a semantic question-answering system over scientific PDFs using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). It enables users to upload PDFs, extract content, embed it into a vector store, and query the document using natural language.

Key Features:

- PDF Upload & Text Extraction: Extract raw text from research papers using PyMuPDF
- Text Chunking & Embeddings: Split the text into chunks and generate embeddings with sentence-transformers models
- RAG Pipeline:
  - Store the chunk embeddings in a FAISS vector index
  - Retrieve the top-matching chunks for each user query
  - Generate context-aware answers with an LLM
- Natural Language Q&A: Ask questions like “What is the main finding?” or “What methods were used?” and get answers drawn directly from the paper
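The chunk, retrieve, and prompt steps listed above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the notebook's code: a toy word-count vector stands in for sentence-transformers embeddings, a brute-force cosine search stands in for the FAISS index, and the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) and the chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Brute-force top-k search; stands in for a vector index lookup."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up sample text, used only to exercise the functions.
document = (
    "We measured soil moisture at twelve sites. "
    "The methods used were gravimetric sampling and remote sensing. "
    "The main finding is that moisture declined over the study period."
)
chunks = chunk_text(document, chunk_size=12, overlap=3)
top_chunks = retrieve("What methods were used?", chunks, k=1)
prompt = build_prompt("What methods were used?", top_chunks)
```

The overlap between windows is a common way to keep a sentence that straddles a chunk boundary retrievable from at least one chunk. In a real pipeline the `prompt` string would be passed to whichever LLM the notebook uses.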

Link to Notebook

Reviewers & review process: <Add reviewers and review process link>


Software citation information: CITATION.cff

License: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ | License information: LICENSE

Owner

  • Name: semanticClimate
  • Login: semanticClimate
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Barbhuiya"
  given-names: "Shabnam"
  orcid: "https://orcid.org/0009-0004-0729-7385"
- family-names: "Alwi"
  given-names: "Kamran Khan"
- family-names: "Kumari"
  given-names: "Renu"
  orcid: "https://orcid.org/0000-0002-9451-7814" 
- family-names: "S"
  given-names: "Anudev"
  orcid: "https://orcid.org/0009-0006-5487-4741"
- family-names: "Jawed"
  given-names: "Moobashara"
  orcid: "https://orcid.org/0009-0009-7488-4834"
- family-names: "Simon"
  given-names: "Worthington"
  orcid: "https://orcid.org/0000-0002-8579-9717"
- family-names: "Yadav"
  given-names: "Gitanjali"
  orcid: "https://orcid.org/0000-0001-6591-9964"
- family-names: "Murray-Rust"
  given-names: "Peter"
  orcid: "https://orcid.org/0000-0003-3386-3972"
title: "RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File"
version: 0.0.1
doi: 10.5281/zenodo.
date-released: 2025-08-01
url: "https://github.com/semanticClimate/"

GitHub Events

Total
  • Push event: 2
  • Create event: 2
Last Year
  • Push event: 2
  • Create event: 2