pdf-extractor

NLP-powered tool designed to extract data from PDF documents. Using Optical Character Recognition (OCR) technology and GPT language model, this tool offers the capability to read, interpret, and convert unstructured data in PDFs into structured, usable data formats and provides the output in an Excel sheet.

https://github.com/kaufmannb/pdf-extractor

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.2%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

NLP-powered tool designed to extract data from PDF documents. Using Optical Character Recognition (OCR) technology and GPT language model, this tool offers the capability to read, interpret, and convert unstructured data in PDFs into structured, usable data formats and provides the output in an Excel sheet.

Basic Info
  • Host: GitHub
  • Owner: kaufmannb
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 376 KB
Statistics
  • Stars: 7
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme Citation

README.md

PDF Data Extractor

The PDF Report Data Extractor is a Python application that enables you to extract specific data from PDFs. It processes multiple PDF files located in an input folder, generates answers for user-defined questions using the OpenAI GPT model, and saves the extracted information in an Excel spreadsheet in the output folder.

Features

  • Extracts specific data from PDF files
  • Supports batch processing of multiple PDF files
  • Utilizes the OpenAI GPT model for generating answers
  • Saves the extracted information in an Excel spreadsheet
  • User-friendly GUI for selecting input and output folders, providing question and instruction inputs, and initiating the process

Prerequisites

Before running the application, ensure that you have the following prerequisites:

  • Python 3.10 or later installed on your system
  • Required Python packages installed (specified in requirements.txt)
  • An OpenAI API key for utilizing the OpenAI GPT model (get your key at OpenAI Platform)

Installation

You have two options for installing and using the PDF Data Extractor:

Option 1: Running the Python Application

  1. Clone the repository or download the source code.

  2. Navigate to the project directory using the command line.

  3. Create and activate a virtual environment (optional but recommended).

  4. Install the required dependencies by running the following command: pip install -r requirements.txt

Option 2: Using the Executable File

  1. Download the executable file from PDF Extractor (680 MB).

  2. Run the executable file to install the application.

Usage

  1. Launch the application by running the main_app.py file:

  2. The application GUI will appear.

  3. Click the "Browse Input Folder" button to select the folder containing the PDF files to analyze.

  4. Click the "Browse Output Folder" button to choose the folder where the final Excel file will be saved.

  5. Enter your OpenAI API key in the provided field. This key is necessary for generating answers using the OpenAI GPT model.

  6. Enter a specific question related to the data you want to extract from the PDF reports.

  7. Optionally, provide instructions for how the GPT model should process and structure the answer based on the PDF content.

  8. Click the "Process Files" button to start the extraction process. The application will process the PDF files, generate answers for the specified question, and save the extracted information in the output folder as an Excel spreadsheet.

  9. Monitor the progress of the processing through the displayed status label.

  10. Once the processing is complete, a success message will be displayed, indicating that the Excel file has been generated.

Workflow

Contributing

Contributions to the PDF Report Data Extractor project are welcome! If you find any issues or have suggestions for improvement, please feel free to submit a pull request or open an issue on GitHub.

License

This project is licensed under the MIT License.

Owner

  • Name: Basil Kaufmann
  • Login: kaufmannb
  • Kind: user
  • Location: New York
  • Company: Icahn School of Medicine at Mount Sinai

Urologist and tech enthusiast.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Kaufmann"
  given-names: "Basil"
  orcid: "https://orcid.org/0000-0001-6965-449X"
- family-names: "Gorin"
  given-names: "Michael"
  orcid: "https://orcid.org/0000-0002-8315-6603"
title: "PDF Data Extractor"
version: 1.0
doi: 
date-released: 2023-07-05
url: "https://github.com/kaufmannb/PDF-Data-Extractor"

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Dependencies

requirements.txt pypi
  • PyInstaller ==5.13.0
  • PyMuPDF ==1.22.5
  • Requests ==2.31.0
  • fitz ==0.0.1.dev2
  • flatten_json ==0.1.13
  • jsonlines ==3.1.0
  • nltk ==3.8.1
  • numpy ==1.23.5
  • openai ==0.27.8
  • openpyxl ==3.1.2
  • pandas ==2.0.3
  • pdfminer ==20191125
  • scikit_learn ==1.3.0
  • selenium ==4.10.0
  • tensorflow_hub ==0.13.0
  • transformers ==4.30.2