https://github.com/copyleftdev/pdf_ai_poc

https://github.com/copyleftdev/pdf_ai_poc

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: copyleftdev
  • Language: Python
  • Default Branch: main
  • Size: 17.6 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme

README.md

PDF Data Extractor

This project extracts data from a PDF file using OpenAI's API and maps it to a predefined JSON schema.

Project Structure

pdf_extractor/ │ ├── config/ │ └── config.py │ ├── data/ │ └── schema.json │ ├── src/ │ ├── __init__.py │ ├── pdf_utils.py │ ├── openai_utils.py │ └── mapper.py │ ├── generate_sample_pdf.py ├── main.py ├── Pipfile ├── Pipfile.lock ├── ruff.toml └── README.md

Setup

Prerequisites

  1. Python 3.9 or higher: Ensure you have Python 3.9+ installed.
  2. Pipenv: Ensure you have pipenv installed.

Installation

  1. Clone the repository:

bash git clone https://github.com/yourusername/pdf_extractor.git cd pdf_extractor

  1. Install dependencies using pipenv:

bash pipenv install pipenv install reportlab # For generating the sample PDF pipenv install --dev ruff # For linting

  1. Set up the OpenAI API key:
  • Create a .env file in the root directory and add your OpenAI API key:

    OPEN_API_KEY=your-openai-api-key

Generate Sample PDF

Generate a sample PDF with test data to use for extraction:

bash pipenv run python generate_sample_pdf.py

Running the Code

To extract data from the PDF and map it to the JSON schema:

bash pipenv run python main.py

Linting with Ruff

To check your code for linting errors with ruff, run:

bash pipenv run ruff check .

To automatically fix linting errors with ruff, run:

bash pipenv run ruff --fix .

How It Works

  1. Configuration: The config/config.py file loads configuration settings and the OpenAI API key from environment variables.

  2. PDF Generation: The generate_sample_pdf.py script generates a sample PDF with email addresses, dates, and phone numbers.

  3. PDF Text Extraction: The src/pdf_utils.py file contains the extract_text_from_pdf function, which extracts text from the PDF.

  4. Data Extraction Using OpenAI: The src/openai_utils.py file contains the extract_data_with_openai function, which uses OpenAI's API to extract data from the extracted text based on predefined prompts.

  5. Mapping Data to JSON Schema: The src/mapper.py file contains the load_json_schema and map_to_json_schema functions, which load the JSON schema and map the extracted data to the schema.

  6. Main Script: The main.py script orchestrates the entire process: it loads the JSON schema, extracts text from the PDF, uses OpenAI's API to extract data, maps the data to the JSON schema, and prints the mapped data as JSON.

Example Output

After running main.py, the output should be a JSON object containing the extracted email addresses, dates, and phone numbers from the sample PDF:

json { "email": [ "example1@example.com", "example2@example.com" ], "date": [ "01/01/2023", "02/02/2023" ], "phone": [ "(123) 456-7890", "(987) 654-3210" ] }

Owner

  • Name: Donald Johnson
  • Login: copyleftdev
  • Kind: user
  • Location: Los Angeles

GitHub Events

Total
Last Year

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 2
  • Total Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Don Johnson dj@c****o 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

Pipfile pypi
  • openai *
  • pypdf2 *
  • reportlab *
  • requests *
  • ruff *
Pipfile.lock pypi
  • annotated-types ==0.7.0
  • anyio ==4.4.0
  • certifi ==2024.7.4
  • chardet ==5.2.0
  • charset-normalizer ==3.3.2
  • distro ==1.9.0
  • h11 ==0.14.0
  • httpcore ==1.0.5
  • httpx ==0.27.0
  • idna ==3.7
  • openai ==1.35.15
  • pillow ==10.4.0
  • pydantic ==2.8.2
  • pydantic-core ==2.20.1
  • pypdf2 ==3.0.1
  • reportlab ==4.2.2
  • requests ==2.32.3
  • ruff ==0.5.3
  • sniffio ==1.3.1
  • tqdm ==4.66.4
  • typing-extensions ==4.12.2
  • urllib3 ==2.2.2