https://github.com/copyleftdev/pdf_ai_poc
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: copyleftdev
- Language: Python
- Default Branch: main
- Size: 17.6 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PDF Data Extractor
This project extracts data from a PDF file using OpenAI's API and maps it to a predefined JSON schema.
Project Structure
pdf_extractor/
│
├── config/
│ └── config.py
│
├── data/
│ └── schema.json
│
├── src/
│ ├── __init__.py
│ ├── pdf_utils.py
│ ├── openai_utils.py
│ └── mapper.py
│
├── generate_sample_pdf.py
├── main.py
├── Pipfile
├── Pipfile.lock
├── ruff.toml
└── README.md
Setup
Prerequisites
- Python 3.9 or higher: Ensure you have Python 3.9+ installed.
- Pipenv: Ensure you have `pipenv` installed.
Installation
- Clone the repository:
bash
git clone https://github.com/yourusername/pdf_extractor.git
cd pdf_extractor
- Install dependencies using `pipenv`:
bash
pipenv install
pipenv install reportlab # For generating the sample PDF
pipenv install --dev ruff # For linting
- Set up the OpenAI API key: create a `.env` file in the root directory and add your OpenAI API key:
OPEN_API_KEY=your-openai-api-key
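How `config/config.py` reads the key is not shown in this README. The following is a minimal sketch, assuming a plain `KEY=value` `.env` file and using only the standard library. `load_env_file` and `get_api_key` are hypothetical names for illustration, not functions from this repository.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=value lines into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and any surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_key(path=".env", name="OPEN_API_KEY"):
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get(name) or load_env_file(path).get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to {path} or export it")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than an authentication failure later in the run.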
Generate Sample PDF
Generate a sample PDF with test data to use for extraction:
bash
pipenv run python generate_sample_pdf.py
Running the Code
To extract data from the PDF and map it to the JSON schema:
bash
pipenv run python main.py
Linting with Ruff
To check your code for linting errors with ruff, run:
bash
pipenv run ruff check .
To automatically fix linting errors with ruff, run:
bash
pipenv run ruff check --fix .
How It Works
- Configuration: The `config/config.py` file loads configuration settings and the OpenAI API key from environment variables.
- PDF Generation: The `generate_sample_pdf.py` script generates a sample PDF with email addresses, dates, and phone numbers.
- PDF Text Extraction: The `src/pdf_utils.py` file contains the `extract_text_from_pdf` function, which extracts text from the PDF.
- Data Extraction Using OpenAI: The `src/openai_utils.py` file contains the `extract_data_with_openai` function, which uses OpenAI's API to extract data from the extracted text based on predefined prompts.
- Mapping Data to JSON Schema: The `src/mapper.py` file contains the `load_json_schema` and `map_to_json_schema` functions, which load the JSON schema and map the extracted data to the schema.
- Main Script: The `main.py` script orchestrates the entire process: it loads the JSON schema, extracts text from the PDF, uses OpenAI's API to extract data, maps the data to the JSON schema, and prints the mapped data as JSON.
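The extraction step can be sketched without touching the network by passing the model call in as a function. This is an illustrative sketch, not the repository's `extract_data_with_openai`: the prompt wording, the field names, and the helpers `build_prompt`, `extract_data`, and `fake_complete` are all assumptions.

```python
import json

# Field names assumed from the sample output in this README.
FIELDS = ("email", "date", "phone")


def build_prompt(text, fields=FIELDS):
    """Ask for a JSON object with one list per field (prompt wording is assumed)."""
    return (
        "Extract the following fields from the text and reply with a JSON "
        f"object whose keys are {', '.join(fields)} and whose values are lists "
        "of strings.\n\nText:\n" + text
    )


def extract_data(text, complete, fields=FIELDS):
    """Run one extraction; `complete` is any callable mapping a prompt to a reply string."""
    reply = complete(build_prompt(text, fields))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Keep only the requested fields and coerce every value to a list of strings.
    result = {}
    for field in fields:
        value = parsed.get(field, [])
        if not isinstance(value, list):
            value = [value]
        result[field] = [str(item) for item in value]
    return result


def fake_complete(prompt):
    """Stand-in for the OpenAI call so the sketch runs offline."""
    return json.dumps({"email": ["example1@example.com"], "phone": "(123) 456-7890"})
```

Calling `extract_data("some text", fake_complete)` exercises the parsing path without an API key. In real use, `complete` would wrap the OpenAI client; with the 1.x `openai` package the reply text is available at `response.choices[0].message.content` after `client.chat.completions.create(...)`.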
Example Output
After running main.py, the output should be a JSON object containing the extracted email addresses, dates, and phone numbers from the sample PDF:
json
{
  "email": [
    "example1@example.com",
    "example2@example.com"
  ],
  "date": [
    "01/01/2023",
    "02/02/2023"
  ],
  "phone": [
    "(123) 456-7890",
    "(987) 654-3210"
  ]
}
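The contents of `data/schema.json` are not shown here, so the mapping step can only be sketched under an assumption: that the schema is a JSON Schema style object whose `properties` name the output fields. The functions below reuse the names `load_json_schema` and `map_to_json_schema` for readability, but their signatures and bodies are illustrative, not the code in `src/mapper.py`. `EXAMPLE_SCHEMA` is likewise hypothetical.

```python
import json

# Hypothetical schema matching the sample output; the real schema.json may differ.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "email": {"type": "array", "items": {"type": "string"}},
        "date": {"type": "array", "items": {"type": "string"}},
        "phone": {"type": "array", "items": {"type": "string"}},
    },
}


def load_json_schema(path):
    """Read a schema file from disk and return it as a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def map_to_json_schema(extracted, schema):
    """Return a dict with exactly the schema's property names, defaulting to []."""
    properties = schema.get("properties", {})
    return {name: list(extracted.get(name, [])) for name in properties}
```

With this shape, keys the model returns that are not in the schema are dropped, and fields the model omits come back as empty lists, so the printed JSON always has the same keys.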
Owner
- Name: Donald Johnson
- Login: copyleftdev
- Kind: user
- Location: Los Angeles
- Repositories: 39
- Profile: https://github.com/copyleftdev
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Don Johnson | dj@c****o | 2 |
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- openai *
- pypdf2 *
- reportlab *
- requests *
- ruff *
- annotated-types ==0.7.0
- anyio ==4.4.0
- certifi ==2024.7.4
- chardet ==5.2.0
- charset-normalizer ==3.3.2
- distro ==1.9.0
- h11 ==0.14.0
- httpcore ==1.0.5
- httpx ==0.27.0
- idna ==3.7
- openai ==1.35.15
- pillow ==10.4.0
- pydantic ==2.8.2
- pydantic-core ==2.20.1
- pypdf2 ==3.0.1
- reportlab ==4.2.2
- requests ==2.32.3
- ruff ==0.5.3
- sniffio ==1.3.1
- tqdm ==4.66.4
- typing-extensions ==4.12.2
- urllib3 ==2.2.2