vision-parse

Parse PDFs into markdown using Vision LLMs

https://github.com/iamarunbrahma/vision-parse

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.0%) to scientific vocabulary

Keywords

document-parser pdf-parser pdf-to-markdown text-extraction
Last synced: 4 months ago

Repository

Parse PDFs into markdown using Vision LLMs

Basic Info
  • Host: GitHub
  • Owner: iamarunbrahma
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 374 KB
Statistics
  • Stars: 423
  • Watchers: 5
  • Forks: 58
  • Open Issues: 6
  • Releases: 2
Topics
document-parser pdf-parser pdf-to-markdown text-extraction
Created about 1 year ago · Last pushed 11 months ago
Metadata Files
Readme Contributing License Citation Codeowners

README.md

# Vision Parse ✨

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) [![Author: Arun Brahma](https://img.shields.io/badge/Author-Arun%20Brahma-purple)](https://github.com/iamarunbrahma) [![PyPI version](https://img.shields.io/pypi/v/vision-parse.svg)](https://pypi.org/project/vision-parse/)

> Parse PDF documents into beautifully formatted markdown content using state-of-the-art Vision Language Models - all with just a few lines of code!

[Getting Started](#-getting-started) • [Usage](#-usage) • [Supported Models](#-supported-models) • [Parameters](#-customization-parameters) • [Benchmarks](#-benchmarks)

🎯 Introduction

Vision Parse harnesses the power of Vision Language Models to revolutionize document processing:

  • 📝 Scanned Document Processing: Intelligently identifies and extracts text, tables, and LaTeX equations from scanned documents into markdown-formatted content with high precision
  • 🎨 Advanced Content Formatting: Preserves LaTeX equations, hyperlinks, images, and document hierarchy in the markdown output
  • 🤖 Multi-LLM Support: Seamlessly integrates with multiple Vision LLM providers such as OpenAI, Gemini, and Llama for optimal accuracy and speed
  • 📁 Local Model Hosting: Supports local model hosting with Ollama for secure, no-cost, private, and offline document processing

🚀 Getting Started

Prerequisites

  • 🐍 Python >= 3.9
  • 🖥️ Ollama (if you want to use local models)
  • 🤖 API key for OpenAI or Google Gemini (if you want to use API-based models)

Installation

Install the core package using pip (Recommended):

```bash
pip install vision-parse
```

Install the additional dependencies for OpenAI or Gemini:

```bash
# To install all the additional dependencies
pip install 'vision-parse[all]'
```

Install the package from source:

```bash
pip install 'git+https://github.com/iamarunbrahma/vision-parse.git#egg=vision-parse[all]'
```

Setting up Ollama (Optional)

See the Ollama Setup Guide for instructions on setting up Ollama locally.

> [!IMPORTANT]
> While Ollama provides free local model hosting, vision models served through Ollama can be significantly slower at processing documents and may not produce optimal results on complex PDF documents. For better accuracy and performance with complex layouts, consider using API-based models like OpenAI or Gemini.

Setting up Vision Parse with Docker (Optional)

Check out the Docker Setup Guide for instructions on setting up Vision Parse with Docker.

📚 Usage

Basic Example Usage

```python
from vision_parse import VisionParser

# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b",  # For local models, you don't need to provide the api key
    temperature=0.4,
    top_p=0.5,
    image_mode="url",  # Image mode can be "url", "base64" or None
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=False,  # Set to True for parallel processing
)

# Convert PDF to markdown
pdf_path = "input_document.pdf"  # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
```
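To save the parsed output instead of printing it, the per-page strings can be written straight to disk. A minimal sketch, assuming each entry in `markdown_pages` is a plain markdown string as the loop above suggests, and that the constructor defaults suffice when only `model_name` is passed; the output filename is illustrative:

```python
from pathlib import Path

from vision_parse import VisionParser

parser = VisionParser(model_name="llama3.2-vision:11b")
markdown_pages = parser.convert_pdf("input_document.pdf")

# Join the per-page markdown strings into one document,
# separating pages with a horizontal rule.
Path("output_document.md").write_text("\n\n---\n\n".join(markdown_pages), encoding="utf-8")
```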

Customize Ollama configuration for better performance

```python
from vision_parse import VisionParser

custom_prompt = """
Strictly preserve markdown formatting during text extraction from scanned document.
"""

# Initialize parser with Ollama configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.6,
    num_ctx=4096,
    image_mode="base64",
    custom_prompt=custom_prompt,
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_NUM_PARALLEL": 8,
        "OLLAMA_REQUEST_TIMEOUT": 240,
    },
    enable_concurrency=True,
)

# Convert PDF to markdown
pdf_path = "input_document.pdf"  # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)
```

> [!TIP]
> Please refer to the FAQs for more details on how to improve the performance of locally hosted vision models.

API Models Usage (OpenAI, Azure OpenAI, Gemini, DeepSeek)

```python
from vision_parse import VisionParser

# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key",  # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with Azure OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
    openai_config={
        "AZURE_ENDPOINT_URL": "https://****.openai.azure.com/",  # replace with your azure endpoint url
        "AZURE_DEPLOYMENT_NAME": "*******",  # replace with azure deployment name, if needed
        "AZURE_OPENAI_API_KEY": "***********",  # replace with your azure openai api key
        "AZURE_OPENAI_API_VERSION": "2024-08-01-preview",  # replace with latest azure openai api version
    },
)

# Initialize parser with Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key",  # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with DeepSeek model
parser = VisionParser(
    model_name="deepseek-chat",
    api_key="your-deepseek-api-key",  # Get the DeepSeek API key from https://platform.deepseek.com/api_keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)
```
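Hardcoding keys as in the snippets above is fine for quick experiments, but for shared code it is safer to read them from the environment. A minimal sketch; the `OPENAI_API_KEY` variable name is a common convention assumed here, not something Vision Parse mandates:

```python
import os

from vision_parse import VisionParser

# Assumed variable name: OPENAI_API_KEY is a convention, not required by Vision Parse.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set OPENAI_API_KEY before running this script.")

parser = VisionParser(
    model_name="gpt-4o",
    api_key=api_key,
    image_mode="url",
    enable_concurrency=True,
)
```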

✅ Supported Models

This package supports the following Vision LLM models:

| Model Name | Provider Name |
|:--------------------:|:--------:|
| gpt-4o | OpenAI |
| gpt-4o-mini | OpenAI |
| gemini-1.5-flash | Google |
| gemini-2.0-flash-exp | Google |
| gemini-1.5-pro | Google |
| llava:13b | Ollama |
| llava:34b | Ollama |
| llama3.2-vision:11b | Ollama |
| llama3.2-vision:70b | Ollama |
| deepseek-r1:32b | Ollama |
| deepseek-chat | DeepSeek |

🔧 Customization Parameters

Vision Parse offers several customization parameters to enhance document processing:

| Parameter | Description | Value Type |
|:---------:|:-----------:|:----------:|
| model_name | Name of the Vision LLM model to use | str |
| custom_prompt | Define a custom prompt for the model; it is appended as a suffix to the default prompt | str |
| ollama_config | Specify custom configuration for Ollama client initialization | dict |
| openai_config | Specify custom configuration for OpenAI, Azure OpenAI or DeepSeek client initialization | dict |
| gemini_config | Specify custom configuration for Gemini client initialization | dict |
| image_mode | Sets the image output format for the model, i.e. whether the markdown content contains image URLs or base64-encoded images | str |
| detailed_extraction | Enable advanced content extraction for complex information such as LaTeX equations, tables, images, etc. | bool |
| enable_concurrency | Enable parallel processing of multiple pages in a PDF document in a single request | bool |

> [!TIP]
> For more details on custom model configuration (i.e. openai_config, gemini_config, and ollama_config), please refer to Model Configuration.

📊 Benchmarks

I conducted benchmarking to evaluate Vision Parse's performance against MarkItDown and Nougat, using a curated dataset of 100 diverse machine learning papers from arXiv. Since no other ground truth is available for this task, I relied on the Marker library to generate the ground truth markdown formatted data.

Results

| Parser | Accuracy Score |
|:------------:|:---:|
| Vision Parse | 92% |
| MarkItDown | 67% |
| Nougat | 79% |

> [!NOTE]
> I used the gpt-4o model for Vision Parse to extract markdown content from the PDF documents, with the model parameter settings defined in the scoring.py script. The above results may vary depending on the model you choose for Vision Parse and its parameter settings.

Run Your Own Benchmarks

You can benchmark the performance of Vision Parse on your machine using your own dataset. Run scoring.py to generate a detailed comparison report in the output directory.

  1. Install packages from requirements.txt:

     ```bash
     pip install --no-cache-dir -r benchmarks/requirements.txt
     ```

  2. Run the benchmark script:

     ```bash
     # Change pdf_path to your pdf file path and benchmark_results_path to your desired output path
     python benchmarks/scoring.py
     ```
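For intuition only: benchmarks/requirements.txt pulls in python-Levenshtein, so a text-similarity score between generated and ground-truth markdown can be sketched as below. This is a hypothetical illustration, not the actual metric implemented in scoring.py:

```python
import Levenshtein  # from the python-Levenshtein package in benchmarks/requirements.txt

def similarity_score(generated_md: str, ground_truth_md: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical text."""
    return Levenshtein.ratio(generated_md, ground_truth_md)

generated = "# Title\n\nHello world"
ground_truth = "# Title\n\nHello, world!"
print(f"similarity: {similarity_score(generated, ground_truth):.2f}")
```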

🤝 Contributing

Contributions to Vision Parse are welcome! Whether you're fixing bugs, adding new features, or creating example notebooks, your help is appreciated. Please check out the contributing guidelines for instructions on setting up the development environment, code style requirements, and the pull request process.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

  • Name: Arun Brahma
  • Login: iamarunbrahma
  • Kind: user
  • Bio: Senior Machine Learning Engineer

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Brahma"
  given-names: "Arun"
title: "Vision-Parse: Parse PDFs into markdown using Vision LLMs"
date-released: 2024-12-31
url: "https://github.com/iamarunbrahma/vision-parse"

GitHub Events

Total
  • Create event: 28
  • Release event: 2
  • Issues event: 36
  • Watch event: 373
  • Delete event: 14
  • Issue comment event: 36
  • Push event: 76
  • Pull request review event: 4
  • Pull request event: 40
  • Fork event: 58
Last Year
  • Create event: 28
  • Release event: 2
  • Issues event: 36
  • Watch event: 373
  • Delete event: 14
  • Issue comment event: 36
  • Push event: 76
  • Pull request review event: 4
  • Pull request event: 40
  • Fork event: 58

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 92
  • Total Committers: 4
  • Avg Commits per committer: 23.0
  • Development Distribution Score (DDS): 0.043
Past Year
  • Commits: 92
  • Committers: 4
  • Avg Commits per committer: 23.0
  • Development Distribution Score (DDS): 0.043
Top Committers
  • Arun Brahma (m****4@g****m): 88
  • Mohamad Aljazaery (m****y@m****m): 2
  • mark-beeby (4****y): 1
  • Ankit Thakur (a****5@i****m): 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 23
  • Total pull requests: 38
  • Average time to close issues: 17 days
  • Average time to close pull requests: about 6 hours
  • Total issue authors: 20
  • Total pull request authors: 6
  • Average comments per issue: 1.13
  • Average comments per pull request: 0.26
  • Merged pull requests: 35
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 23
  • Pull requests: 38
  • Average time to close issues: 17 days
  • Average time to close pull requests: about 6 hours
  • Issue authors: 20
  • Pull request authors: 6
  • Average comments per issue: 1.13
  • Average comments per pull request: 0.26
  • Merged pull requests: 35
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • marmor7 (2)
  • csv610 (2)
  • marcelocecin (2)
  • FLYmoments (1)
  • ukos-git (1)
  • fabianonethousand (1)
  • wns-saitej (1)
  • kundeng (1)
  • MNicholasPro (1)
  • twoxfh (1)
  • drmetro09 (1)
  • pinkponk (1)
  • Ronimsenn (1)
  • MohamedAliRashad (1)
  • gh-wf (1)
Pull Request Authors
  • iamarunbrahma (30)
  • ankitthakur (2)
  • aldopareja (2)
  • mark-beeby (2)
  • maljazaery (2)
  • rushabh31 (1)
Top Labels
Issue Labels
enhancement (14) bug (8) stale (1) question (1)
Pull Request Labels
documentation (26) size:M (12) enhancement (10) size:L (9) size:XS (9) size:S (6) lgtm (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 1,741 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 14
  • Total maintainers: 1
pypi.org: vision-parse

Parse PDF documents into markdown formatted content using Vision LLMs

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,741 last month
Rankings
Dependent packages count: 9.9%
Forks count: 32.0%
Average: 34.8%
Stargazers count: 41.7%
Dependent repos count: 55.6%
Maintainers (1)
Last synced: 4 months ago

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v4 composite
  • github/codeql-action/analyze v3 composite
  • github/codeql-action/init v3 composite
Dockerfile docker
  • python 3.13-slim build
docker-compose.yml docker
benchmarks/requirements.txt pypi
  • markitdown *
  • nltk *
  • python-Levenshtein *
  • vision-parse *
uv.lock pypi
  • 104 dependencies
.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
pyproject.toml pypi
  • importlib-resources >=5.0.0; python_version < '3.9'
  • jinja2 >=3.0.0
  • ollama >=0.4.4
  • pydantic >=2.0.0
  • pymupdf >=1.22.0
  • tqdm >=4.65.0