vision-parse
Parse PDFs into markdown using Vision LLMs
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (16.0%) to scientific vocabulary
Repository
Statistics
- Stars: 423
- Watchers: 5
- Forks: 58
- Open Issues: 6
- Releases: 2
Metadata Files
README.md
🎯 Introduction
Vision Parse harnesses the power of Vision Language Models to revolutionize document processing:
- 📝 Scanned Document Processing: Intelligently identifies and extracts text, tables, and LaTeX equations from scanned documents into markdown-formatted content with high precision
- 🎨 Advanced Content Formatting: Preserves LaTeX equations, hyperlinks, images, and document hierarchy for markdown-formatted content
- 🤖 Multi-LLM Support: Seamlessly integrates with multiple Vision LLM providers such as OpenAI, Gemini, and Llama for optimal accuracy and speed
- 📁 Local Model Hosting: Supports local model hosting with Ollama for secure, no-cost, private, and offline document processing
🚀 Getting Started
Prerequisites
- 🐍 Python >= 3.9
- 🖥️ Ollama (if you want to use local models)
- 🤖 API Key for OpenAI or Google Gemini (if you want to use OpenAI or Google Gemini)
Installation
Install the core package using pip (Recommended):
```bash
pip install vision-parse
```
Install the additional dependencies for OpenAI or Gemini:
```bash
# To install all the additional dependencies
pip install 'vision-parse[all]'
```
Install the package from source:
```bash
pip install 'git+https://github.com/iamarunbrahma/vision-parse.git#egg=vision-parse[all]'
```
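After installing, it can help to confirm the prerequisites listed above. The following is a minimal sketch that uses only the Python standard library; the helper names `python_ok` and `installed_version` are illustrative and are not part of vision-parse.

```python
import sys
from importlib import metadata


def python_ok(minimum=(3, 9)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def installed_version(distribution="vision-parse"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print("Python >= 3.9:", python_ok())
print("vision-parse version:", installed_version() or "not installed")
```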
Setting up Ollama (Optional)
See the Ollama Setup Guide for how to set up Ollama locally.
[!IMPORTANT] While Ollama provides free local model hosting, please note that vision models from Ollama can be significantly slower in processing documents and may not produce optimal results when handling complex PDF documents. For better accuracy and performance with complex layouts in PDF documents, consider using API-based models like OpenAI or Gemini.
Setting up Vision Parse with Docker (Optional)
See the Docker Setup Guide for how to set up Vision Parse with Docker.
📚 Usage
Basic Example Usage
```python
from vision_parse import VisionParser

# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b",  # For local models, you don't need to provide the api key
    temperature=0.4,
    top_p=0.5,
    image_mode="url",  # Image mode can be "url", "base64" or None
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=False,  # Set to True for parallel processing
)

# Convert PDF to markdown
pdf_path = "input_document.pdf"  # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
```
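The example above only prints each page. If you want a single markdown file instead, one option is sketched below. It assumes the conversion call returns a list of strings, one per page, as the loop above suggests; `save_markdown` and the separator are hypothetical choices, not part of vision-parse.

```python
import tempfile
from pathlib import Path


def save_markdown(pages, output_path, separator="\n\n---\n\n"):
    """Join per-page markdown strings and write them to one UTF-8 file."""
    text = separator.join(page.strip() for page in pages)
    path = Path(output_path)
    path.write_text(text + "\n", encoding="utf-8")
    return path


# Demo with placeholder strings; in practice, pass the list of pages
# returned by the parser instead of sample_pages.
sample_pages = ["# Title\n\nFirst page.", "Second page."]
with tempfile.TemporaryDirectory() as tmp:
    out = save_markdown(sample_pages, Path(tmp) / "output_document.md")
    print(out.read_text(encoding="utf-8"))
```

A horizontal rule between pages keeps page boundaries visible in the combined file; pass a different `separator` if your downstream tooling expects something else.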
Customize Ollama configuration for better performance
```python
from vision_parse import VisionParser

custom_prompt = """
Strictly preserve markdown formatting during text extraction from scanned document.
"""

# Initialize parser with Ollama configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.6,
    num_ctx=4096,
    image_mode="base64",
    custom_prompt=custom_prompt,
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_NUM_PARALLEL": 8,
        "OLLAMA_REQUEST_TIMEOUT": 240,
    },
    enable_concurrency=True,
)

# Convert PDF to markdown
pdf_path = "input_document.pdf"  # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)
```
[!TIP] Please refer to FAQs for more details on how to improve the performance of locally hosted vision models.
API Models Usage (OpenAI, Azure OpenAI, Gemini, DeepSeek)
```python
from vision_parse import VisionParser

# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key",  # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with Azure OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
    openai_config={
        "AZURE_ENDPOINT_URL": "https://****.openai.azure.com/",  # replace with your azure endpoint url
        "AZURE_DEPLOYMENT_NAME": "*******",  # replace with azure deployment name, if needed
        "AZURE_OPENAI_API_KEY": "***********",  # replace with your azure openai api key
        "AZURE_OPENAI_API_VERSION": "2024-08-01-preview",  # replace with latest azure openai api version
    },
)

# Initialize parser with Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key",  # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with DeepSeek model
parser = VisionParser(
    model_name="deepseek-chat",
    api_key="your-deepseek-api-key",  # Get the DeepSeek API key from https://platform.deepseek.com/api_keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)
```
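The examples above pass the API key as a string literal, which is fine for a demo but easy to leak into version control. A minimal sketch of reading the key from an environment variable follows. The helper `api_parser_kwargs` and the variable name `OPENAI_API_KEY` are assumptions made for illustration; vision-parse does not define them. The keyword names mirror the parameters used in the examples above.

```python
import os


def api_parser_kwargs(model_name, env_var, **overrides):
    """Build keyword arguments for VisionParser, reading the API key from the environment.

    `env_var` is whatever variable name you chose to export; it is not
    defined by vision-parse itself.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    kwargs = {
        "model_name": model_name,
        "api_key": api_key,
        "temperature": 0.7,
        "top_p": 0.4,
        "image_mode": "url",
        "detailed_extraction": False,
        "enable_concurrency": True,
    }
    kwargs.update(overrides)  # caller-supplied values win over the defaults
    return kwargs


# Demo with a placeholder key; real code would export the variable in the shell.
os.environ.setdefault("OPENAI_API_KEY", "your-openai-api-key")
kwargs = api_parser_kwargs("gpt-4o", "OPENAI_API_KEY", detailed_extraction=True)
print({k: v for k, v in kwargs.items() if k != "api_key"})
# With vision-parse installed: parser = VisionParser(**kwargs)
```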
✅ Supported Models
This package supports the following Vision LLM models:
| Model Name | Provider Name |
|:------------:|:----------:|
| gpt-4o | OpenAI |
| gpt-4o-mini | OpenAI |
| gemini-1.5-flash | Google |
| gemini-2.0-flash-exp | Google |
| gemini-1.5-pro | Google |
| llava:13b | Ollama |
| llava:34b | Ollama |
| llama3.2-vision:11b | Ollama |
| llama3.2-vision:70b | Ollama |
| deepseek-r1:32b | Ollama |
| deepseek-chat | DeepSeek |
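If you select models programmatically, the table above can be mirrored as a small lookup. This is a sketch only: `SUPPORTED_MODELS`, `provider_for`, and `needs_api_key` are hypothetical names, the dictionary is copied by hand from the table, and the package may accept names not listed here. The API-key rule follows the prerequisites, which ask for a key only for hosted providers and not for local Ollama models.

```python
# Hand-copied from the supported-models table; update it if the table changes.
SUPPORTED_MODELS = {
    "gpt-4o": "OpenAI",
    "gpt-4o-mini": "OpenAI",
    "gemini-1.5-flash": "Google",
    "gemini-2.0-flash-exp": "Google",
    "gemini-1.5-pro": "Google",
    "llava:13b": "Ollama",
    "llava:34b": "Ollama",
    "llama3.2-vision:11b": "Ollama",
    "llama3.2-vision:70b": "Ollama",
    "deepseek-r1:32b": "Ollama",
    "deepseek-chat": "DeepSeek",
}


def provider_for(model_name):
    """Return the provider listed for a model, or raise ValueError if it is not in the table."""
    try:
        return SUPPORTED_MODELS[model_name]
    except KeyError:
        raise ValueError(f"{model_name!r} is not in the supported-models table") from None


def needs_api_key(model_name):
    """Local Ollama models run without an API key; hosted providers need one."""
    return provider_for(model_name) != "Ollama"


print(provider_for("llama3.2-vision:11b"), needs_api_key("llama3.2-vision:11b"))
# prints: Ollama False
```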
🔧 Customization Parameters
Vision Parse offers several customization parameters to enhance document processing:
| Parameter | Description | Value Type |
|:---------:|:-----------:|:-------------:|
| model_name | Name of the Vision LLM model to use | str |
| custom_prompt | Custom prompt for the model; it is appended as a suffix to the default prompt | str |
| ollama_config | Custom configuration for Ollama client initialization | dict |
| openai_config | Custom configuration for OpenAI, Azure OpenAI or DeepSeek client initialization | dict |
| gemini_config | Custom configuration for Gemini client initialization | dict |
| image_mode | Image output format for the model, i.e. whether the markdown content contains image URLs or base64-encoded images | str |
| detailed_extraction | Enable advanced content extraction for complex information such as LaTeX equations, tables, images, etc. | bool |
| enable_concurrency | Enable parallel processing of multiple pages in a PDF document in a single request | bool |
[!TIP] For more details on custom model configuration, i.e. `openai_config`, `gemini_config`, and `ollama_config`, please refer to Model Configuration.
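The value types in the table above can be checked before a parser is constructed, so that a mistyped argument fails early with a readable message. The sketch below is an illustration under stated assumptions: `PARAMETER_TYPES` and `check_parameters` are hypothetical names, the mapping is transcribed from the table, and `None` is accepted for `image_mode` because the basic usage example lists it as a valid value.

```python
# Transcribed from the customization-parameters table.
PARAMETER_TYPES = {
    "model_name": str,
    "custom_prompt": str,
    "ollama_config": dict,
    "openai_config": dict,
    "gemini_config": dict,
    "image_mode": str,
    "detailed_extraction": bool,
    "enable_concurrency": bool,
}


def check_parameters(kwargs):
    """Return a list of type mismatches; an empty list means none were found."""
    problems = []
    for name, value in kwargs.items():
        expected = PARAMETER_TYPES.get(name)
        if expected is None:
            continue  # not covered by the table (e.g. temperature, top_p)
        if name == "image_mode" and value is None:
            continue  # the usage example allows "url", "base64" or None
        if not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_parameters({"model_name": "gpt-4o", "enable_concurrency": "yes"}))
# prints: ['enable_concurrency: expected bool, got str']
```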
📊 Benchmarks
I benchmarked Vision Parse against MarkItDown and Nougat on a curated dataset of 100 diverse machine learning papers from arXiv. Since no other ground truth data is available for this task, I relied on the Marker library to generate the ground truth markdown-formatted data.
Results
| Parser | Accuracy Score |
|:-------:|:---------------:|
| Vision Parse | 92% |
| MarkItDown | 67% |
| Nougat | 79% |
[!NOTE] I used the gpt-4o model for Vision Parse to extract markdown content from the PDF documents, with the model parameter settings defined in the `scoring.py` script. The above results may vary depending on the model you choose for Vision Parse and the model parameter settings.
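This page does not describe how the accuracy score is computed; `scoring.py` is the authority on that. As a rough illustration of what comparing parser output against ground-truth markdown can look like, the sketch below uses the character-level similarity ratio from the standard library's `difflib`. It is a stand-in chosen for simplicity, not the metric behind the table above, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(markdown):
    """Collapse runs of whitespace so layout differences do not dominate the score."""
    return " ".join(markdown.split())


def similarity(predicted, reference):
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    return SequenceMatcher(None, normalize(predicted), normalize(reference)).ratio()


def mean_similarity(pairs):
    """Average the similarity over (predicted, reference) pairs; 0.0 for no pairs."""
    scores = [similarity(predicted, reference) for predicted, reference in pairs]
    return sum(scores) / len(scores) if scores else 0.0


pairs = [
    ("# Title\n\nSome text.", "# Title\nSome text."),
    ("| a | b |", "| a | c |"),
]
print(f"mean similarity: {mean_similarity(pairs):.2f}")
```

Because the ground truth here would itself come from another parser, a score like this measures agreement with that parser rather than absolute correctness.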
Run Your Own Benchmarks
You can benchmark the performance of Vision Parse on your machine using your own dataset. Run scoring.py to generate a detailed comparison report in the output directory.
Install packages from requirements.txt:
```bash
pip install --no-cache-dir -r benchmarks/requirements.txt
```
Run the benchmark script:
```bash
# Change `pdf_path` to your pdf file path and `benchmark_results_path` to your desired output path
python benchmarks/scoring.py
```
🤝 Contributing
Contributions to Vision Parse are welcome! Whether you're fixing bugs, adding new features, or creating example notebooks, your help is appreciated. Please check out the contributing guidelines for instructions on setting up the development environment, code style requirements, and the pull request process.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
Owner
- Name: Arun Brahma
- Login: iamarunbrahma
- Kind: user
- Repositories: 46
- Profile: https://github.com/iamarunbrahma
Senior Machine Learning Engineer
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Brahma"
    given-names: "Arun"
title: "Vision-Parse: Parse PDFs into markdown using Vision LLMs"
date-released: 2024-12-31
url: "https://github.com/iamarunbrahma/vision-parse"
GitHub Events
Total
- Create event: 28
- Release event: 2
- Issues event: 36
- Watch event: 373
- Delete event: 14
- Issue comment event: 36
- Push event: 76
- Pull request review event: 4
- Pull request event: 40
- Fork event: 58
Last Year
- Create event: 28
- Release event: 2
- Issues event: 36
- Watch event: 373
- Delete event: 14
- Issue comment event: 36
- Push event: 76
- Pull request review event: 4
- Pull request event: 40
- Fork event: 58
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Arun Brahma | m****4@g****m | 88 |
| Mohamad Aljazaery | m****y@m****m | 2 |
| mark-beeby | 4****y | 1 |
| Ankit Thakur | a****5@i****m | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 23
- Total pull requests: 38
- Average time to close issues: 17 days
- Average time to close pull requests: about 6 hours
- Total issue authors: 20
- Total pull request authors: 6
- Average comments per issue: 1.13
- Average comments per pull request: 0.26
- Merged pull requests: 35
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 23
- Pull requests: 38
- Average time to close issues: 17 days
- Average time to close pull requests: about 6 hours
- Issue authors: 20
- Pull request authors: 6
- Average comments per issue: 1.13
- Average comments per pull request: 0.26
- Merged pull requests: 35
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- marmor7 (2)
- csv610 (2)
- marcelocecin (2)
- FLYmoments (1)
- ukos-git (1)
- fabianonethousand (1)
- wns-saitej (1)
- kundeng (1)
- MNicholasPro (1)
- twoxfh (1)
- drmetro09 (1)
- pinkponk (1)
- Ronimsenn (1)
- MohamedAliRashad (1)
- gh-wf (1)
Pull Request Authors
- iamarunbrahma (30)
- ankitthakur (2)
- aldopareja (2)
- mark-beeby (2)
- maljazaery (2)
- rushabh31 (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 1,741 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 14
- Total maintainers: 1
pypi.org: vision-parse
Parse PDF documents into markdown formatted content using Vision LLMs
- Homepage: https://github.com/iamarunbrahma/vision-parse
- Documentation: https://vision-parse.readthedocs.io/
- License: MIT License
- Latest release: 0.1.13 (published 11 months ago)
Dependencies
- actions/checkout v4 composite
- github/codeql-action/analyze v3 composite
- github/codeql-action/init v3 composite
- python 3.13-slim build
- markitdown *
- nltk *
- python-Levenshtein *
- vision-parse *
- 104 dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- importlib-resources >=5.0.0; python_version < '3.9'
- jinja2 >=3.0.0
- ollama >=0.4.4
- pydantic >=2.0.0
- pymupdf >=1.22.0
- tqdm >=4.65.0