https://github.com/bytedance/dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.8%) to scientific vocabulary

Keywords

document-analysis layout-analysis ocr parser pdf pdf-converter pdf-parser python vlm-ocr

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: bytedance
License: MIT
Language: Python
Default Branch: master
Homepage:
Size: 10.9 MB

Statistics

Stars: 5,450
Watchers: 49
Forks: 433
Open Issues: 48
Releases: 0

Created 9 months ago · Last pushed 6 months ago

Metadata Files

Readme License

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model following an analyze-then-parse paradigm. This repository contains the demo code and pre-trained models for Dolphin.

📑 Overview

Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Dolphin addresses these challenges through a two-stage approach:

🔍 Stage 1: Comprehensive page-level layout analysis that generates a sequence of layout elements in natural reading order
🧩 Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
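The two stages described above can be sketched in a few lines of Python. This is only an illustration of the analyze-then-parse idea, not Dolphin's actual implementation: the `Element` class, the `PROMPTS` table, the prompt strings, and the `analyze_layout` / `parse_elements` callables are all hypothetical stand-ins for the model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) region used as the anchor


@dataclass
class Element:
    label: str   # element type, e.g. "text", "table", "formula"
    bbox: BBox   # where the element sits on the page
    order: int   # position in natural reading order


# Hypothetical task-specific prompts, one per element type.
PROMPTS: Dict[str, str] = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}


def parse_page(
    page: object,
    analyze_layout: Callable[[object], Sequence[Element]],
    parse_elements: Callable[[object, List[BBox], str], List[str]],
) -> List[dict]:
    # Stage 1: layout analysis yields elements, kept in reading order.
    elements = sorted(analyze_layout(page), key=lambda e: e.order)

    # Stage 2: group element indices by type so that each group shares
    # one prompt and can be decoded together in a single batched call.
    groups: Dict[str, List[int]] = {}
    for idx, element in enumerate(elements):
        groups.setdefault(element.label, []).append(idx)

    contents = [""] * len(elements)
    for label, indices in groups.items():
        prompt = PROMPTS.get(label, PROMPTS["text"])
        boxes = [elements[i].bbox for i in indices]
        for i, output in zip(indices, parse_elements(page, boxes, prompt)):
            contents[i] = output

    # Reassemble the parsed elements in reading order.
    return [
        {"label": e.label, "bbox": e.bbox, "content": c}
        for e, c in zip(elements, contents)
    ]
```

Passing the two model calls in as functions keeps the sketch runnable with simple stubs; in a real system both would be served by the same vision-language model, as the overview states.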

🚀 Demo

Try our demo on Demo-Dolphin.

📅 Changelog

🔥 2025.07.10 Released the Fox-Page Benchmark, a manually refined subset of the original Fox dataset. Download via: Baidu Yun | Google Drive.
🔥 2025.06.30 Added TensorRT-LLM support for accelerated inference!
🔥 2025.06.27 Added vLLM support for accelerated inference!
🔥 2025.06.13 Added multi-page PDF document parsing capability.
🔥 2025.05.21 Our demo is released at link. Check it out!
🔥 2025.05.20 The pretrained model and inference code of Dolphin are released.
🔥 2025.05.16 Our paper has been accepted by ACL 2025. Paper link: arXiv.

🛠️ Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/ByteDance/Dolphin.git
   cd Dolphin
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the pre-trained models using one of the following options:

Option A: Original Model Format (config-based)

Download the model files from Baidu Yun or Google Drive and put them in the ./checkpoints folder.

Option B: Hugging Face Model Format

Visit our Hugging Face model card, or download the model with:

```bash
# Download the model from Hugging Face Hub
git lfs install
git clone https://huggingface.co/ByteDance/Dolphin ./hf_model

# Or use the Hugging Face CLI
pip install huggingface_hub
huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model
```

⚡ Inference

Dolphin provides two inference frameworks with support for two parsing granularities:

- Page-level Parsing: Parse the entire document page into a structured JSON and Markdown format
- Element-level Parsing: Parse individual document elements (text, table, formula)

📄 Page-level Parsing

Using Original Framework (config-based)

```bash
# Process a single document image
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process a single document pdf
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs/page_6.pdf --save_dir ./results

# Process all documents in a directory
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page.py --config ./config/Dolphin.yaml --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 8
```
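To get a feel for what the batch size option controls, the following sketch shows how a list of element crops could be split into batches of at most a given size. It assumes the elements are decoded in consecutive chunks, which this README does not spell out; the `chunk` helper and the `element_N` names are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches holding at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


# 20 placeholder element crops decoded with a batch size of 8.
crops = [f"element_{i}" for i in range(20)]
batch_sizes = [len(batch) for batch in chunk(crops, 8)]
print(batch_sizes)  # [8, 8, 4]
```

Under that assumption, a larger batch size means fewer decoding passes per page at the cost of more memory per pass.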

Using Hugging Face Framework

```bash
# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process a single document pdf
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_6.pdf --save_dir ./results

# Process all documents in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16
```

🧩 Element-level Parsing

Using Original Framework (config-based)

```bash
# Process a single table image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element.py --config ./config/Dolphin.yaml --input_path ./demo/element_imgs/para_1.jpg --element_type text
```

Using Hugging Face Framework

```bash
# Process a single table image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/table_1.jpeg --element_type table

# Process a single formula image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/line_formula.jpeg --element_type formula

# Process a single text paragraph image
python demo_element_hf.py --model_path ./hf_model --input_path ./demo/element_imgs/para_1.jpg --element_type text
```
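When element-level parsing is driven from another Python program, it can help to build the command line in one place and reject unsupported element types early. The helper below is a minimal sketch: `build_element_command` is a hypothetical name, and it assumes the script is `demo_element_hf.py` with the flags spelled `--model_path`, `--input_path`, and `--element_type`, and that the element type is one of the three listed in the Inference section.

```python
import shlex
import sys
from typing import List

# Element types named in this README for element-level parsing.
ELEMENT_TYPES = ("text", "table", "formula")


def build_element_command(
    input_path: str,
    element_type: str,
    model_path: str = "./hf_model",
    script: str = "demo_element_hf.py",  # assumed script name
) -> List[str]:
    """Return an argv list for one element-level parsing run."""
    if element_type not in ELEMENT_TYPES:
        raise ValueError(
            f"element_type must be one of {ELEMENT_TYPES}, got {element_type!r}"
        )
    return [
        sys.executable, script,
        "--model_path", model_path,
        "--input_path", input_path,
        "--element_type", element_type,
    ]


cmd = build_element_command("./demo/element_imgs/table_1.jpeg", "table")
# Show the command without the interpreter path; nothing is executed here.
print("python", shlex.join(cmd[1:]))
```

To actually run it from the repository root, the list can be passed to `subprocess.run(cmd, check=True)`; an argv list avoids shell quoting problems with paths that contain spaces.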

🌟 Key Features

🔄 Two-stage analyze-then-parse approach based on a single VLM
📊 Promising performance on document parsing tasks
🔍 Natural reading order element sequence generation
🧩 Heterogeneous anchor prompting for different document elements
⏱️ Efficient parallel parsing mechanism
🤗 Support for Hugging Face Transformers for easier integration

📮 Notice

Call for Bad Cases: If you encounter cases where the model performs poorly, we would greatly appreciate it if you could share them in an issue. We are continuously working to optimize and improve the model.

💖 Acknowledgement

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:

- Donut
- Nougat
- GOT
- MinerU
- Swin
- Hugging Face Transformers

📝 Citation

If you find this code useful for your research, please use the following BibTeX entry.

```bibtex
@article{feng2025dolphin,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and others},
  journal={arXiv preprint arXiv:2505.14059},
  year={2025}
}
```

Owner

Name: Bytedance Inc.
Login: bytedance
Kind: organization
Location: Singapore

Website: https://opensource.bytedance.com
Twitter: ByteDanceOSS
Repositories: 255
Profile: https://github.com/bytedance

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 89
Total pull requests: 11
Average time to close issues: 11 days
Average time to close pull requests: about 15 hours
Total issue authors: 82
Total pull request authors: 10
Average comments per issue: 0.74
Average comments per pull request: 0.09
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 89
Pull requests: 11
Average time to close issues: 11 days
Average time to close pull requests: about 15 hours
Issue authors: 82
Pull request authors: 10
Average comments per issue: 0.74
Average comments per pull request: 0.09
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

williamlzw (3)
shengjie110 (2)
VincentG1234 (2)
xueanxi (2)
xaswq (2)
chaoStart (2)
fafner32 (1)
benzi604 (1)
alowkii (1)
james-li (1)
aceliuchanghong (1)
dukehxin (1)
nitin456 (1)
1148270327 (1)
berkus (1)

Pull Request Authors

hanyd2010 (2)
lemonguess (1)
Sam1320 (1)
patchy631 (1)
Ivan-Inby (1)
Aakashjammula (1)
xiaolonggee (1)
ktyptorio (1)
weburnit (1)
WinJayX (1)
