layoutreader

A faster LayoutReader model based on LayoutLMv3 that sorts OCR bboxes into reading order.

https://github.com/ppaanngggg/layoutreader

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.5%) to scientific vocabulary

Keywords

huggingface layoutlmv3 layoutreader transformers

Repository

Basic Info

Host: GitHub
Owner: ppaanngggg
License: other
Language: Python
Default Branch: main
Homepage: https://huggingface.co/hantian/layoutreader
Size: 1.53 MB

Statistics

Stars: 268
Watchers: 5
Forks: 20
Open Issues: 1
Releases: 0

Topics

huggingface layoutlmv3 layoutreader transformers

Created over 2 years ago · Last pushed 11 months ago

Metadata Files

Readme Funding License Citation

LayoutReader

✨ Trusted by MinerU and MonkeyOCR.

Why this repo?

The original LayoutReader was published by Microsoft Research. It is based on LayoutLM and uses a seq2seq architecture to predict the reading order of the words in a document. The original repo has several problems:

1. Because it doesn't use transformers, the code is full of experiments and is not well organized. It's hard to train and deploy.
2. seq2seq is too slow in production; I want to get all the predictions in one pass.
3. The pre-trained model takes English word-level input, which doesn't match real usage. Real inputs are the spans extracted by a PDF parser or OCR.
4. I want a multilingual model. I noticed that using only the bbox is only slightly worse than bbox+text, so I want to train a model that uses only the bbox and ignores the text.

What did I do?

- Refactored the code to use LayoutLMv3ForTokenClassification from transformers for training and evaluation.
- Provided a script that turns the original word-level dataset into a span-level dataset.
- Implemented a better post-processor to avoid duplicate predictions.
- Released a pre-trained model fine-tuned from layoutlmv3-large.
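To illustrate why a post-processor is needed: if each box simply takes its highest-scoring position, two boxes can land on the same position. Below is a minimal sketch of one way to force a one-to-one assignment. It is not the repo's actual post-processor; the name `assign_unique_orders` and the convention that `scores[i][j]` is the score of box `i` at position `j` are assumptions made for this example.

```python
from typing import List


def assign_unique_orders(scores: List[List[float]]) -> List[int]:
    """Greedy one-to-one assignment of boxes to reading positions.

    Assumption: scores is a square n x n matrix where scores[i][j] is the
    score for box i sitting at position j. Returns orders such that
    orders[i] is the position given to box i, with no position used twice.
    """
    n = len(scores)
    # Every (score, box, position) candidate, best score first.
    candidates = sorted(
        ((scores[i][j], i, j) for i in range(n) for j in range(n)),
        key=lambda c: -c[0],
    )
    orders = [-1] * n      # -1 means "box not placed yet"
    taken = [False] * n    # taken[j] is True once position j is used
    for _, box, pos in candidates:
        if orders[box] == -1 and not taken[pos]:
            orders[box] = pos
            taken[pos] = True
    return orders


# Boxes 0 and 1 both prefer position 0; the weaker one falls back to position 1.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.2, 0.7],
]
print(assign_unique_orders(scores))
```

The greedy pass is quadratic in the number of boxes after sorting, which is usually acceptable for a single page.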

How to use?

```python
from transformers import LayoutLMv3ForTokenClassification
from v3.helpers import prepare_inputs, boxes2inputs, parse_logits

model = LayoutLMv3ForTokenClassification.from_pretrained("hantian/layoutreader")

# list of [left, top, right, bottom] bboxes of spans; values should range from 0 to 1000
boxes = [[...], ...]
inputs = boxes2inputs(boxes)
inputs = prepare_inputs(inputs, model)
logits = model(**inputs).logits.cpu().squeeze(0)
orders = parse_logits(logits, len(boxes))
print(orders)
# [0, 1, 2, ...]
```

Or you can run python main.py to serve the model.
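Since the model expects coordinates in the 0 to 1000 range, boxes in pixel or PDF-point coordinates need to be rescaled first. The following is a small sketch of that step; `normalize_boxes`, `page_width`, and `page_height` are hypothetical names for this example and are not part of the repo.

```python
from typing import List, Sequence


def normalize_boxes(
    boxes: Sequence[Sequence[float]], page_width: float, page_height: float
) -> List[List[int]]:
    """Scale [left, top, right, bottom] boxes from page units into 0..1000."""
    x_scale = 1000.0 / page_width
    y_scale = 1000.0 / page_height
    normalized = []
    for left, top, right, bottom in boxes:
        scaled = [left * x_scale, top * y_scale, right * x_scale, bottom * y_scale]
        # Round to integers, then clamp so boxes slightly off the page stay valid.
        normalized.append([min(1000, max(0, round(v))) for v in scaled])
    return normalized


# Two boxes on a 612 x 792 page (US Letter in PDF points).
print(normalize_boxes([[0, 0, 612, 792], [306, 396, 612, 800]], 612, 792))
```

The returned list has the same shape as the `boxes` argument passed to `boxes2inputs`.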

Dataset

Download Original Dataset

The original dataset can be downloaded from ReadingBank. More details can be found in the original repo.

Build Span-Level Dataset

```bash
unzip ReadingBank.zip
python tools.py ./train/ train.jsonl.gz
python tools.py ./dev/ dev.jsonl.gz
python tools.py ./test/ test.jsonl.gz --src-shuffle-rate=0
python tools.py ./test/ test_shuf.jsonl.gz --src-shuffle-rate=1
```
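As a rough picture of what "span-level" means, the sketch below merges consecutive words that sit on the same line into one span whose bbox is the union of the word boxes. This is an illustration only: the grouping rule, the `max_gap` threshold, and the name `merge_words_into_spans` are assumptions, and the rule actually used by tools.py may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def merge_words_into_spans(
    words: List[Tuple[str, Box]], max_gap: int = 10
) -> List[Tuple[str, Box]]:
    """Merge consecutive words on the same line into spans.

    Hypothetical rule: a word joins the current span when their vertical
    ranges overlap and the word starts at most max_gap units to the right
    of the span's right edge.
    """
    spans: List[Tuple[str, Box]] = []
    for text, (l, t, r, b) in words:
        if spans:
            s_text, (sl, st, sr, sb) = spans[-1]
            same_line = t < sb and b > st
            close = l >= sl and l - sr <= max_gap
            if same_line and close:
                # Extend the span text and take the union of both boxes.
                union = (min(sl, l), min(st, t), max(sr, r), max(sb, b))
                spans[-1] = (s_text + " " + text, union)
                continue
        spans.append((text, (l, t, r, b)))
    return spans


words = [
    ("Hello", (10, 10, 50, 20)),
    ("world", (55, 10, 100, 20)),
    ("Next", (10, 40, 45, 50)),
]
print(merge_words_into_spans(words))
```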

Train & Eval

The core code is in the ./v3 folder. train.sh and eval.py are the entry points.

```bash
bash train.sh
python eval.py ../test.jsonl.gz hantian/layoutreader
python eval.py ../test_shuf.jsonl.gz hantian/layoutreader
```

Span-Level Results

- shuf indicates whether the input order is shuffled.
- BLEU Idx is the BLEU score of the predicted tokens' order.
- BLEU Text is the BLEU score of the final merged text.
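To make the shuffled setting concrete, here is a minimal sketch of shuffling a sample's input order with a given probability while keeping each box paired with its target reading position. The function name `maybe_shuffle` and its signature are hypothetical; this is not the code behind `--src-shuffle-rate`.

```python
import random
from typing import List, Sequence, Tuple


def maybe_shuffle(
    boxes: Sequence[Sequence[int]],
    targets: Sequence[int],
    shuffle_rate: float,
    rng: random.Random,
) -> Tuple[List[Sequence[int]], List[int]]:
    """Shuffle the input order of one sample with probability shuffle_rate.

    targets[i] is assumed to be the reading position of boxes[i]. Boxes and
    targets are permuted together, so the pairing is never broken.
    """
    indices = list(range(len(boxes)))
    if rng.random() < shuffle_rate:
        rng.shuffle(indices)
    return [boxes[i] for i in indices], [targets[i] for i in indices]


rng = random.Random(0)
boxes = [[0, 0, 10, 10], [20, 0, 30, 10], [0, 20, 10, 30]]
targets = [0, 1, 2]
print(maybe_shuffle(boxes, targets, 0.0, rng))  # rate 0: order is kept
print(maybe_shuffle(boxes, targets, 1.0, rng))  # rate 1: order is always permuted by rng.shuffle
```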

I only trained the layout-only model and tested it on the span-level dataset, so the Heuristic Method result is quite different from the original word-level result. I mainly focus on BLEU Text, which is only a bit lower than the original word-level result, while inference is much faster.

| Method                     | shuf | BLEU Idx | BLEU Text |
|----------------------------|------|----------|-----------|
| Heuristic Method           | no   | 44.4     | 70.7      |
| LayoutReader (layout only) | no   | 94.9     | 97.5      |
| LayoutReader (layout only) | yes  | 94.8     | 97.4      |

Word-Level Results

My eval script

I trained the layout-only model myself using the original code; the public model is the released pre-trained model. The layout-only model is nearly as good as the public model, and shuffling has only a small effect on the results.

Only the first part of the test dataset was evaluated, because the evaluation is too slow.

| Method                      | shuf | BLEU Idx | BLEU Text |
|-----------------------------|------|----------|-----------|
| Heuristic Method            | no   | 78.3     | 79.4      |
| LayoutReader (layout only)  | no   | 98.0     | 98.2      |
| LayoutReader (layout only)  | yes  | 97.8     | 98.0      |
| LayoutReader (public model) | no   | 98.0     | 98.3      |

Old eval script (copied from the original paper)

Evaluation results of the LayoutReader on the reading order detection task, where the source-side of training/testing data is in the left-to-right and top-to-bottom order

| Method                     | Encoder                | BLEU   | ARD  |
|----------------------------|------------------------|--------|------|
| Heuristic Method           | -                      | 0.6972 | 8.46 |
| LayoutReader (layout only) | LayoutLM (layout only) | 0.9732 | 2.31 |
| LayoutReader               | LayoutLM               | 0.9819 | 1.75 |

Input order study with left-to-right and top-to-bottom inputs in evaluation, where r is the proportion of shuffled samples in training.

| Method                     | BLEU (r=100%) | BLEU (r=50%) | BLEU (r=0%) | ARD (r=100%) | ARD (r=50%) | ARD (r=0%) |
|----------------------------|---------------|--------------|-------------|--------------|-------------|------------|
| LayoutReader (layout only) | 0.9701        | 0.9729       | 0.9732      | 2.85         | 2.61        | 2.31       |
| LayoutReader               | 0.9765        | 0.9788       | 0.9819      | 2.50         | 2.24        | 1.75       |

Input order study with token-shuffled inputs in evaluation, where r is the proportion of shuffled samples in training.

| Method                     | BLEU (r=100%) | BLEU (r=50%) | BLEU (r=0%) | ARD (r=100%) | ARD (r=50%) | ARD (r=0%) |
|----------------------------|---------------|--------------|-------------|--------------|-------------|------------|
| LayoutReader (layout only) | 0.9718        | 0.9714       | 0.1331      | 2.72         | 2.82        | 105.40     |
| LayoutReader               | 0.9772        | 0.9770       | 0.1783      | 2.48         | 2.46        | 72.94      |
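For readers unfamiliar with ARD (Average Relative Distance), the sketch below follows the definition in the original LayoutReader paper as I understand it: each element of the reference sequence contributes the absolute difference between its index in the reference and its index in the prediction, an element missing from the prediction contributes the sequence length n as a penalty, and the sum is divided by n. The function name is hypothetical, elements are assumed to be unique, and this is not the evaluation script that produced the numbers in the tables.

```python
from typing import Hashable, Sequence


def average_relative_distance(
    reference: Sequence[Hashable], predicted: Sequence[Hashable]
) -> float:
    """ARD between a reference order and a predicted order (lower is better)."""
    n = len(reference)
    if n == 0:
        return 0.0
    # Index of every element in the predicted sequence (elements assumed unique).
    position = {element: idx for idx, element in enumerate(predicted)}
    total = 0
    for k, element in enumerate(reference):
        if element in position:
            total += abs(k - position[element])
        else:
            total += n  # penalty for an element the prediction omitted
    return total / n


print(average_relative_distance([0, 1, 2, 3], [0, 1, 2, 3]))  # identical orders
print(average_relative_distance([0, 1, 2, 3], [1, 0, 2, 3]))  # one adjacent swap
```

Under this definition, the ARD above 100 in the r=0% column corresponds to elements sitting, on average, more than a hundred positions away from where they belong.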

Citation

If this model helps you, please cite it.

```bibtex
@software{Pang_Faster_LayoutReader_based_2024,
  author = {Pang, Hantian},
  month = feb,
  title = {{Faster LayoutReader based on LayoutLMv3}},
  url = {https://github.com/ppaanngggg/layoutreader},
  version = {1.0.0},
  year = {2024}
}
```

Owner

Name: ppaanngggg
Login: ppaanngggg
Kind: user
Location: China
Company: Shanghai Tianrang.inc

Repositories: 30
Profile: https://github.com/ppaanngggg

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If this model helps you, please cite it."
authors:
- family-names: "Pang"
  given-names: "Hantian"
title: "Faster LayoutReader based on LayoutLMv3"
version: 1.0.0
date-released: 2024-02-28
url: "https://github.com/ppaanngggg/layoutreader"

GitHub Events

Total

Issues event: 22
Watch event: 171
Issue comment event: 68
Push event: 3
Fork event: 14

Last Year

Issues event: 22
Watch event: 171
Issue comment event: 68
Push event: 3
Fork event: 14

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 9
Total pull requests: 0
Average time to close issues: 4 days
Average time to close pull requests: N/A
Total issue authors: 9
Total pull request authors: 0
Average comments per issue: 6.22
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 9
Pull requests: 0
Average time to close issues: 4 days
Average time to close pull requests: N/A
Issue authors: 9
Pull request authors: 0
Average comments per issue: 6.22
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

myhloli (3)
chinaphilip (2)
shoshinL (1)
vikas-singh16 (1)
puppyapple (1)
GarinLL (1)
animebing (1)
xiaodongxi121 (1)
cnr0724 (1)
LittleNoob2333 (1)
milk333445 (1)
zhaomaoyuan (1)
gheyret (1)
huihuiustc (1)
firewox (1)

Dependencies

requirements.txt pypi

deepspeed *
loguru *
nltk *
opencv-python *
rich *
tqdm *
transformers *
typer *