laypa
Layout analysis to find layout elements in documents (similar to P2PaLA)
Repository
Basic Info
- Host: GitHub
- Owner: stefanklut
- License: mit
- Language: Python
- Default Branch: main
- Size: 213 MB
Statistics
- Stars: 19
- Watchers: 4
- Forks: 8
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
Laypa
Laypa: A Novel Framework for Applying Segmentation Networks to Historical Documents
HIP'23 paper: https://doi.org/10.1145/3604951.3605520
ArXiv paper: Coming soon!
Part of the Loghi pipeline
Laypa is a segmentation network whose goal is to find regions (paragraph, page number, etc.) and baselines in documents. The current approach uses a ResNet backbone and a feature pyramid head, which together make pixel-wise classifications. The models are built using the detectron2 framework. The baseline and region classifications are then made available for further processing. This post-processing turns the classifications into instances, so that they can be used by other programs (OCR/HTR), either as masks or directly as PageXML.
Tested Environments
Developed using the following software and hardware:

| Operating System | Python | PyTorch | Cudatoolkit | GPU | CPU | Success |
| --- | --- | --- | --- | --- | --- | --- |
| Ubuntu 22.04.4 LTS (Linux-6.5.0-28-generic-x86_64-with-glibc2.35) | 3.12.3 | 2.3.0 | 12.1 | NVIDIA GeForce RTX 3080 Ti Laptop GPU | 12th Gen Intel(R) Core(TM) i9-12900H | :white_check_mark: |

More coming soon. Run `tooling/collect_env_info.py` to retrieve your environment information, and add it via pull request.
Setup
The recommended way of running Laypa is inside a conda environment. For easier compatibility, a method of building a docker image is also provided.
To start, clone the GitHub repository to your local machine using either HTTPS:
```sh
git clone https://github.com/stefanklut/laypa.git
```
Or using SSH:
```sh
git clone git@github.com:stefanklut/laypa.git
```
And make laypa the working directory:
```sh
cd laypa
```
Conda
If not already installed, install either conda or miniconda (install instructions), or mamba (install instructions).
The required packages are listed in the environment.yml file. The environment can be automatically created using the following commands.
Using conda/miniconda:
```sh
conda env create -f environment.yml
```
Using mamba:
```sh
mamba env create -f environment.yml
```
When running Laypa, always activate the conda environment:
```sh
conda activate laypa
```
Docker
If not already installed, install the Docker Engine (install instructions). The docker environment can most easily be built with the provided script.
Download from dockerhub
Laypa now has a release on dockerhub. Using the docker image loghi/docker.laypa should pull the corresponding Laypa docker directly from Docker Hub. If this fails for some reason, it can be pulled manually from here. If it is outdated or you require changes to the source code, please try the Manual Installation.
Manual Installation
Building the docker using the provided script:
```sh
./buildImage.sh <PATH_TO_LAYPA>
```
Or the multistage build with some profiler tools taken out (might be smaller):
```sh
./buildImage.multistage.sh <PATH_TO_LAYPA>
```
Manual docker install instructions (not recommended)
First copy the Laypa directory to the temporary docker directory:
```sh
tmp_dir=$(mktemp -d)
cp -r -T …
```
Minikube install instructions
Minikube is local Kubernetes, allowing you to test the Laypa tools in a Kubernetes environment. If not already installed, start by installing minikube (install instructions). If the docker images have already been built, minikube can run them straight away. To do so, start minikube without any special arguments:
```sh
minikube start
```
Afterwards the docker for Laypa can be added to the running minikube instance using the following command (assuming the Laypa docker was built under the name loghi/docker.laypa):
```sh
minikube image load loghi/docker.laypa
```
It is also possible to build the Laypa docker using the minikube docker instance. This means minikube will need access to the Laypa code. As it stands, this is currently still done using a copy command from the local storage. In order to do so, start minikube with the mount argument:
```sh
minikube start --mount
```
This will make the machine's filesystem available to minikube. Then ssh into the running minikube:
```sh
minikube ssh
```
Within the ssh minikube go to the location of the laypa where the host `/home/…`
When successful, the docker image should be available under the name loghi/docker.laypa. This can be verified using the following command:
```sh
docker image ls
```
Then check whether loghi/docker.laypa is present in the list of built images.
Pretrained models
Some initial pretrained models can be found here.
Dataset(s)
The dataset used for training requires images combined with ground truth PageXML. For structure the PageXML needs to be inside a directory one level down from the images. The dataset can be split over multiple directories, with the image paths specified in a .txt file. The structure should look as follows:
```text
training_data
├── page
│   ├── image1.xml
│   ├── image2.xml
│   ├── image3.xml
│   └── ...
├── image1.jpg
├── image2.jpg
├── image3.jpg
└── ...
```
Where the image and PageXML filename stems should match image1.jpg <-> image1.xml. For the .txt based dataset absolute paths to the images are recommended. The structure for the data used as validation is the same as that for training.
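Before training, it can help to check that every image has its matching PageXML. The following is a minimal sketch of such a check, not part of Laypa itself: the helper name `find_missing_pagexml` and the set of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_missing_pagexml(data_dir):
    """Return the images in data_dir that have no page/<stem>.xml file."""
    data_dir = Path(data_dir)
    missing = []
    for image_path in sorted(data_dir.iterdir()):
        # Skip the page directory and any non-image files.
        if not image_path.is_file() or image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # image1.jpg <-> page/image1.xml
        xml_path = data_dir / "page" / f"{image_path.stem}.xml"
        if not xml_path.is_file():
            missing.append(image_path)
    return missing


if __name__ == "__main__":
    for path in find_missing_pagexml("training_data"):
        print(f"No PageXML found for {path}")
```

An empty result means the directory follows the layout described above.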
When running inference, the images you want processed should be in a single directory, with the images directly under the root folder as follows:
```text
inference_data
├── image1.jpg
├── image2.jpg
├── image3.jpg
└── ...
```
Some datasets that should work with Laypa are listed below; some preprocessing may be required:
- cBAD
- VOC and notarial deeds
- OHG
- Bozen
Training
Two things are required to train a model using train.py.
1. A config file. See configs/segmentation for examples of config files and their contents.
2. Ground truth training/validation data in the form of images and their corresponding PageXML. The training/validation data can be provided by giving either a .txt file containing image paths, the image paths themselves, or the path of a directory containing the images.
Required arguments:
```sh
python train.py \
    -c/--config <CONFIG> \
    -t/--train <TRAIN [TRAIN ...]> \
    -v/--val <VAL [VAL ...]>
```
Optional arguments:
```sh
python train.py \
    -c/--config CONFIG \
    -t/--train TRAIN [TRAIN ...] \
    -v/--val VAL [VAL ...] \
    [--tmp_dir TMP_DIR] \
    [--keep_tmp_dir] \
    [--num-gpus NUM_GPUS] \
    [--num-machines NUM_MACHINES] \
    [--machine-rank MACHINE_RANK] \
    [--dist-url DIST_URL] \
    [--opts OPTS [OPTS ...]]
```
The optional arguments are shown using square brackets. The `--tmp_dir` parameter specifies a folder in which to store temporary files, while the `--keep_tmp_dir` parameter prevents the temporary files from being deleted after a run (mostly for debugging). The remaining arguments are all for training with multiple GPUs or on multiple nodes. `--num-gpus` specifies the number of GPUs per machine. `--num-machines` specifies the number of nodes in the network. `--machine-rank` gives a node a unique number. `--dist-url` is the URL for the PyTorch distributed backend. The final parameter `--opts` allows you to change values specified in the config files. For example, `--opts SOLVER.IMS_PER_BATCH 8` sets the batch size to 8.
As indicated by the trailing dots, multiple training sets can be passed to the training model at once. This can also be done by using the train argument multiple times. The .txt files can also be mixed with the directories. For example:
```sh
# Pass multiple directories at once
python train.py -c config.yml -t data/trainingdir1 data/trainingdir2 -v data/validation_set
# Pass multiple directories with multiple arguments
python train.py -c config.yml -t data/trainingdir1 -t data/trainingdir2 -v data/validation_set
# Mix training directory with txt file
python train.py -c config.yml -t data/trainingdir -t data/trainingfile.txt -v data/validation_set
```
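A .txt file of image paths, such as the `data/trainingfile.txt` used above, can be generated with a small script. The sketch below assumes one absolute image path per line; that layout, the helper name `write_image_list`, and the extension set are assumptions for illustration and are not taken from the Laypa code.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def write_image_list(image_dirs, txt_path):
    """Write the absolute paths of all images in image_dirs to txt_path."""
    paths = []
    for image_dir in image_dirs:
        for path in sorted(Path(image_dir).iterdir()):
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                # Absolute paths are recommended for .txt based datasets.
                paths.append(path.resolve())
    # Assumption: one path per line.
    Path(txt_path).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return paths


if __name__ == "__main__":
    write_image_list(["data/trainingdir1", "data/trainingdir2"], "data/trainingfile.txt")
```

The matching PageXML files still need to sit in a page directory next to each listed image.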
[!TIP] See the tips and tricks section below for more information on how to train a model.
Tips and Tricks
- When a model's output is close to what you want, but not quite there yet, training the model from scratch can be a waste of time. Instead, you can finetune the existing model with ground truth that better matches your use case. This can be done by changing the `MODEL.WEIGHTS` parameter in the config file to the path of the existing model, or by using the `--opts` parameter to change the weights path (for example `--opts MODEL.WEIGHTS …`).
Inference
To run the trained model on images without ground truth, the images need to be in a single directory. The output consists of either PageXML in the case of regions or a mask in the other cases. This mask can then be processed using other tools to turn the pixel predictions into valid PageXML (for example on baselines). As stated, the regions are turned into polygons for the PageXML within the program already.
How to run the Laypa inference individually will be explained first; how to run it with the full scripts, which include the conversion from images to PageXML, will come after.
Without External Processing
To just run the Laypa inference in inference.py, you need three things:
1. A config file. See configs/segmentation for examples of config files and their contents.
2. The data can be provided by giving either a .txt file containing image paths, the image paths themselves, or the path of a directory containing the images.
3. A location to which the processed files can be written. The directory will be created if it does not exist yet.
Required arguments:
```sh
python inference.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    -o/--output OUTPUT
```
Optional arguments:
```sh
python inference.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    -o/--output OUTPUT \
    [--opts OPTS [OPTS ...]]
```
The optional arguments are shown using square brackets. The final parameter --opts allows you to change values specified in the config files. For example, --opts SOLVER.IMS_PER_BATCH 8 sets the batch size to 8.
List values have to be overridden by encapsulating the whole list with quotes like --opts PREPROCESS.REGION.RECTANGLE_REGIONS '["Photo"]'
To set what weights the model should use, the MODEL.WEIGHTS parameter in the config file should be set to the path of the weights file. If the weights are not in the config file, the weights can be set using the --opts parameter.
An example of how to call the inference.py command is given below:
```sh
python inference.py -c config.yml -i data/inference_dir -o results_dir
```
If setting the weights using the --opts parameter the command would look as follows:
```sh
python inference.py -c config.yml -i data/inference_dir -o results_dir --opts MODEL.WEIGHTS <PATH_TO_WEIGHTS>
```
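When inference is launched from another script, the `--opts` overrides can be assembled programmatically. The sketch below only builds the argument list; `build_inference_command` is a hypothetical helper, not part of Laypa. List values are encoded as a single JSON string, which matches the quoted-list form shown above for lists of strings or numbers.

```python
import json
import shlex


def build_inference_command(config, input_path, output_dir, opts=None):
    """Return the argument list for a `python inference.py` call."""
    command = ["python", "inference.py", "-c", config, "-i", input_path, "-o", output_dir]
    if opts:
        command.append("--opts")
        for key, value in opts.items():
            # A list must be passed as one argument, e.g. '["Photo"]'.
            # Assumption: JSON encoding is only used for lists of strings or numbers.
            if isinstance(value, (list, tuple)):
                value = json.dumps(list(value))
            command.extend([key, str(value)])
    return command


if __name__ == "__main__":
    command = build_inference_command(
        "config.yml",
        "data/inference_dir",
        "results_dir",
        opts={
            "MODEL.WEIGHTS": "weights.pth",  # placeholder path
            "PREPROCESS.REGION.RECTANGLE_REGIONS": ["Photo"],
        },
    )
    # shlex.join adds the shell quoting needed around the list value.
    print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues altogether.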
[!TIP] See the tips and tricks section below for more information on how to run the model.
Tips and Tricks
- You can run the model with lower GPU requirements by using AMP (Automatic Mixed Precision). This can be done by setting the `MODEL.AMP_TEST.ENABLED` parameter to `True` in the config file, or by using the `--opts` parameter to enable it (for example `--opts MODEL.AMP_TEST.ENABLED True`).
- Specify which GPU the model should run on using the environment variable `CUDA_VISIBLE_DEVICES`. This should be placed in front of the `python inference.py` command. For example, `CUDA_VISIBLE_DEVICES=0 python inference.py -c config.yml -i data/inference_dir -o results_dir` will run the model on GPU 0. To run on CPU, use `CUDA_VISIBLE_DEVICES="" python inference.py -c config.yml -i data/inference_dir -o results_dir`.
With External Java Processing
Examples of running the full pipeline (with processing of baselines) are present in the scripts directory. These scripts assume that the docker images for both Laypa and the loghi-tooling (Java post-processing) are available on your machine, and they will try to verify this. The Laypa docker image needs to be built with the pretrained models included.
To run the scripts only two things are needed:
1. A directory with images to be processed.
2. A location to which the processed files can be written. The directory will be created if it does not exist yet.
Required arguments:
```sh
./scripts/pipeline.sh <input> <output>
```
Optional arguments:
```sh
./scripts/pipeline.sh \
    <input> \
    <output> \
    -g/--gpu GPU
```
The required arguments are shown using angle brackets. The --gpu parameter specifies which GPU(s) are accessible to the docker containers. The default is all.
The positional arguments input and output refer to the input and output directory. An example of running one of the pipelines is shown below:
```sh
./scripts/pipeline.sh inference_dir results_dir
```
Flask Server
The Flask Server is set up to run the inference code in a Kubernetes environment. To run the Flask API, run the start_flask.sh script with the required environment variables set. These are generally set when starting the docker container, with values that depend on the container's internal file structure. To quickly test locally, you can run the start_flask_local.sh script, which sets the environment variables at runtime.
The flask server will run on port 5000 and can be called from outside using a curl command. When testing on a localhost the command will look as follows:
```sh
curl -X POST 'http://localhost:5000/predict' -F image=@<PATH_TO_IMAGE> -F identifier=<identifier> -F model=<MODEL_FOLDER_NAME>
```
Three form fields are required. The first is the image to be processed (image). The second is an identifier used to differentiate multiple runs/tests (identifier); it can be any string, but a UUID or a timestamp is recommended to ensure uniqueness. The third specifies which config and weights to use (model). The config and weights are saved in a folder, and the name of this folder is what needs to be provided. The folder should be relative to LAYPA_MODEL_BASE_PATH, which is given as an environment variable. So if LAYPA_MODEL_BASE_PATH is set to /models and the model is stored in /models/version1, then the model path is version1. The model folder should contain the config and weights files. The config file should be named config.yml and the weights file should end in .pth.
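The same request can be made from Python. The sketch below uses only the standard library to build the multipart form with the three fields described above; `build_predict_request` is a hypothetical helper, and the code makes no assumption about the format of the server's response.

```python
import urllib.request
import uuid
from pathlib import Path


def build_predict_request(image_path, model, identifier=None,
                          url="http://localhost:5000/predict"):
    """Build (but do not send) a multipart/form-data POST for /predict."""
    image_path = Path(image_path)
    # Fall back to a UUID, as recommended for unique identifiers.
    identifier = identifier or str(uuid.uuid4())
    boundary = uuid.uuid4().hex

    body = b""
    # Plain text form fields: identifier and model folder name.
    for name, value in (("identifier", identifier), ("model", model)):
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8")
    # File field: the raw image bytes.
    body += (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="image"; filename="{image_path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    body += image_path.read_bytes() + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")

    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return request, identifier
```

Passing the returned request to `urllib.request.urlopen` sends it. Keeping the returned identifier makes it possible to look up the status of the request later.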
To monitor a specific request, the identifier can be used to check the status of the request. This can be done using the following commands:
```sh
curl -X GET 'http://localhost:5000/status_info/<identifier>'
curl -X GET 'http://localhost:5000/status_info' -F identifier=<identifier>
```
This will return information about the request, such as the status of the request, the time it took to process the request, and what error occurred (if any). This information will be returned in JSON format.
To view a more general overview of the history or performance of the server, the following command can be used:
```sh
curl -X GET 'http://localhost:5000/prometheus'
```
This will give back the standard Prometheus metrics, as well as the current number of images in the queue, the number of images processed, the number of exceptions encountered, and information about how long images are in the queue and how long it took to process them. If you just want the current number of images in the queue, you can use the following command:
```sh
curl -X GET 'http://localhost:5000/queue_size'
```
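The output of the `/prometheus` endpoint is plain text, so it can be read into a dictionary for quick inspection. The parser below is a simplified sketch that assumes the standard Prometheus text exposition format without timestamps. The metric names in the example are hypothetical and are not the names Laypa actually exports.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition output into {sample_name: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Keep the label set as part of the key, e.g. name{label="x"}.
            end = line.rindex("}") + 1
            name, rest = line[:end], line[end:].split()
        else:
            name, *rest = line.split()
        if rest:
            samples[name] = float(rest[0])
    return samples


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    example = "\n".join([
        "# HELP images_in_queue Number of images waiting",
        "# TYPE images_in_queue gauge",
        "images_in_queue 3.0",
        'exceptions_total{type="ValueError"} 1.0',
    ])
    print(parse_prometheus_text(example))
```

For anything beyond a quick check, the `prometheus_client` package, which is already among the dependencies, ships a full parser in `prometheus_client.parser`.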
For Kubernetes checks there is a health check available, which can be called using the following command:
```sh
curl -X GET 'http://localhost:5000/health'
```
The health check returns a 200 OK if the server is running and ready to process requests, and a 500 if it is not.
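A script that waits for the server before sending images can wrap the health check in a small function. This is a minimal sketch using only the standard library. The helper name `is_healthy` and the `opener` parameter (there to make the function testable without a running server) are assumptions for illustration.

```python
import urllib.request


def is_healthy(base_url="http://localhost:5000", timeout=5.0,
               opener=urllib.request.urlopen):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urlopen raises HTTPError for a 500 response and URLError when the
        # server cannot be reached; both are subclasses of OSError.
        return False


if __name__ == "__main__":
    print("ready" if is_healthy() else "not ready")
```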
Docker API
To use the docker image as an API service, we recommend using docker compose. The configuration is provided in the docker-compose.yml file and can be run using the following command:
```sh
docker-compose up
```
Then request the API (in this example using curl) with the same arguments as the Flask server (see Flask Server).
The model base path is set in the docker-compose.yml file.
Tutorial
For a small tutorial using some concrete examples see the tutorial directory.
Evaluation
The Laypa repository also contains a few tools used to evaluate the results generated by the model.
The first tool is a visual comparison between the predictions of the model and the ground truth. This is done as an overlay of the classes over the original image. The overlay class names and colors are taken from the dataset catalog. The tool to do this is visualization.py. The visualization has almost the same arguments as the training command (train.py).
Required arguments:
```sh
python tooling/visualization.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...]
```
Optional arguments:
```sh
python tooling/visualization.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...] \
    [-o/--output OUTPUT] \
    [--tmp_dir TMP_DIR] \
    [--keep_tmp_dir] \
    [--opts OPTS [OPTS ...]] \
    [--sorted] \
    [--save SAVE]
```
The optional arguments are shown using square brackets. The `-o/--output` parameter specifies the output directory for the visualization masks. The `--tmp_dir` parameter specifies a folder in which to store temporary files, while the `--keep_tmp_dir` parameter prevents the temporary files from being deleted after a run (mostly for debugging). The parameter `--opts` allows you to change values specified in the config files. For example, `--opts SOLVER.IMS_PER_BATCH 8` sets the batch size to 8. The `--sorted` parameter sorts the images based on the order in the operating system. The `--save` parameter specifies what type of file the visualization should be saved as. The options are "pred" for the prediction, "gt" for the ground truth, "both" for both the prediction and the ground truth and "all" for all of the previous. If just `--save` is given the default is "all".
Example of running visualization.py:
```sh
python tooling/visualization.py -c config.yml -i input_dir
```
The visualization.py will then open a window with both the prediction and the ground truth side by side (if the ground truth exists), allowing for easier comparison. The visualization masks are created in the same way the preprocessing converts PageXML to masks.
The second tool validation.py is used to get the validation scores of a model. This is done by comparing the prediction of the model to the ground truth. The validation scores are the Intersection over Union (IoU) and Accuracy (Acc) scores. The tool requires the input directory (--input) where there is also a page folder inside the input folder. The page folder should contain the xmls with the ground truth baselines/regions. To run the validation tool use the following command:
Required arguments:
```sh
python tooling/validation.py \
    -c/--config CONFIG \
    -i/--input INPUT
```
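To make the two validation scores concrete, the sketch below computes per-class IoU and pixel accuracy for two small label masks in plain Python. It illustrates the metrics only and is not Laypa's implementation, which may differ in details such as how classes are averaged. The function name `iou_and_accuracy` is an assumption.

```python
def iou_and_accuracy(prediction, ground_truth, num_classes):
    """Return (per-class IoU list, pixel accuracy) for two equally sized masks.

    Both masks are nested lists of integer class labels.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    correct = 0
    total = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for pred, gt in zip(pred_row, gt_row):
            total += 1
            if pred == gt:
                # The pixel counts towards intersection and union of its class.
                correct += 1
                intersection[pred] += 1
                union[pred] += 1
            else:
                # The pixel counts towards the union of both classes involved.
                union[pred] += 1
                union[gt] += 1
    # Classes that occur in neither mask have no defined IoU.
    ious = [i / u if u else float("nan") for i, u in zip(intersection, union)]
    accuracy = correct / total if total else float("nan")
    return ious, accuracy


if __name__ == "__main__":
    prediction = [[0, 1], [1, 1]]
    ground_truth = [[0, 1], [0, 1]]
    print(iou_and_accuracy(prediction, ground_truth, num_classes=2))
```

In this example three of the four pixels agree, so the accuracy is 0.75, while the IoU is 1/2 for class 0 and 2/3 for class 1.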
Optional arguments:
```sh
python tooling/validation.py \
    -c/--config CONFIG \
    -i/--input INPUT \
    [--opts OPTS [OPTS ...]]
```
The optional arguments are shown using square brackets. The final parameter `--opts` allows you to change values specified in the config files. For example, `--opts MODEL.WEIGHTS …`
Optional arguments (tooling/xml_comparison.py):
```sh
python tooling/xml_comparison.py \
    -g/--gt GT [GT ...] \
    -i/--input INPUT [INPUT ...] \
    [-m/--mode {baseline,region,start,end,separator,baseline_separator}] \
    [--regions REGIONS [REGIONS ...]] \
    [--merge_regions [MERGE_REGIONS]] \
    [--region_type REGION_TYPE [REGION_TYPE ...]] \
    [-w/--line_width LINE_WIDTH]
```
The optional arguments are shown using square brackets. The `--mode` parameter specifies what type of prediction the model has to do. If the mode is region, the `--regions` argument specifies which regions need to be extracted from the PageXML (for example "page-number"). The `--merge_regions` argument then specifies if any of these regions need to be merged. This could mean converting "insertion" into "resolution", since they refer to the same thing (`resolution:insertion`). The final region argument is `--region_type`, which can specify the region type of a region. In the other modes lines are used. The line arguments are `--line_width`, which specifies the line width, and `--line_color`, which specifies the line color.
The final tool is a program for showing the PageXML as mask images. This can help with showing how the PageXML regions and baselines look. This can be done in gray scale, color, or as a colored overlay over the original image. This tool is located in the xml_viewer.py file. It requires an input directory (--input) argument and output directory (--output) argument.
Required arguments:
```sh
python tooling/xml_viewer.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...] \
    -o/--output OUTPUT [OUTPUT ...]
```
Optional arguments:
```sh
python tooling/xml_viewer.py \
    -c/--config CONFIG \
    -i/--input INPUT [INPUT ...] \
    -o/--output OUTPUT [OUTPUT ...] \
    [--opts OPTS [OPTS ...]] \
    [-t/--output_type {gray,color,overlay}]
```
The optional arguments are shown using square brackets. The parameter `--opts` allows you to change values specified in the config files. The `--output_type` parameter specifies which type of …
License
Distributed under the MIT License. See LICENSE for more information.
Contact
This project was made while working at the KNAW Humanities Cluster Digital Infrastructure.
Issues
Please report any bugs or errors that you find on the issues page, so that they can be looked into. Before opening a new issue, please check whether an issue for the same problem/bug is already open. Feature requests should also be made through the issues page.
Contributions
If you discover a bug or missing feature that you would like to help with please feel free to send a pull request.
Owner
- Name: Stefan Klut
- Login: stefanklut
- Kind: user
- Repositories: 16
- Profile: https://github.com/stefanklut
GitHub Events
Total
- Issues event: 8
- Watch event: 2
- Delete event: 9
- Issue comment event: 22
- Push event: 122
- Pull request event: 13
- Fork event: 3
- Create event: 8
Last Year
- Issues event: 8
- Watch event: 2
- Delete event: 9
- Issue comment event: 22
- Push event: 122
- Pull request event: 13
- Fork event: 3
- Create event: 8
Dependencies
- actions/checkout v3 composite
- peter-evans/create-pull-request v5 composite
- psf/black stable composite
- condaforge/mambaforge latest build
- loghi/docker.laypa latest
- cuda-toolkit
- flask
- gunicorn
- imagesize
- llvm-openmp <16
- matplotlib
- natsort
- numpy 1.*
- opencv
- pillow
- pip 23.*
- prometheus_client
- python 3.*
- pytorch
- pytorch-cuda
- scikit-image
- scipy
- shapely
- timm
- torchvision
- tqdm
- ultralytics