visarchpy

pipelines for the extraction and processing of visuals from PDFs

https://github.com/aidapt-a/visarchpy

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.9%) to scientific vocabulary

Keywords

ai computer-vision data-pipelines ocr pdf tu-delft

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: AiDAPT-A
License: mit
Language: Python
Default Branch: main
Homepage: https://visarchpy.readthedocs.io
Size: 3.79 MB

Statistics

Stars: 5
Watchers: 1
Forks: 1
Open Issues: 8
Releases: 4

Topics

ai computer-vision data-pipelines ocr pdf tu-delft

Created about 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License Code of conduct Citation

VisArchPy

Data pipelines for extraction, transformation and visualization of architectural visuals in Python. It extracts images embedded in PDF files, collects relevant metadata, and extracts visual features using the DinoV2 model. We aim to make this package an AI-powered tool with features for recognizing different types of architectural visuals (types of buildings, structures, etc.). The package is still in development, and we are working on adding more features and improving the existing ones. If you have any suggestions or questions, please open an issue in our GitHub repository.

Main Features

Extraction pipelines

Layout: pipeline for extracting metadata and visuals (images) from PDF files using a layout analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements.
OCR: pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.
LayoutOCR: pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis.
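The recursive sorting described for the Layout pipeline can be illustrated with a minimal sketch. It does not use VisArchPy's own classes: `Element` and `sort_elements` are hypothetical names, and the rule that containers are only descended into (never collected) is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a PDF layout element (not a VisArchPy class)."""
    kind: str                                    # e.g. "image", "text", "container", "line"
    children: list = field(default_factory=list)


def sort_elements(element, found=None):
    """Recursively walk a layout tree and sort its elements into three groups."""
    if found is None:
        found = {"images": [], "text": [], "other": []}
    if element.kind == "image":
        found["images"].append(element)
    elif element.kind == "text":
        found["text"].append(element)
    elif element.children:
        # Assumption: containers are not collected, only descended into.
        for child in element.children:
            sort_elements(child, found)
    else:
        found["other"].append(element)
    return found


page = Element("container", [
    Element("text"),
    Element("container", [Element("image"), Element("text")]),
    Element("line"),
])
groups = sort_elements(page)
print({name: len(items) for name, items in groups.items()})
# prints {'images': 1, 'text': 2, 'other': 1}
```

The nested image is found even though it sits one level below the page, which is the point of checking elements recursively.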

Metadata Extraction

Extraction of metadata of extracted images (document page, image size)
Extraction of captions of images based on proximity to images and text-analysis using keywords.
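As a rough illustration of caption matching by proximity and keywords, the sketch below accepts a text block only if it starts a short distance below the image and contains one of the keywords. The function names, the box format, the downward-growing y axis and the requirement that both conditions hold are all assumptions for this example, not VisArchPy's actual implementation.

```python
KEYWORDS = ("figure", "caption", "figuur")  # default keywords from the Settings section


def is_below_within(image_box, text_box, offset):
    """True if text_box starts at most `offset` units below image_box.

    Boxes are (x0, y0, x1, y1) with y growing downwards (an assumption).
    """
    gap = text_box[1] - image_box[3]
    overlaps_horizontally = text_box[0] < image_box[2] and text_box[2] > image_box[0]
    return 0 <= gap <= offset and overlaps_horizontally


def find_caption(image_box, text_blocks, offset, keywords=KEYWORDS):
    """Return the first nearby text that contains a keyword, else None."""
    for box, text in text_blocks:
        near = is_below_within(image_box, box, offset)
        if near and any(keyword in text.lower() for keyword in keywords):
            return text
    return None


image = (50, 100, 250, 300)
blocks = [
    ((50, 20, 250, 40), "Introduction to the site"),       # above the image
    ((50, 305, 250, 320), "Figure 3: ground floor plan"),  # 5 units below the image
    ((50, 400, 250, 420), "Figure 4: section"),            # too far below
]
print(find_caption(image, blocks, offset=10))
# prints Figure 3: ground floor plan
```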

Transformation utilities

Dino: pipeline for transforming images into visual features using the self-supervised learning in DinoV2.

Visualization utilities

Viz: a utility to create a bounding box plot. This plot provides an overview of the shapes and sizes of images in a data set.

Dependencies

Installation

After installing the dependencies, install VisArchPy using pip.

    pip install visarchpy

Installing from source

Clone the repository:

    git clone https://github.com/AiDAPT-A/VisArchPy.git

Go to the root of the repository:

    cd VisArchPy/

Install the package using pip:

    pip install .

Developers who intend to modify the source code can install additional dependencies for testing and documentation as follows.

Go to the root directory visarchpy/
Run:

    pip install -e .[dev]

Usage

VisArchPy provides a command line interface to access its functionality. If you want to use VisArchPy as a Python package, consult the documentation.

To access the CLI:

    visarch -h

To access a particular pipeline:

    visarch [PIPELINE] [SUBCOMMAND]

For example, to run the layout pipeline using a single PDF file, do the following:

    visarch layout from-file <path-to-pdf-file> <path-output-directory>

Use visarch [PIPELINE] [SUBCOMMAND] -h for help.
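When several PDF files need processing, the `layout from-file` command shown above can be driven from a small Python script. This is a minimal sketch: `layout_command` and `run_layout_on_folder` are hypothetical helper names, and the script assumes the `visarch` executable is on the PATH. The CLI may offer its own way to process a folder; check `visarch layout -h` first.

```python
import subprocess
from pathlib import Path


def layout_command(pdf_file, output_dir):
    """Build the argument list for `visarch layout from-file`."""
    return ["visarch", "layout", "from-file", str(pdf_file), str(output_dir)]


def run_layout_on_folder(pdf_dir, output_dir, dry_run=True):
    """Run (or, with dry_run, only collect) one command per PDF in pdf_dir."""
    commands = [layout_command(pdf, output_dir)
                for pdf in sorted(Path(pdf_dir).glob("*.pdf"))]
    if not dry_run:
        for command in commands:
            # Requires visarch to be installed; raises if a run fails.
            subprocess.run(command, check=True)
    return commands


print(layout_command("plans.pdf", "results"))
# prints ['visarch', 'layout', 'from-file', 'plans.pdf', 'results']
```

Keeping `dry_run=True` as the default lets you inspect the commands before anything is executed.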

Results

Results from the data extraction pipelines (Layout, OCR, LayoutOCR) are saved to the output directory. Results are organized as follows:

    00000/  # results directory
    ├── pdf-001  # directory where images are saved to. One per PDF file
    ├── 00000-metadata.csv  # extracted metadata as CSV
    ├── 00000-metadata.json  # extracted metadata as JSON
    ├── 00000-settings.json  # settings used by pipeline
    └── 00000.log  # log file
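A short sketch of how a script might locate these files after a run. It assumes that every result file is prefixed with the name of the results directory, which is inferred from the example listing and not confirmed elsewhere; `result_files` and `missing_results` are hypothetical helper names.

```python
from pathlib import Path


def result_files(results_dir):
    """Map each expected result file to its path, based on the directory name."""
    root = Path(results_dir)
    run_id = root.name  # e.g. "00000"
    return {
        "metadata_csv": root / f"{run_id}-metadata.csv",
        "metadata_json": root / f"{run_id}-metadata.json",
        "settings": root / f"{run_id}-settings.json",
        "log": root / f"{run_id}.log",
    }


def missing_results(results_dir):
    """Return the names of expected result files that do not exist."""
    return [name for name, path in result_files(results_dir).items()
            if not path.is_file()]


print([path.name for path in result_files("00000").values()])
# prints ['00000-metadata.csv', '00000-metadata.json', '00000-settings.json', '00000.log']
```

Calling `missing_results` on an output directory gives a quick check that a pipeline run completed and wrote everything it was expected to.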

Settings

The pipeline's settings determine how visual extraction from PDF files is performed. Settings must be passed as a JSON file on the CLI. Settings must include all items listed below. The values shown below are the defaults.

Available settings

```python
{
    "layout": {  # settings for layout analysis
        "caption": {
            "offset": [4, "mm"],  # distance used to locate captions
            "direction": "down",  # direction used to locate captions
            "keywords": [  # keywords used to find captions based on text analysis
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {  # images smaller than these dimensions will be ignored
            "width": 120,
            "height": 120
        }
    },
    "ocr": {  # settings for OCR analysis
        "caption": {
            "offset": [50, "px"],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        },
        "resolution": 250,  # dpi to convert PDF pages to images before OCR
        "resize": 30000,  # total pixels. Larger OCR inputs are downsized to this before OCR
        "tesseract": "--psm 1 --oem 3"  # tesseract options
    }
}
```

When no settings are passed to a pipeline, the defaults are used. To print the default settings to the terminal use:

    visarch [PIPELINE] settings
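Because a settings file has to be plain JSON (which allows no comments) and is expected to contain every item, it can help to build and check it in Python before saving. The sketch below is only an illustration: `REQUIRED` and `missing_settings` are hypothetical names, and the set of required keys is read off the defaults listed above, not taken from VisArchPy's own validation.

```python
import json

# Assumption: required items, as read from the default settings above.
REQUIRED = {
    "layout": {"caption", "image"},
    "ocr": {"caption", "image", "resolution", "resize", "tesseract"},
}


def missing_settings(settings):
    """List dotted names of required items absent from a settings dict."""
    missing = []
    for section, keys in REQUIRED.items():
        if section not in settings:
            missing.append(section)
            continue
        absent = sorted(keys - set(settings[section]))
        missing.extend(f"{section}.{key}" for key in absent)
    return missing


caption = {"offset": [4, "mm"], "direction": "down",
           "keywords": ["figure", "caption", "figuur"]}
settings = {
    "layout": {"caption": caption, "image": {"width": 120, "height": 120}},
    "ocr": {
        "caption": {**caption, "offset": [50, "px"]},
        "image": {"width": 120, "height": 120},
        "resolution": 250,
        "resize": 30000,
        # "tesseract" is left out on purpose to show the check.
    },
}
print(missing_settings(settings))   # prints ['ocr.tesseract']

settings["ocr"]["tesseract"] = "--psm 1 --oem 3"
print(missing_settings(settings))   # prints []

settings_json = json.dumps(settings, indent=2)  # text ready to save as a .json file
```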

Citation

Please cite this software as follows:

Garcia Alvarez, M. G., Khademi, S., & Pohl, D. (2023). VisArchPy [Computer software]. https://github.com/AiDAPT-A/VisArchPy

Acknowledgements

VisArchPy was developed thanks to the support provided by the Digital Competence Centre, Delft University of Technology.
Research Data Services, Delft University of Technology, The Netherlands.

Owner

Name: AiDAPT-A
Login: AiDAPT-A
Kind: organization

Repositories: 1
Profile: https://github.com/AiDAPT-A

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: VisArchPy
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Manuel Gilberto
    family-names: Garcia Alvarez
    email: m.g.garciaalvarez@tudelft.nl
    affiliation: Delft University of Technology
    orcid: 'https://orcid.org/0000-0003-1579-9989'
  - given-names: Seyran
    family-names: Khademi
    orcid: 'https://orcid.org/0000-0003-4623-3689'
    affiliation: Delft University of Technology
    email: S.Khademi@tudelft.nl
  - email: D.Pohl@tudelft.nl
    given-names: Dennis
    family-names: Pohl
    affiliation: Delft University of Technology
    orcid: 'https://orcid.org/0000-0002-4847-1501'
repository-code: 'https://github.com/AiDAPT-A/VisArchPy'
abstract: >-
  Data pipelines for extraction, transformation and
  visualization of architectural visuals in Python.
keywords:
  - architectural visual
  - data pipeline
  - architecture
  - pdf
  - image
  - dinov2
license: MIT
date-released: '2023-12-04'

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Packages

Total packages: 1
Total downloads: 19 last month (PyPI)

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 5
Total maintainers: 1

pypi.org: visarchpy

Data pipelines for extraction, transformation and visualization of architectural visuals in Python.

Documentation: https://visarchpy.readthedocs.io
License: MIT
Latest release: 1.0.4
published about 2 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 19 Last month

Rankings

Dependent packages count: 10.1%

Average: 38.7%

Dependent repos count: 67.3%

Maintainers (1)

manuelgarciaalvarez

Last synced: 6 months ago

Dependencies

pyproject.toml pypi

Pillow *
PyMuPDF *
PyPDF2 *
beautifulsoup4 *
pandas *
pdfminer.six *
pymods *
requests *
shapely *

.github/workflows/python-publish.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite
pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite

.github/workflows/unit-tests.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

docs/requirements.txt pypi

Sphinx ==7.2.6
sphinx-copybutton ==0.5.2
sphinx-rtd-theme ==1.3.0
sphinx-tabs ==3.4.4
sphinxcontrib-applehelp ==1.0.7
sphinxcontrib-devhelp ==1.0.5
sphinxcontrib-htmlhelp ==2.0.4
sphinxcontrib-jquery ==4.1
sphinxcontrib-jsmath ==1.0.1
sphinxcontrib-qthelp ==1.0.6
sphinxcontrib-serializinghtml ==1.1.9
