Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: knaw-huc
  • License: MIT
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 2.48 MB
Statistics
  • Stars: 12
  • Watchers: 11
  • Forks: 4
  • Open Issues: 0
  • Releases: 15
Created almost 5 years ago · Last pushed 8 months ago
Metadata Files
Readme Contributing License Citation

README.md

pagexml-tools

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Utility functions for reading PageXML files

installing

using poetry

`poetry add pagexml-tools`

using pip

`pip install pagexml-tools`

Using

PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.

Parsing PageXML files and the Physical Document model

There is a tutorial that demonstrates the physical document model API.

PageXML-tools contains basic functionality for parsing a PageXML file into a document model that represents the content of the file. PageXML is generated by an HTR/OCR process that recognises text in an image of a physical document.

```python
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexmlfile.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
```
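To make the structure behind this document model more concrete, the sketch below reads text regions and lines straight from a small PageXML string using only the Python standard library. It is an illustration, not how pagexml-tools is implemented: the sample document, the helper names `bounding_box` and `read_lines`, and the `(x, y, width, height)` tuple layout are all assumptions made for this example, and the namespace shown is that of the 2013-07-15 PAGE schema (other PageXML versions use a different date).

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema (assumption: your files may use another version).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical minimal PageXML document used only for this illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="tr_1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="line_1">
        <Coords points="110,120 880,118 882,180 112,182"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="line_2">
        <Coords points="110,200 700,200 700,260 110,260"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def bounding_box(points):
    """Return (x, y, width, height) enclosing a PageXML 'points' string."""
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def read_lines(xml_text):
    """Yield (region_id, line_id, box, text) for every text line in the document."""
    root = ET.fromstring(xml_text)
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            coords = line.find("pc:Coords", NS)
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            yield region.get("id"), line.get("id"), bounding_box(coords.get("points")), text


for region_id, line_id, box, text in read_lines(SAMPLE):
    print(region_id, line_id, box, text)
```

The library's own document model wraps this kind of information in objects such as `page_doc.text_regions` and `line.coords.box`, so in practice there should be no need to walk the XML by hand.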

In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:

  • reading sets of PageXML files from an archive (tar, zip) file (tutorial),
  • searching in text (keyword in context, keywords or fuzzy search),
  • reading and working with tables (table processing),
  • classifying physical document types in a large set of PageXML documents (tutorial),
  • checking the quality of the HTR/OCR process (tutorial),
  • comparing subsets (tutorial),
  • identifying document sections in sequences of PageXML documents (tutorial),
  • turning text lines into running text (tutorial),
  • supporting different reading orders (tutorial),
  • reinterpreting and restructuring text regions and lines (tutorial),
  • turning physical structure into logical structure.
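As a rough idea of what the first task in this list involves, here is a standard-library sketch that iterates over the XML files in a zip archive without unpacking it to disk. It does not use or reproduce the pagexml-tools archive API (see the linked tutorial for that): the function names `iter_pagexml_from_zip` and `build_demo_archive`, the file names, and the UTF-8 assumption are all hypothetical choices for this example.

```python
import io
import zipfile


def iter_pagexml_from_zip(archive, suffix=".xml"):
    """Yield (member_name, xml_text) for each file in a zip archive ending in suffix.

    `archive` may be a path or a file-like object. Files are assumed to be UTF-8.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith(suffix):
                yield name, zf.read(name).decode("utf-8")


def build_demo_archive():
    """Create a small in-memory zip archive so the example runs without real scans."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("scans/page_0001.xml", "<PcGts/>")
        zf.writestr("scans/page_0002.xml", "<PcGts/>")
        zf.writestr("scans/readme.txt", "not a PageXML file")
    buffer.seek(0)
    return buffer


for name, xml_text in iter_pagexml_from_zip(build_demo_archive()):
    print(name, len(xml_text))
```

Yielding one file at a time keeps memory use flat for large archives, which matters when a collection holds many thousands of scans.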


Owner

  • Name: KNAW Humanities Cluster
  • Login: knaw-huc
  • Kind: organization
  • Location: Netherlands

Connecting people, research, data and collections. - IISG/Huygens Institute/Meertens Institute

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
- family-names: Koolen
  given-names: Marijn
  orcid: https://orcid.org/0000-0002-0301-2029
- family-names: Buitendijk
  given-names: Bram
  orcid: https://orcid.org/0000-0002-3755-5929
title: pagexml-tools
version: 0.7.1
date-released: 2025-05-26

GitHub Events

Total
  • Create event: 4
  • Issues event: 2
  • Release event: 2
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 5
  • Push event: 19
  • Pull request event: 6
  • Fork event: 2
Last Year
  • Create event: 4
  • Issues event: 2
  • Release event: 2
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 5
  • Push event: 19
  • Pull request event: 6
  • Fork event: 2

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 5
  • Total pull requests: 18
  • Average time to close issues: 13 days
  • Average time to close pull requests: 29 days
  • Total issue authors: 4
  • Total pull request authors: 5
  • Average comments per issue: 1.4
  • Average comments per pull request: 0.39
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 4
  • Average time to close issues: 7 days
  • Average time to close pull requests: 2 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • proycon (1)
  • carschno (1)
  • JavaStudentAlex (1)
Pull Request Authors
  • marijnkoolen (14)
  • proycon (2)
  • KayWP (1)
  • carschno (1)
  • LvanWissen (1)
Top Labels
Issue Labels
  • invalid (1)
  • enhancement (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 304 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 16
  • Total maintainers: 1
pypi.org: pagexml-tools

Utility functions for reading PageXML files

  • Versions: 16
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 304 Last month
Rankings
Dependent packages count: 4.7%
Downloads: 9.8%
Average: 15.1%
Forks count: 19.1%
Stargazers count: 20.3%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 7 months ago

Dependencies

pyproject.toml pypi
  • icecream ^2.1.2
  • numpy ^1.22.3
  • python ^3.8,<3.11
  • python-dateutil ^2.8.2
  • scipy ^1.7.0
  • xmltodict ^0.12.0
.github/workflows/python-package.yml actions
  • abatilo/actions-poetry v2.0.0 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
docs/requirements.txt pypi
  • brotli ==1.0.9
  • brotlicffi ==1.0.9.2
  • cffi ==1.15.1
  • colorama ==0.4.6
  • contourpy ==1.0.7
  • cycler ==0.11.0
  • fonttools ==4.39.3
  • fuzzy-search ==1.6.0
  • importlib-resources ==5.12.0
  • inflate64 ==0.3.1
  • kiwisolver ==1.4.4
  • matplotlib ==3.7.1
  • multivolumefile ==0.2.3
  • numpy ==1.24.2
  • packaging ==23.0
  • pandas ==1.5.3
  • pillow ==9.5.0
  • psutil ==5.9.4
  • py7zr ==0.20.4
  • pybcj ==1.0.1
  • pycparser ==2.21
  • pycryptodomex ==3.17
  • pyparsing ==3.0.9
  • pyppmd ==1.0.0
  • python-dateutil ==2.8.2
  • pytz ==2023.3
  • pyzstd ==0.15.6
  • scipy ==1.10.1
  • seaborn ==0.12.2
  • six ==1.16.0
  • texttable ==1.6.7
  • tqdm ==4.65.0
  • xmltodict ==0.12.0
  • zipp ==3.15.0