pagexml-tools
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: knaw-huc
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Size: 2.48 MB
Statistics
- Stars: 12
- Watchers: 11
- Forks: 4
- Open Issues: 0
- Releases: 15
Metadata Files
README.md
pagexml-tools
Utility functions for reading PageXML files
installing
using poetry
poetry add pagexml-tools
using pip
pip install pagexml-tools
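After installing, one way to confirm that the package is visible to the interpreter is to query the installed distribution metadata. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of pagexml-tools.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version("pagexml-tools"))
```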
Using
PageXML-tools contains functions for parsing and for a range of analysis tasks.
Parsing PageXML files and the Physical Document model
There is a tutorial that demonstrates the physical document model API
PageXML-tools contains basic functionality for parsing a PageXML file, returning a document model that represents the content of the file. The HTR/OCR process that generates PageXML recognises text in an image of a physical document.
```python
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"
page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
```
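To make the idea of a bounding box derived from coordinates concrete, here is a rough sketch that reads a tiny hand-written PageXML snippet with only the Python standard library. It is not how pagexml-tools is implemented. The sample document, its IDs, the helper names `bounding_box` and `iter_lines`, and the `(x, y, width, height)` tuple layout are all assumptions made for illustration. The namespace is that of the 2013-07-15 PAGE schema; files produced with other schema versions use a different namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema (other versions carry another date).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hand-written sample with hypothetical IDs and coordinates.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="110,120 880,118 882,180 110,182"/>
        <TextEquiv><Unicode>An example line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def bounding_box(points: str):
    """Turn a points string 'x1,y1 x2,y2 ...' into (x, y, width, height)."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def iter_lines(xml_text: str):
    """Yield (region id, line id, bounding box, text) for every text line."""
    root = ET.fromstring(xml_text)
    for region in root.iterfind(".//pc:TextRegion", NS):
        # Only direct child lines, so lines of nested regions are yielded once.
        for line in region.iterfind("pc:TextLine", NS):
            points = line.find("pc:Coords", NS).get("points")
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            yield region.get("id"), line.get("id"), bounding_box(points), text


for region_id, line_id, box, text in iter_lines(SAMPLE):
    print(region_id, line_id, box, text)
```

The box is simply the smallest axis-aligned rectangle that contains all polygon points, which is why a slightly skewed line polygon still yields a clean rectangle.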
In addition to the basic parsing and handling of PageXML output, there is functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file (tutorial),
- searching in text (keyword in context, keywords or fuzzy search),
- reading and working with tables (table processing),
- classifying physical document types in a large set of PageXML documents (tutorial),
- checking the quality of the HTR/OCR process (tutorial),
- comparing subsets (tutorial),
- identifying document sections in sequences of PageXML documents (tutorial),
- turning text lines into running text (tutorial),
- supporting different reading orders (tutorial),
- reinterpreting and restructuring text regions and lines (tutorial),
- turning physical structure into logical structure.
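Two of the tasks listed above, turning text lines into running text and keyword-in-context search, can be illustrated with a small standard-library sketch. This is not the pagexml-tools API: the function names `lines_to_running_text` and `keyword_in_context` are hypothetical, and the hyphenation rule is deliberately naive (every line-final hyphen is treated as a line-break hyphen, which is wrong for genuinely hyphenated compounds).

```python
import re
from typing import Iterator, List, Tuple


def lines_to_running_text(lines: List[str]) -> str:
    """Join line texts, merging a word split by a line-final hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Naive rule: drop the hyphen and glue the two halves together.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


def keyword_in_context(
    text: str, keyword: str, width: int = 20
) -> Iterator[Tuple[str, str, str]]:
    """Yield (left context, match, right context) for whole-word hits, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        yield left, match.group(), right


# Hypothetical line texts, as they might come from line.text in a page document.
lines = ["The resolution was ap-", "proved by the States", "General on Monday."]
running_text = lines_to_running_text(lines)
print(running_text)
for left, hit, right in keyword_in_context(running_text, "states", width=10):
    print(f"{left}[{hit}]{right}")
```

Searching the running text rather than individual lines lets a keyword match even when the original word was broken across two lines.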
Owner
- Name: KNAW Humanities Cluster
- Login: knaw-huc
- Kind: organization
- Location: Netherlands
- Website: https://huc.knaw.nl/
- Repositories: 61
- Profile: https://github.com/knaw-huc
Connecting people, research, data and collections. - IISG/Huygens Institute/Meertens Institute
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Koolen
    given-names: Marijn
    orcid: https://orcid.org/0000-0002-0301-2029
  - family-names: Buitendijk
    given-names: Bram
    orcid: https://orcid.org/0000-0002-3755-5929
title: pagexml-tools
version: 0.7.1
date-released: 2025-05-26
GitHub Events
Total
- Create event: 4
- Issues event: 2
- Release event: 2
- Watch event: 2
- Delete event: 2
- Issue comment event: 5
- Push event: 19
- Pull request event: 6
- Fork event: 2
Last Year
- Create event: 4
- Issues event: 2
- Release event: 2
- Watch event: 2
- Delete event: 2
- Issue comment event: 5
- Push event: 19
- Pull request event: 6
- Fork event: 2
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 5
- Total pull requests: 18
- Average time to close issues: 13 days
- Average time to close pull requests: 29 days
- Total issue authors: 4
- Total pull request authors: 5
- Average comments per issue: 1.4
- Average comments per pull request: 0.39
- Merged pull requests: 18
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 4
- Average time to close issues: 7 days
- Average time to close pull requests: 2 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- proycon (1)
- carschno (1)
- JavaStudentAlex (1)
Pull Request Authors
- marijnkoolen (14)
- proycon (2)
- KayWP (1)
- carschno (1)
- LvanWissen (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 304 last month (pypi)
- Total dependent packages: 1
- Total dependent repositories: 1
- Total versions: 16
- Total maintainers: 1
pypi.org: pagexml-tools
Utility functions for reading PageXML files
- Homepage: https://github.com/knaw-huc/pagexml
- Documentation: https://pagexml-tools.readthedocs.io/
- License: MIT
- Latest release: 0.7.1 (published 10 months ago)
Rankings
Maintainers (1)
Dependencies
- icecream ^2.1.2
- numpy ^1.22.3
- python ^3.8,<3.11
- python-dateutil ^2.8.2
- scipy ^1.7.0
- xmltodict ^0.12.0
- abatilo/actions-poetry v2.0.0 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- brotli ==1.0.9
- brotlicffi ==1.0.9.2
- cffi ==1.15.1
- colorama ==0.4.6
- contourpy ==1.0.7
- cycler ==0.11.0
- fonttools ==4.39.3
- fuzzy-search ==1.6.0
- importlib-resources ==5.12.0
- inflate64 ==0.3.1
- kiwisolver ==1.4.4
- matplotlib ==3.7.1
- multivolumefile ==0.2.3
- numpy ==1.24.2
- packaging ==23.0
- pandas ==1.5.3
- pillow ==9.5.0
- psutil ==5.9.4
- py7zr ==0.20.4
- pybcj ==1.0.1
- pycparser ==2.21
- pycryptodomex ==3.17
- pyparsing ==3.0.9
- pyppmd ==1.0.0
- python-dateutil ==2.8.2
- pytz ==2023.3
- pyzstd ==0.15.6
- scipy ==1.10.1
- seaborn ==0.12.2
- six ==1.16.0
- texttable ==1.6.7
- tqdm ==4.65.0
- xmltodict ==0.12.0
- zipp ==3.15.0