tei-reader

TEI Reader Python Library

https://github.com/centrefordigitalhumanities/tei-reader

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.7%) to scientific vocabulary

Keywords from Contributors

interactive mesh interpretability profiles sequences generic projection optim embedded hacking
Last synced: 10 months ago

Repository

TEI Reader Python Library

Basic Info
  • Host: GitHub
  • Owner: CentreForDigitalHumanities
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 159 KB
Statistics
  • Stars: 17
  • Watchers: 3
  • Forks: 3
  • Open Issues: 3
  • Releases: 1
Created over 8 years ago · Last pushed 12 months ago
Metadata Files
Readme License Citation

README.md

TEI-reader

Python 3 Library for Reading the Text Content and Metadata of TEI P5 (Lite) Files

The library focuses on extracting the main text content from a file and providing the available metadata about the text.

It was originally created to support importing TEI formatted corpora into GrETEL, using the corpus2alpino library.

TL;DR

```bash
pip install tei-reader
```

```python
from tei_reader import TeiReader

reader = TeiReader()
corpora = reader.read_file('example-tei.xml')  # or read_string
print(corpora.text)

# show element attributes before the actual element text
print(corpora.tostring(lambda x, text: str(list(a.key + '=' + a.text for a in x.attributes)) + text))
```

More Explanation

A reader can be opened using TeiReader(). It is then possible to either call read_file(file_name) or read_string(str). Both will return a Corpora object containing the following properties:

| Property | Description |
| --- | --- |
| corpora[] | A corpora can contain sub-corpora. |
| documents[] | The Document objects directly part of this corpora. |
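
As a rough illustration of the nesting described above, the sketch below walks a Corpora-like object and collects every document, including those held by sub-corpora. It relies only on the `corpora` and `documents` properties from the table. The helper name `iter_documents` and the `SimpleNamespace` stand-ins are assumptions made for illustration so the snippet runs without a TEI file; they are not part of the library.

```python
from types import SimpleNamespace


def iter_documents(corpora):
    """Yield the documents of a corpora, then those of its sub-corpora (depth-first)."""
    # Documents that are directly part of this corpora come first.
    yield from corpora.documents
    # Then recurse into each sub-corpora in order.
    for sub in corpora.corpora:
        yield from iter_documents(sub)


# Hypothetical stand-ins exposing the same two properties as a Corpora object.
inner = SimpleNamespace(corpora=[], documents=["doc-b", "doc-c"])
root = SimpleNamespace(corpora=[inner], documents=["doc-a"])

found = list(iter_documents(root))
print(found)
```

With the real library, the object returned by `read_file` or `read_string` would presumably be passed in place of `root`.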

Corpora and Document both inherit from Element. The following properties are available on every object deriving from it:

| Property | Description |
| --- | --- |
| attributes{} | Contains the attributes applicable to this element. If an attribute itself contains attributes, these are also returned (e.g. encodingDesc::editorialDecl::normalization). |
| text | Get the entire text content as str. |
| divisions[] | Recursively get all the text divisions in document order. If an element contains parts or text without a tag, those will be returned in order and wrapped with a PlaceholderDivision. |
| all_parts[] | Recursively get the parts in document order constituting the entire text, e.g. if something has emphasis, a footnote or is marked as foreign. Text without a container element will be returned in order and wrapped with a PlaceholderPart. |
| parts[] | Get the parts in document order directly below the current element. |

Attribute, PlaceholderDivision and PlaceholderPart all support the same properties as Element.
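
The attribute access used in the TL;DR snippet (`a.key`, `a.text`) can be wrapped in small helpers. This is a minimal sketch, assuming that iterating `attributes` yields objects exposing `key` and `text` (as the TL;DR example suggests) and that each entry of `all_parts` exposes `text` like any other Element. The names `attributes_to_dict` and `collect_part_texts`, and the stand-in objects, are hypothetical and not part of the library.

```python
from types import SimpleNamespace


def attributes_to_dict(element):
    """Map each attribute key to its text for one Element-like object."""
    return {attr.key: attr.text for attr in element.attributes}


def collect_part_texts(element):
    """Return the text of every part, in the order given by all_parts."""
    return [part.text for part in element.all_parts]


# Hypothetical stand-in mimicking the properties listed in the table above.
element = SimpleNamespace(
    attributes=[
        SimpleNamespace(key="title", text="Example"),
        SimpleNamespace(key="lang", text="en"),
    ],
    all_parts=[
        SimpleNamespace(text="Hello "),
        SimpleNamespace(text="world"),
    ],
)

meta = attributes_to_dict(element)
texts = collect_part_texts(element)
print(meta)
print("".join(texts))
```

If several attributes share the same key, the dictionary keeps only the last one; a list of pairs would preserve all of them.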

Upload to PyPi

```bash
python setup.py sdist
twine upload dist/*
```

Owner

  • Name: Centre for Digital Humanities
  • Login: CentreForDigitalHumanities
  • Kind: organization
  • Email: cdh@uu.nl
  • Location: Netherlands

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: tei_reader
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: >-
      Research Software Lab, Centre for Digital Humanities,
      Utrecht University
    website: >-
      https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/
    city: Utrecht
    country: NL
identifiers:
  - type: doi
    value: 10.5281/zenodo.10418496
repository-code: 'https://github.com/CentreForDigitalHumanities/tei-reader'
license: MIT

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 31
  • Total Committers: 3
  • Avg Commits per committer: 10.333
  • Development Distribution Score (DDS): 0.226
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Sheean Spoel s****l@u****l 24
Ben Bonfil b****l@u****l 5
dependabot[bot] 4****] 2
Committer Domains (Top 20 + Academic)
uu.nl: 2

Issues and Pull Requests

Last synced: 10 months ago


Dependencies

requirements.in pypi
  • beautifulsoup4 *
  • lxml *
requirements.txt pypi
  • beautifulsoup4 ==4.6.0
  • lxml ==4.9.1
setup.py pypi
  • beautifulsoup4 *
  • lxml *