Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (7.7%) to scientific vocabulary
Repository
TEI Reader Python Library
Basic Info
- Host: GitHub
- Owner: CentreForDigitalHumanities
- License: mit
- Language: Python
- Default Branch: main
- Size: 159 KB
Statistics
- Stars: 17
- Watchers: 3
- Forks: 3
- Open Issues: 3
- Releases: 1
Metadata Files
README.md
TEI-reader
Python 3 Library for Reading the Text Content and Metadata of TEI P5 (Lite) Files
The library focuses on extracting the main text content from a file and providing the available metadata about the text.
It was originally created to support importing TEI formatted corpora into GrETEL, using the corpus2alpino library.
TL;DR
```bash
pip install tei-reader
```
```python
from tei_reader import TeiReader

reader = TeiReader()
corpora = reader.read_file('example-tei.xml')  # or read_string
print(corpora.text)

# show element attributes before the actual element text
print(corpora.tostring(lambda x, text: str(list(a.key + '=' + a.text for a in x.attributes)) + text))
```
More Explanation
Create a reader with TeiReader(), then call either read_file(file_name) or read_string(str). Both return a Corpora object with the following properties:
| Property | Description |
| --- | --- |
| corpora[] | A corpora can contain sub-corpora. |
| documents[] | The Document objects directly part of this corpora. |
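
Because a corpora can nest sub-corpora, collecting every document means walking the tree. The following is a minimal sketch of that traversal, assuming only the `corpora` and `documents` properties listed above. The helper name `iter_documents` and the stand-in objects are hypothetical and are not part of the library; with a real file you would pass in the object returned by read_file or read_string instead.

```python
from types import SimpleNamespace


def iter_documents(corpora):
    """Yield the documents of `corpora`, then those of its sub-corpora."""
    # Documents that are directly part of this corpora.
    for document in corpora.documents:
        yield document
    # Recurse into every sub-corpora.
    for sub in corpora.corpora:
        yield from iter_documents(sub)


# Stand-in objects (NOT real tei_reader classes) so the sketch runs
# without a TEI file; they only mimic the two documented properties.
inner = SimpleNamespace(corpora=[], documents=[SimpleNamespace(text="second")])
outer = SimpleNamespace(corpora=[inner], documents=[SimpleNamespace(text="first")])

texts = [doc.text for doc in iter_documents(outer)]
print(texts)
```

The generator keeps memory use low for large corpora, since documents are handed out one at a time rather than collected up front.
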
Corpora and Document both inherit from Element. Every object deriving from Element provides the following properties:
| Property | Description |
| --- | --- |
| attributes{} | Contains the attributes applicable to this element. If an attribute itself contains attributes, these are also returned (e.g. encodingDesc::editorialDecl::normalization). |
| text | Get the entire text content as str |
| divisions[] | Recursively get all the text divisions in document order. If an element contains parts or text without a tag, those will be returned in order and wrapped with a PlaceholderDivision. |
| all_parts[] | Recursively get the parts that make up the entire text, in document order, e.g. when something has emphasis, is a footnote or is marked as foreign. Text without a container element will be returned in order and wrapped with a PlaceholderPart. |
| parts[] | Get the parts in document order directly below the current element. |
Attribute, PlaceholderDivision and PlaceholderPart all support the same properties as Element.
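
For metadata lookups it can be handy to group attribute values by key. The sketch below assumes, as the TL;DR lambda does, that iterating over `attributes` yields objects with a `key` and a `text`. The helper `attributes_by_key` and the stand-in objects are hypothetical and are not part of the library.

```python
from collections import defaultdict
from types import SimpleNamespace


def attributes_by_key(element):
    """Group the text of every attribute of `element` under its key."""
    grouped = defaultdict(list)
    for attribute in element.attributes:
        # The same key may occur more than once, so keep a list per key.
        grouped[attribute.key].append(attribute.text)
    return dict(grouped)


# Stand-in element (NOT a real tei_reader object) mimicking the
# documented `attributes`, `key` and `text` properties.
sample = SimpleNamespace(attributes=[
    SimpleNamespace(key="title", text="Example"),
    SimpleNamespace(key="author", text="A. Writer"),
    SimpleNamespace(key="author", text="B. Writer"),
])

metadata = attributes_by_key(sample)
print(metadata)
```

Values are kept as lists so that repeated keys are not silently overwritten.
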
Upload to PyPI
```bash
python setup.py sdist
twine upload dist/*
```
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: tei_reader
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: >-
      Research Software Lab, Centre for Digital Humanities,
      Utrecht University
    website: >-
      https://cdh.uu.nl/centre-for-digital-humanities/research-software-lab/
    city: Utrecht
    country: NL
identifiers:
  - type: doi
    value: 10.5281/zenodo.10418496
repository-code: 'https://github.com/CentreForDigitalHumanities/tei-reader'
license: MIT
Committers
Last synced: 11 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Sheean Spoel | s****l@u****l | 24 |
| Ben Bonfil | b****l@u****l | 5 |
| dependabot[bot] | 4****] | 2 |
Dependencies
- beautifulsoup4 *
- lxml *
- beautifulsoup4 ==4.6.0
- lxml ==4.9.1
- beautifulsoup4 *
- lxml *