Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.9%) to scientific vocabulary
Repository
Transformation of ALTO 4 to TEI P5
Basic Info
- Host: GitHub
- Owner: kat-kel
- License: cc0-1.0
- Language: Python
- Default Branch: main
- Size: 934 MB
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
ALTO4 to TEI P5
This application prepares a TEI edition of a digitized document whose pages were transcribed and encoded in ALTO4 files using the HTR tool eScriptorium.
It follows SegmOnto's controlled vocabulary and has been designed as part of the Gallic(orpor)a pipeline.
How To Use
Requirements
- Python 3.7
- Bash
- A file-naming system that organizes all of a transcription's ALTO-XML files into a single sub-directory, whose name is identical to the document's Archival Resource Key (ARK). For example, all transcribed pages of the document whose ARK is btv1b8613380t are stored in the directory ./data/btv1b8613380t/.
alto2tei/
│   README.md
│   ...
│
└───data/
    │
    └───btv1b8613380t/
    │   │   f1.xml
    │   │   f2.xml
    │   │   ...
    └───bpt6k324358v/
        │   f1.xml
        │   f2.xml
        │   ...
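The layout described above can be checked before running the application. The following is a minimal sketch, not part of alto2tei itself: the function name `list_documents` is hypothetical, and the `f<number>.xml` page-naming pattern is an assumption taken from the example tree.

```python
from pathlib import Path


def list_documents(data_dir):
    """Map each ARK sub-directory name to its ALTO-XML file names.

    Hypothetical helper: assumes pages are named f<number>.xml,
    as in the example directory tree.
    """
    documents = {}
    for subdir in sorted(Path(data_dir).iterdir()):
        if not subdir.is_dir():
            continue
        # Sort by folio number so that f10.xml comes after f9.xml.
        pages = sorted(
            (p for p in subdir.glob("f*.xml") if p.stem[1:].isdigit()),
            key=lambda p: int(p.stem[1:]),
        )
        documents[subdir.name] = [p.name for p in pages]
    return documents
```

Calling `list_documents("./data")` on the tree shown above would return one entry per ARK, which makes it easy to spot a directory that is empty or whose name is not the ARK you expected.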
Steps
1. Download this repository from GitHub.
shell
$ git clone https://github.com/kat-kel/alto2tei.git
$ cd alto2tei
2. Create and activate a virtual environment in which to install the application.
shell
$ python3.7 -m venv ".venv-alto2tei"
$ source .venv-alto2tei/bin/activate
3. Install the application.
shell
$ bash install.sh
4. Configure the application.
./config.yml
yaml
data:
  # specify the relative path of your data files
  path: "./data"
5. Use the application.
shell
$ alto2tei --config config.yml --version "3.0.13" --header --sourcedoc --body
Required Arguments:
- --config (string): specify the location of the configuration file
- --version (string): specify the version number of Kraken
Optional Arguments:
- --header (boolean): include if you want a <teiHeader>
- --sourcedoc (boolean): include if you want a <sourceDoc>
- --body (boolean): include if you want a <body>; this option can only be used if the --sourcedoc option is also given
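The argument rules above, including the dependency of --body on --sourcedoc, can be expressed with Python's standard `argparse` module. This is only an illustrative sketch of the documented interface; the application's actual parser may be implemented differently, and the function names `build_parser` and `parse_args` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented alto2tei options (sketch)."""
    parser = argparse.ArgumentParser(prog="alto2tei")
    parser.add_argument("--config", required=True,
                        help="location of the configuration file")
    parser.add_argument("--version", required=True,
                        help="version number of Kraken")
    parser.add_argument("--header", action="store_true",
                        help="include a <teiHeader>")
    parser.add_argument("--sourcedoc", action="store_true",
                        help="include a <sourceDoc>")
    parser.add_argument("--body", action="store_true",
                        help="include a <body> (requires --sourcedoc)")
    return parser


def parse_args(argv):
    """Parse argv and enforce that --body is only used with --sourcedoc."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.body and not args.sourcedoc:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--body requires --sourcedoc")
    return args
```

With this sketch, `parse_args(["--config", "config.yml", "--version", "3.0.13", "--header", "--sourcedoc", "--body"])` succeeds, while passing `--body` without `--sourcedoc` exits with a usage error.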
Compatibility
Document Metadata
Currently, the application is designed to scrape metadata for the <teiHeader> from three resources related to the Bibliothèque nationale de France's Gallica repository.
1. Gallica's servers
2. Bibliothèque nationale de France's catalogue général
3. SUDOC's Répertoire des centres de ressources
Therefore, only transcriptions of digital exemplars from Gallica can take full advantage of the application's automatically generated <teiHeader>. However, the URI syntax that this application uses to retrieve data from Gallica's servers is the same syntax used by any institution that, like the Bibliothèque nationale de France, participates in the IIIF. To adjust this URI and work with transcriptions of digital exemplars stored on other institutions' servers, edit the parameters "scheme", "server", and "manifest_prefix" in this application's configuration file. These parameters are inserted into a string with the following Python f-string syntax:
python
f"{scheme}://{server}{manifest_prefix}{ARK}{manifest_suffix}"
The parameters of the IIIF URI can be set in the configuration file as follows:
yaml
scheme: "https"
server: "gallica.bnf.fr"
manifest_prefix: "/iiif/ark:/12148/"
image_prefix: "/iiif/ark:/12148/" # for Gallica, same as manifest
manifest_suffix: "/manifest.json"
An example of this URI, constructed for the document with the ARK "bpt6k324358v" is:
https://gallica.bnf.fr/iiif/ark:/12148/bpt6k324358v/manifest.json
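As a small illustration of how the configuration values combine, the sketch below wraps the f-string shown above in a function. The function name `manifest_url` is hypothetical and the defaults are simply the Gallica values from the configuration example; this is not necessarily how the application itself assembles the URI.

```python
def manifest_url(ark,
                 scheme="https",
                 server="gallica.bnf.fr",
                 manifest_prefix="/iiif/ark:/12148/",
                 manifest_suffix="/manifest.json"):
    """Assemble the IIIF manifest URI for a document's ARK (sketch)."""
    return f"{scheme}://{server}{manifest_prefix}{ark}{manifest_suffix}"


print(manifest_url("bpt6k324358v"))
# → https://gallica.bnf.fr/iiif/ark:/12148/bpt6k324358v/manifest.json
```

To target another institution, only the keyword arguments would change, for example `manifest_url(ark, server="iiif.example.org", manifest_prefix="/")`, where the server name and prefix are placeholders rather than a real endpoint.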
The application has been designed and tested on IIIF manifest data typical of text documents distributed on Gallica. Its adaptability to how other institutions have encoded data in a IIIF manifest cannot be guaranteed.
Transcription Data
The application can produce a <sourceDoc> from any ALTO 4 files that were created by the Kraken engine, including those produced inside the eScriptorium interface. The source document does not need to be part of the Bibliothèque nationale de France's collections, its digital exemplars do not need to be distributed on Gallica, and the machine transcription does not need to have been made with models trained on the SegmOnto controlled vocabulary. The TEI element <sourceDoc> that this application generates adapts to any ALTO 4 files that resemble the formats produced by Kraken's engine.
Pre-Annotated Text Body
Currently, the application is designed to recognize zones and lines of text on a page whose labels conform to SegmOnto's controlled vocabulary. The application cannot generate a <body> from ALTO-XML files in which a line or zone's @TAGREF is not part of the SegmOnto vocabulary.
However, with an XSL Transformation, a user can extract specific lines of text from the <sourceDoc> according to their own @TAGREF system and custom build the TEI-XML file's <body>.
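To make the idea of selecting lines by label concrete, here is a minimal sketch in Python rather than XSLT. It works directly on an ALTO 4 input file instead of the generated <sourceDoc>, whose exact structure is not described here. It assumes that labels are declared as `OtherTag` elements under `Tags` and referenced from each `TextLine` through the ALTO `TAGREFS` attribute; the function name `lines_by_label` and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the ALTO 4 schema.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}


def lines_by_label(alto_xml, wanted_labels):
    """Return the text of every TextLine whose label is in wanted_labels."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs (e.g. "LT1") to their human-readable labels.
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind("a:Tags/a:OtherTag", ALTO_NS)
    }
    wanted = set(wanted_labels)
    selected = []
    for line in root.iterfind(".//a:TextLine", ALTO_NS):
        refs = (line.get("TAGREFS") or "").split()
        if wanted & {labels.get(ref) for ref in refs}:
            # Join the CONTENT of the line's String elements.
            text = " ".join(
                s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)
            )
            selected.append(text)
    return selected


# Invented sample, reduced to the elements the sketch relies on.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
    <OtherTag ID="LT2" LABEL="HeadingLine"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine TAGREFS="LT2"><String CONTENT="Chapitre premier"/></TextLine>
    <TextLine TAGREFS="LT1"><String CONTENT="Il y avait une fois"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(lines_by_label(SAMPLE, {"HeadingLine"}))
# → ['Chapitre premier']
```

The same lookup, from tag ID to label and then to the lines carrying that label, is what a custom transformation would need to perform for a vocabulary other than SegmOnto.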
Owner
- Name: Kelly Christensen
- Login: kat-kel
- Kind: user
- Repositories: 10
- Profile: https://github.com/kat-kel
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this corpus, please cite it as below."
authors:
  - family-names: Christensen
    given-names: Kelly
    orcid: https://orcid.org/0000-0002-7236-874X
  - family-names: Gabay
    given-names: Simon
    orcid: https://orcid.org/0000-0001-9094-4475
  - family-names: Pinche
    given-names: Ariane
    orcid: https://orcid.org/0000-0002-7843-5050
title: "Alto 2 Tei"
date-released: 2022
url: "https://github.com/kat-kel/alto2tei"
identifiers:
  - type: doi
    value: 'zenodo'
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- alix-tz (1)
Dependencies
- PyYAML ==6.0
- certifi ==2022.6.15
- charset-normalizer ==2.1.0
- idna ==3.3
- lxml ==4.9.1
- requests ==2.28.1
- urllib3 ==1.26.11