alto2tei

Transformation of ALTO 4 to TEI P5

https://github.com/kat-kel/alto2tei

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: kat-kel
  • License: cc0-1.0
  • Language: Python
  • Default Branch: main
  • Size: 934 MB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created almost 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

ALTO4 to TEI P5

This application prepares a TEI edition of a digitized document whose pages were transcribed and encoded in ALTO4 files using the HTR tool eScriptorium.

It follows SegmOnto's controlled vocabulary and has been designed as part of the Gallic(orpor)a pipeline.

How To Use

Requirements

  • Python 3.7
  • Bash
  • A file-naming system that organizes all of a transcription's ALTO-XML files into a single sub-directory, whose name is identical to the document's Archival Resource Key (ARK). For example, all transcribed pages of the document whose ARK is btv1b8613380t are stored in the directory ./data/btv1b8613380t/.

    alto2tei/
    │   README.md
    │   ...
    │
    └───data/
        │
        └───btv1b8613380t/
        │   │   f1.xml
        │   │   f2.xml
        │   │   ...
        │
        └───bpt6k324358v/
            │   f1.xml
            │   f2.xml
            │   ...
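The directory layout above can be checked before running the application. The following is a minimal sketch, not part of alto2tei itself: the function name `list_documents` and the assumption that page files are always named `f<number>.xml` are illustrative, taken only from the example tree.

```python
import re
from pathlib import Path

# Assumption: page files follow the f1.xml, f2.xml, ... pattern shown in the tree.
PAGE_NAME = re.compile(r"f(\d+)\.xml")


def page_number(path):
    """Return the page number encoded in a file name, or None if it does not match."""
    match = PAGE_NAME.fullmatch(path.name)
    return int(match.group(1)) if match else None


def list_documents(data_dir):
    """Map each ARK sub-directory name to its page files, sorted by page number."""
    documents = {}
    for sub in sorted(Path(data_dir).iterdir()):
        if not sub.is_dir():
            continue
        pages = [p for p in sub.iterdir() if page_number(p) is not None]
        # Numeric sort so that f2.xml comes before f10.xml.
        pages.sort(key=page_number)
        documents[sub.name] = [p.name for p in pages]
    return documents
```

Calling `list_documents("./data")` on the example tree would return one entry per ARK, which makes it easy to spot a directory with no pages or a stray file before the transformation starts.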

Steps

  1. Download this repository from GitHub.
     $ git clone https://github.com/kat-kel/alto2tei.git
     $ cd alto2tei
  2. Create and activate a virtual environment in which to install the application.
     $ python3.7 -m venv ".venv-alto2tei"
     $ source .venv-alto2tei/bin/activate
  3. Install the application.
     $ bash install.sh
  4. Configure the application in ./config.yml.
     data:
       # specify the relative path of your data files
       path: "./data"
  5. Use the application.
     $ alto2tei --config config.yml --version "3.0.13" --header --sourcedoc --body

Required Arguments:
  • --config (string): specify the location of the configuration file
  • --version (string): specify the version number of Kraken

Optional Arguments:
  • --header (boolean): include if you want a <teiHeader>
  • --sourcedoc (boolean): include if you want a <sourceDoc>
  • --body (boolean): include if you want a <body>; this can only be used if the --sourcedoc option is also given
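To make the relationship between the arguments concrete, here is a hedged sketch of how such a command line could be declared with Python's standard `argparse` module. It is not the application's actual argument handling; the function names `build_parser` and `parse_args` are hypothetical, and only the flag names come from the list above.

```python
import argparse


def build_parser():
    """Declare the documented flags (illustrative only, not alto2tei's real parser)."""
    parser = argparse.ArgumentParser(prog="alto2tei")
    parser.add_argument("--config", required=True,
                        help="location of the configuration file")
    parser.add_argument("--version", required=True,
                        help="version number of Kraken")
    parser.add_argument("--header", action="store_true",
                        help="include a <teiHeader>")
    parser.add_argument("--sourcedoc", action="store_true",
                        help="include a <sourceDoc>")
    parser.add_argument("--body", action="store_true",
                        help="include a <body>; requires --sourcedoc")
    return parser


def parse_args(argv):
    """Parse argv and enforce that --body is only used together with --sourcedoc."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.body and not args.sourcedoc:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--body can only be used together with --sourcedoc")
    return args
```

With this sketch, `parse_args(["--config", "config.yml", "--version", "3.0.13", "--body"])` would exit with an error, mirroring the documented dependency of `--body` on `--sourcedoc`.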

Compatibility

Document Metadata

Currently, the application is designed to scrape metadata for the <teiHeader> from three resources related to the Bibliothèque nationale de France's Gallica repository.

  1. Gallica's servers
  2. Bibliothèque nationale de France's catalogue général
  3. SUDOC's Répertoire des centres de ressources

Therefore, only transcriptions of digital exemplars from Gallica can take full advantage of the application's automatically generated <teiHeader>. However, the URI syntax that this application uses to retrieve data from Gallica's servers is the same syntax used by any other institution that, like the Bibliothèque nationale de France, participates in the IIIF. To adjust this URI and work with transcriptions of digital exemplars stored on other institutions' servers, edit the parameters "scheme", "server", and "manifest_prefix" in this application's configuration file. These parameters are inserted into a Python f-string of the following form:

    f"{scheme}://{server}{manifest_prefix}{ARK}{manifest_suffix}"

The first three parameters in the IIIF URI, together with the image prefix and the manifest suffix, can be modified in the configuration file as follows:

    scheme: "https"
    server: "gallica.bnf.fr"
    manifest_prefix: "/iiif/ark:/12148/"
    image_prefix: "/iiif/ark:/12148/" # for Gallica, same as manifest
    manifest_suffix: "/manifest.json"

An example of this URI, constructed for the document with the ARK "bpt6k324358v", is:

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k324358v/manifest.json

The application has been designed and tested on IIIF manifest data typical of text documents distributed on Gallica. Its adaptability to how other institutions have encoded data in a IIIF manifest cannot be guaranteed.
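The URI construction described in this section can be sketched in a few lines of Python. The function name `manifest_uri` is hypothetical and the configuration is passed as a plain dictionary rather than read from the YAML file; the values are the Gallica defaults quoted above.

```python
# Gallica defaults, as given in the configuration example.
GALLICA_CONFIG = {
    "scheme": "https",
    "server": "gallica.bnf.fr",
    "manifest_prefix": "/iiif/ark:/12148/",
    "image_prefix": "/iiif/ark:/12148/",
    "manifest_suffix": "/manifest.json",
}


def manifest_uri(ark, config):
    """Build the IIIF manifest URI for one document from the configured parts."""
    scheme = config["scheme"]
    server = config["server"]
    manifest_prefix = config["manifest_prefix"]
    manifest_suffix = config["manifest_suffix"]
    return f"{scheme}://{server}{manifest_prefix}{ark}{manifest_suffix}"


if __name__ == "__main__":
    print(manifest_uri("bpt6k324358v", GALLICA_CONFIG))
```

For another institution, only the dictionary values would change; whether the manifest returned by that server contains the fields the application expects is a separate question, as noted above.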

Transcription Data

The application can produce a <sourceDoc> from any ALTO 4 files that were created by the Kraken engine, including those produced inside the eScriptorium interface. The source document does not need to be part of the Bibliothèque nationale de France's collections, its digital exemplars do not need to be distributed on Gallica, and the machine transcription does not need to have been made with models trained on the SegmOnto controlled vocabulary. The TEI element <sourceDoc> that this application generates adapts to any ALTO 4 files that resemble the formats produced by Kraken's engine.
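As an illustration of the kind of input involved, the sketch below reads the text lines out of one ALTO 4 page using only the standard library. It is not the application's own code: the function name `read_lines` is hypothetical, and it assumes the usual ALTO 4 layout in which each `TextLine` holds one or more `String` elements carrying the text in a `CONTENT` attribute.

```python
import xml.etree.ElementTree as ET

# Namespace declared by ALTO 4 files.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}


def read_lines(alto_xml):
    """Return the ID and text of every TextLine in an ALTO 4 document string."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//a:TextLine", ALTO_NS):
        # A line may hold a single String or one String per word.
        words = [s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)]
        lines.append({"id": line.get("ID"), "text": " ".join(words)})
    return lines
```

Because the function only looks for `TextLine` and `String`, it does not depend on the segmentation vocabulary used by the transcription model.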

Pre-Annotated Text Body

Currently, the application is designed to recognize zones and lines of text on a page whose labels conform to SegmOnto's controlled vocabulary. The application cannot generate a <body> from ALTO-XML files in which a line or zone's @TAGREF is not part of the SegmOnto vocabulary.

However, with an XSL Transformation, a user can extract specific lines of text from the <sourceDoc> according to their own @TAGREF system and custom build the TEI-XML file's <body>.
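The same kind of selection can also be sketched in Python rather than XSLT. The example below works on an ALTO 4 page instead of the generated <sourceDoc>, whose exact structure is not described here. It assumes that tags are declared as `OtherTag` elements with `ID` and `LABEL` attributes and that lines point to them through a `TAGREFS` attribute, as the ALTO 4 schema provides; the function name `lines_with_label` and the label used in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace declared by ALTO 4 files.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}


def lines_with_label(alto_xml, label):
    """Return the text of every TextLine whose tag reference resolves to `label`."""
    root = ET.fromstring(alto_xml)
    # Map the tag IDs declared under <Tags> to their human-readable labels.
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind("a:Tags/a:OtherTag", ALTO_NS)
    }
    selected = []
    for line in root.iterfind(".//a:TextLine", ALTO_NS):
        # TAGREFS may list several space-separated tag IDs.
        refs = line.get("TAGREFS", "").split()
        if any(labels.get(ref) == label for ref in refs):
            words = [s.get("CONTENT", "") for s in line.iterfind("a:String", ALTO_NS)]
            selected.append(" ".join(words))
    return selected
```

A call such as `lines_with_label(page_xml, "MyCustomLine")` would collect only the lines carrying that custom label, which could then be placed in a hand-built <body>.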

Owner

  • Name: Kelly Christensen
  • Login: kat-kel
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this corpus, please cite it as below."
authors:
  - family-names: Christensen
    given-names: Kelly
    orcid: https://orcid.org/0000-0002-7236-874X
  - family-names: Gabay
    given-names: Simon
    orcid: https://orcid.org/0000-0001-9094-4475
  - family-names: Pinche
    given-names: Ariane
    orcid: https://orcid.org/0000-0002-7843-5050
title: "Alto 2 Tei"
date-released: 2022
url: "https://github.com/kat-kel/alto2tei"
identifiers:
  - type: doi
    value: 'zenodo'

Issues and Pull Requests


All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • alix-tz (1)

Dependencies

setup.py pypi
  • PyYAML ==6.0
  • certifi ==2022.6.15
  • charset-normalizer ==2.1.0
  • idna ==3.3
  • lxml ==4.9.1
  • requests ==2.28.1
  • urllib3 ==1.26.11