abbreviationes

Automated pipeline for expanding medieval Latin abbreviations encoded in TEI using finetuned ByT5. Drop your TEI files, run five scripts, and get a Hugging Face dataset plus a lightweight LoRA adapter for ByT5 that turns graphemic ATR output into expanded text.

https://github.com/michaelscho/abbreviationes

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

abbreviations byt5 text2text

Scientific Fields

Sociology Social Sciences - 87% confidence
Artificial Intelligence and Machine Learning Computer Science - 76% confidence
Engineering Computer Science - 60% confidence
Last synced: 4 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: michaelscho
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 14.6 KB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Topics
abbreviations byt5 text2text
Created 5 months ago · Last pushed 5 months ago
Metadata Files
Readme License Citation

README.md

Abbreviationes


Project Summary

The repository delivers an end‑to‑end pipeline for extracting and balancing data, building a Hugging Face dataset, fine‑tuning a model, and running inference for the automatic expansion of scribal abbreviations in medieval texts. Originally designed for Burchards Dekret Digital (Akademie der Wissenschaften und der Literatur │ Mainz), the code has been generalized so that any TEI corpus that marks abbreviations with

```xml
<choice><abbr>…</abbr><expan>…</expan></choice>
```

can be processed with minimal adaptation. Put your XML files into ./data/input/, execute the scripts in numerical order, and you will obtain:

  • 3‑line sliding‑window pairs (abbr → expan)
  • optional balanced variant to better represent rare abbreviations
  • a Hugging Face Dataset (local or pushed to the Hub)
  • a LoRA adapter for google/byt5‑base

Background and Motivation

The Decretum Burchardi, compiled in the early eleventh century, contains a dense and systematic use of abbreviations. In preparing a digital critical edition it is therefore necessary to record, for every occurrence, both the original abbreviation and its editorial expansion. TEI‑XML enables such encoding through the <choice> element, which encloses an <abbr> (abbreviated form) and an <expan> (expanded form). Manually inserting such structures throughout a long manuscript is labour‑intensive, so editors have resorted to static wordlists or to text recognition models trained to output expanded text directly. As both approaches have significant disadvantages, the present toolkit adopts another strategy based on deep learning and a separation of concerns: a graphemic ATR model first produces an un‑expanded transcription that preserves every brevigraph, and a separate ByT5 text‑to‑text model generates the expansions in a second processing stage, retaining full transparency.

Methodology

1. Graphemic Transcription

A dedicated ATR model is trained to produce a graphemic transcription in which every brevigraph is mapped to a unique Unicode code point, without normalisation or expansion, so that the transcription represents what the scribe actually wrote. This raw output then serves as input for the subsequent processing stages.
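By way of illustration, such a graphemic inventory might reserve dedicated (e.g. MUFI) code points for common brevigraphs; the mapping below is a hypothetical sketch, not the project's actual character table:

```python
# Hypothetical brevigraph inventory: each abbreviation mark stays a distinct
# code point in the graphemic transcription instead of being expanded.
BREVIGRAPHS = {
    "\uA751": "per / par",        # ꝑ  LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER
    "\uA753": "pro",              # ꝓ  LATIN SMALL LETTER P WITH FLOURISH
    "\uA76F": "con / final -us",  # ꝯ  LATIN SMALL LETTER CON
    "\u0305": "omitted m / n",    # combining overline used as a nasal bar
}
```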

2. Abbreviation‑Expansion Pipeline

2.1. Ground‑Truth Data Creation (01_create_ground_truth_from_tei.py)

  1. Save all TEI files in data/input/.
  2. Segment the data at <lb/> and merge line breaks flagged break="no" so that divided words are re‑united.
  3. Slice the text into a sliding window by concatenating three consecutive manuscript lines (WINDOW_SIZE is configurable). Because medieval manuscripts rarely mark sentence boundaries and ATR systems output one line at a time, the manuscript line is taken as the atomic unit in this process. The three‑line window restores a minimal grammatical context while remaining close to the eventual inference input, which will probably come from PAGE XML or a similar format.
  4. Extract pairs only when at least one <abbr> occurs, keeping two strings per window: source_text with the brevigraphs, and target_text with each <expan> substituted (a minimal sketch of this extraction is given below).
  5. Write the result to data/output/training_data.tsv.

Where to adapt: You can modify window size, extend the list of TEI elements to search (defaults: <p>, <head>, <note>), or adjust the exclude list for project‑specific markup.
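For orientation, here is a minimal sketch of the windowing and pair extraction under simplifying assumptions (it skips the break="no" merging, the element and exclusion lists, and the TSV output handled by 01_create_ground_truth_from_tei.py; the sample fragment is invented):

```python
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
WINDOW_SIZE = 3  # consecutive manuscript lines per training pair

def lines_from_paragraph(p):
    """Split a <p> at <lb/> milestones into (source, target) line pairs:
    the source keeps each <abbr>, the target substitutes the <expan>."""
    lines, src, tgt = [], [], []

    def flush():
        if src or tgt:
            lines.append(("".join(src).strip(), "".join(tgt).strip()))
            src.clear()
            tgt.clear()

    if p.text:
        src.append(p.text)
        tgt.append(p.text)
    for child in p:
        if child.tag == TEI + "lb":
            flush()
        elif child.tag == TEI + "choice":
            src.append(child.findtext(TEI + "abbr", default=""))
            tgt.append(child.findtext(TEI + "expan", default=""))
        if child.tail:
            src.append(child.tail)
            tgt.append(child.tail)
    flush()
    return lines

def sliding_windows(lines, size=WINDOW_SIZE):
    """Concatenate `size` consecutive lines and keep only windows in which
    the source still differs from the target, i.e. contains an abbreviation."""
    for i in range(len(lines) - size + 1):
        src = " ".join(s for s, _ in lines[i:i + size])
        tgt = " ".join(t for _, t in lines[i:i + size])
        if src != tgt:
            yield src, tgt

# Illustrative TEI fragment (invented, not taken from the edition):
sample = ('<p xmlns="http://www.tei-c.org/ns/1.0">Si quis<lb/>'
          '<choice><abbr>ep&#x304;s</abbr><expan>episcopus</expan></choice> aut<lb/>'
          'presbyter oblationem</p>')
for src, tgt in sliding_windows(lines_from_paragraph(etree.fromstring(sample)), size=2):
    print(src, tgt, sep="\t")
```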

2.2. Balancing Rare Abbreviations (02_augment_dataset.py)

Rare brevigraphs risk being under‑represented in the training data. The script analyses abbreviation frequencies and duplicates any row containing a form that appears fewer than N times, producing training_data_augmented.tsv.
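The duplication logic can be sketched roughly as follows (it assumes a header row with the source_text/target_text columns described above and treats single non‑ASCII code points as brevigraphs; the real script's frequency analysis may track whole abbreviated forms and a different threshold):

```python
from collections import Counter
import pandas as pd

MIN_COUNT = 10  # N: forms seen fewer than this many times count as rare (assumption)

df = pd.read_csv("data/output/training_data.tsv", sep="\t")

# Count every non-ASCII code point in the un-expanded source column.
counts = Counter(ch for text in df["source_text"] for ch in text if ord(ch) > 127)
rare = {ch for ch, n in counts.items() if n < MIN_COUNT}

# Duplicate every row that contains at least one rare brevigraph.
mask = df["source_text"].apply(lambda text: any(ch in rare for ch in text))
augmented = pd.concat([df, df[mask]], ignore_index=True)
augmented.to_csv("data/output/training_data_augmented.tsv", sep="\t", index=False)
```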

2.3. Dataset Packaging (03_create_huggingface_dataset.py)

Converts either TSV file into a Hugging Face Dataset and pushes the data to your account on the Hub (or keeps it local).
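With the datasets library, this step amounts to roughly the following (the train/test split, local path, and Hub repository id are placeholders, not taken from the script):

```python
import pandas as pd
from datasets import Dataset

df = pd.read_csv("data/output/training_data_augmented.tsv", sep="\t")
ds = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

ds.save_to_disk("data/output/hf_dataset")                           # local copy
ds.push_to_hub("your-username/latin-abbreviations", private=True)   # placeholder repo id
```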

2.4. Model Training (04_train_model.py)

  • Backbone: google/byt5‑base loaded in 8‑bit.
  • LoRA adapter: r = 32, α = 64. The adapter is trained for 5 epochs, long enough to reach convergence on a medium‑sized Latin corpus without over‑fitting, using a cosine learning‑rate schedule, bfloat16 arithmetic, and mixed precision to keep memory and energy consumption low.
  • Monitoring: NVML callback logs total GPU energy.
  • Output: adapters saved to ./models/ and optionally pushed to the Hub.
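Sketched with transformers and peft, and under several assumptions (target modules, learning rate, batch size, and preprocessing are not stated in the README; the NVML energy‑logging callback is omitted), the training setup might look roughly like this:

```python
from datasets import load_from_disk
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments)

MODEL = "google/byt5-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit backbone
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter with the hyper-parameters named above (r = 32, alpha = 64);
# the target modules are an assumption (typical T5 attention projections).
model = get_peft_model(model, LoraConfig(
    r=32, lora_alpha=64, target_modules=["q", "v"], task_type="SEQ_2_SEQ_LM"))

dataset = load_from_disk("data/output/hf_dataset")  # path is an assumption

def preprocess(batch):
    enc = tokenizer(batch["source_text"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(batch["target_text"], truncation=True,
                              max_length=512)["input_ids"]
    return enc

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./models",
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    learning_rate=2e-4,             # assumption
    per_device_train_batch_size=8,  # assumption
    bf16=True,
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("test"),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
model.save_pretrained("./models/byt5-abbr-lora")  # adapter weights only
```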

2.5. Inference (05_use_model.py)

Loads the ByT5 backbone plus LoRA adapter. Input can be either a plain‑text file (--file) or the built‑in demo lines. Beam‑5 decoding with nucleus sampling (top‑p 0.95) balances precision and diversity; a repetition penalty prevents degenerate loops. Output is printed to stdout for immediate post‑processing.
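A minimal sketch of such an inference call (the adapter path, input line, and repetition‑penalty value are illustrative assumptions):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE = "google/byt5-base"
ADAPTER = "./models/byt5-abbr-lora"  # hypothetical adapter path or Hub repo id

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

line = "si q\u0304s ep\u0304s fuerit ordinatus"  # illustrative un-expanded input
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        num_beams=5,             # beam-5 decoding
        do_sample=True,
        top_p=0.95,              # nucleus sampling
        repetition_penalty=1.2,  # value is an assumption
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```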

Setup

```bash
# clone the repository, then:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Owner

  • Login: michaelscho
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Schonhardt"
  given-names: "Michael"
  orcid: "https://orcid.org/0000-0002-2750-1900"
title: "Abbreviationes"
version: 1.0.0
date-released: 2025-07-31
url: "https://github.com/michaelscho/Abbreviationes"

GitHub Events

Total
  • Release event: 1
  • Watch event: 1
  • Fork event: 1
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 1
  • Fork event: 1
  • Create event: 2

Dependencies

requirements.txt pypi
  • Jinja2 ==3.1.6
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • accelerate ==1.9.0
  • aiohappyeyeballs ==2.6.1
  • aiohttp ==3.12.14
  • aiosignal ==1.4.0
  • attrs ==25.3.0
  • bitsandbytes ==0.46.1
  • certifi ==2025.7.14
  • charset-normalizer ==3.4.2
  • datasets ==4.0.0
  • dill ==0.3.8
  • filelock ==3.18.0
  • frozenlist ==1.7.0
  • fsspec ==2025.3.0
  • hf-xet ==1.1.5
  • huggingface-hub ==0.33.4
  • idna ==3.10
  • inquirerpy ==0.3.4
  • lxml ==6.0.0
  • mpmath ==1.3.0
  • multidict ==6.6.3
  • multiprocess ==0.70.16
  • networkx ==3.5
  • numpy ==2.3.1
  • nvidia-cublas-cu12 ==12.8.3.14
  • nvidia-cuda-cupti-cu12 ==12.8.57
  • nvidia-cuda-nvrtc-cu12 ==12.8.61
  • nvidia-cuda-runtime-cu12 ==12.8.57
  • nvidia-cudnn-cu12 ==9.7.1.26
  • nvidia-cufft-cu12 ==11.3.3.41
  • nvidia-cufile-cu12 ==1.13.0.11
  • nvidia-curand-cu12 ==10.3.9.55
  • nvidia-cusolver-cu12 ==11.7.2.55
  • nvidia-cusparse-cu12 ==12.5.7.53
  • nvidia-cusparselt-cu12 ==0.6.3
  • nvidia-ml-py ==12.575.51
  • nvidia-nccl-cu12 ==2.26.2
  • nvidia-nvjitlink-cu12 ==12.8.61
  • nvidia-nvtx-cu12 ==12.8.55
  • packaging ==25.0
  • pandas ==2.3.1
  • peft ==0.16.0
  • pfzy ==0.3.4
  • pillow ==11.0.0
  • prompt_toolkit ==3.0.51
  • propcache ==0.3.2
  • psutil ==7.0.0
  • pyarrow ==21.0.0
  • pynvml ==12.0.0
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.2
  • regex ==2024.11.6
  • requests ==2.32.4
  • safetensors ==0.5.3
  • six ==1.17.0
  • sympy ==1.14.0
  • tokenizers ==0.21.2
  • torch ==2.7.1
  • torchaudio ==2.7.1
  • torchvision ==0.22.1
  • tqdm ==4.67.1
  • transformers ==4.53.2
  • triton ==3.3.1
  • typing_extensions ==4.14.1
  • tzdata ==2025.2
  • urllib3 ==2.5.0
  • wcwidth ==0.2.13
  • xxhash ==3.5.0
  • yarl ==1.20.1