ssrq-retro-lab
Experiments on the application of generative AI for the retro-digitisation of printed editions
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 21 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, zenodo.org
- ○ Academic email domains: not found
- ○ Institutional organization owner: not found
- ○ JOSS paper metadata: not found
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
SSRQ Retro Lab 
This repository contains code (Python scripts as well as Jupyter notebooks) and data of retrodigitized units from the collection of Swiss Law Sources (SLS). The data is used for various experiments to evaluate the quality of the digitization process, improve the quality of OCR results, and develop a workflow for the retrodigitization of the SLS collection. Furthermore, it demonstrates potential ways for further use of the data by employing advanced methods such as topic modeling or named entity recognition.
Background
Swiss Law Sources
The Swiss Law Sources were established at the end of the 19th century by the Swiss Lawyers' Association with the aim of making the sources of Switzerland's legal history accessible to an interested public. The collection of legal sources is nowadays supported by a foundation established in 1980, which hosts the ongoing research project under the direction of Pascale Sutter. About 15 years ago, the foundation decided to start digitizing the collection of Swiss law sources. The result of this process is the online platform "SSRQ Online", which makes all scanned books available to the public as PDFs. The PDFs have been processed by OCR software, but no correction or any other post-processing (e.g. annotation of named entities) has been done so far. The PDFs are just the starting point for a long journey of further processing and analysis.
Idea of the 'Retro Lab'
The idea of the 'Retro Lab' is to use the digitized volumes of the SLS collection as a test bed for various experiments. Different methods and tools are used to evaluate the quality of the digitization process, to improve the quality of the OCR results and to develop a workflow for the retrodigitization of the SLS collection. A special focus is on the usage of generative AI models like GPT-3.5/4 to create an advanced processing pipeline, where most of the hard work is done by the AI.
Data and Code
Data
The data is stored in the folder data. It contains the following subfolders:
- export: Contains a ground-truth transcription of 53 pages from two volumes. This transcription was created in Transkribus and exported as a txt file.
- ZG: Contains the OCR results of the volume "ZG" (Zug) as a PDF file. The OCR results were created by the OCR software ABBYY FineReader. Furthermore, it contains training and validation data as txt and json files.
Code
The code of the project is divided into two parts:
- Utility code, organized in Python modules (everything beneath src)
- Analysis code, organized in Jupyter notebooks (everything beneath notebooks)
All dependencies are listed in the pyproject.toml file. The code is written for Python >= 3.11. Virtual environments are managed with hatch. To create a new virtual environment, run hatch env create in the root directory of the project; to activate it, run hatch env shell. The environment will have all dependencies installed.
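The setup steps described above, as shell commands (a sketch; this assumes hatch itself is already installed, e.g. via pipx):

```shell
# Run from the root directory of the project.
hatch env create   # create the virtual environment from pyproject.toml
hatch env shell    # spawn a shell with the environment activated

# The notebooks call the OpenAI API and expect a valid key.
export OPENAI_API_KEY="sk-..."
```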
Note: You will need a valid API key for the OpenAI API to run the notebooks.
Experiments
v1 of the experiments
For the first iteration of the experiments, take a look at the v1 branch.
v2 of the experiments
The second iteration of the experiments uses a slightly different approach. Instead of relying solely on the extracted plain text and using a Large Language Model for all further processing (such as recognizing individual documents), a mixed approach is used that combines 'classical' methods with a Large Language Model. A pipeline built from Python scripts calls an LLM only for the parts where it is really needed. The pipeline is shown in the following figure:

Each component is validated by a simple set of tests, which are located in the tests folder.
No Langchain – why? Langchain is a powerful but also complex framework, and most of its features are not needed for these experiments. Instead, a custom pipeline (chain) is created that is tailored to the needs of the experiments.
Pipeline Components
Text Extraction Component
The text extraction component is responsible for extracting the plain text (as an HTML string) from the PDF(s). The document/article number is used as an input parameter. Besides the PDF, it uses the XML table of contents that "SSRQ Online" is based on. It returns an object with the following structure:
```python
class TextExtractionResult(TypedDict):
    entry: VolumeEntry
    pages: tuple[str, ...]
```
The VolumeEntry is a simple data class containing the metadata of the volume. The pages attribute is a tuple of strings, where each string is the extracted text of one page as an HTML string. The extraction and HTML conversion are handled by PyMuPDF.
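For illustration, the result object could be assembled like this (the VolumeEntry fields shown here are hypothetical; the real data class in the repository may look different):

```python
from dataclasses import dataclass
from typing import TypedDict


@dataclass
class VolumeEntry:
    # Hypothetical fields; the actual metadata class may differ.
    volume: str
    article_number: int
    title: str


class TextExtractionResult(TypedDict):
    entry: VolumeEntry
    pages: tuple[str, ...]


def build_result(entry: VolumeEntry, html_pages: list[str]) -> TextExtractionResult:
    """Bundle per-page HTML strings with the volume metadata."""
    return {"entry": entry, "pages": tuple(html_pages)}


result = build_result(
    VolumeEntry(volume="ZG", article_number=42, title="Example article"),
    ["<p>page one</p>", "<p>page two</p>"],
)
print(len(result["pages"]))  # → 2
```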
HTML Wrangler Component
This component takes the extracted HTML string(s) and tries to extract the relevant HTML elements for the requested article. Like the first component, it does not use an LLM.
Shortcomings of this component:
- Relies on the correct structure of the HTML string
- Relies on the OCR results of the PDF
- Quick & dirty implementation to find the relevant HTML elements
Returns the following object:
```python
class HTMLTextExtractionResult(TextExtractionResult):
    article: tuple[Selector, ...]
```
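The real component filters nodes with parsel Selectors; as a rough stdlib sketch of the same idea (collecting the text of candidate elements from a page's HTML), one could write:

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element from a page's HTML string.

    This is only a stand-in to illustrate filtering the extracted HTML
    down to the nodes that belong to one article; the repository uses
    parsel Selectors instead.
    """

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self.paragraphs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


page_html = "<div><p>Art. 42</p><p>Der Rat beschliesst ...</p></div>"
collector = ParagraphCollector()
collector.feed(page_html)
print(collector.paragraphs)  # → ['Art. 42', 'Der Rat beschliesst ...']
```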
Text Classifier Component
This component takes the nodes extracted in the previous step and tries to classify each of them, using the default GPT-4 model from OpenAI. The classification is done with a prompt containing few-shot examples.
Returns the following object:
```python
class StructuredArticle(BaseModel):
    article_number: int
    date: str
    references: list[str]
    summary: list[str]
    text: list[str]
    title: str
```
The component is tested against a few examples; the accuracy on these test cases is above 90%. See the test cases for more details.
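A few-shot classification prompt of this kind could be assembled roughly as follows (the labels and example lines are hypothetical; the repository's actual prompts are Jinja2 templates):

```python
# Hypothetical few-shot examples; the real label set may differ.
FEW_SHOT_EXAMPLES = [
    ("1526 Mai 4.", "date"),
    ("StALU, Urk. 123", "reference"),
    ("Der Rat von Zug beschliesst ...", "text"),
]


def build_classification_prompt(node_text: str) -> str:
    """Assemble a few-shot classification prompt for a single HTML node."""
    lines = [
        "Classify the following line from a digitized legal source.",
        "Possible labels: title, date, summary, reference, text.",
        "",
    ]
    for example, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Line: {example}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Line: {node_text}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_classification_prompt("Jtem der rat hat erkent ...")
print(prompt)
```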
OCR Correction Component
This component uses the structured article created in the previous step and tries to correct the OCR results. It uses a fine-tuned GPT-3.5 model for this task. The data used for fine-tuning can be found in the data section. Some validation is done in a Jupyter notebook.
It returns the following object:
```python
class StructuredCorrectedArticle(StructuredArticle):
    corrected_references: CorrectedOCRText
    corrected_summary: CorrectedOCRText
    corrected_text: CorrectedOCRText
```
Things left open:
- [ ] Correct summary and references
- [ ] Implement better validation for the OCR correction
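One simple way to validate an OCR correction against the ground-truth transcription is a character-level similarity ratio, e.g. with Python's difflib (a sketch, not the repository's actual validation code):

```python
import difflib


def ocr_similarity(corrected: str, ground_truth: str) -> float:
    """Character-level similarity ratio between the model's corrected
    output and the ground-truth transcription (1.0 = identical)."""
    return difflib.SequenceMatcher(None, corrected, ground_truth).ratio()


truth = "Der Rat von Zug beschliesst"
raw_ocr = "Der Rat v0n Zvg beschliesst"   # two typical OCR confusions
corrected = "Der Rat von Zug beschliesst"

print(round(ocr_similarity(raw_ocr, truth), 2))    # → 0.93
print(round(ocr_similarity(corrected, truth), 2))  # → 1.0
```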
NER Annotation Component
As the last processing step, Named Entity Recognition (NER) is performed. The NER is backed by spacy-llm, which uses a GPT-4 model and parses the output into a structured spacy document. Some simple validation is done here as well.
It returns a tuple, which contains the StructuredCorrectedArticle and the spacy.Doc.
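Independent of spacy-llm, the core alignment problem of this step (mapping an LLM's entity list back onto character offsets in the corrected text) can be sketched as follows; the helper and labels are hypothetical:

```python
def align_entities(
    text: str, entities: list[tuple[str, str]]
) -> list[tuple[int, int, str]]:
    """Locate each (surface form, label) pair in the text and return
    (start, end, label) character spans, skipping forms not found."""
    spans = []
    cursor = 0
    for surface, label in entities:
        start = text.find(surface, cursor)
        if start == -1:
            continue  # the LLM altered or hallucinated the surface form
        spans.append((start, start + len(surface), label))
        cursor = start + len(surface)
    return spans


text = "Der Rat von Zug schreibt an Hans Muller."
spans = align_entities(text, [("Zug", "LOC"), ("Hans Muller", "PER")])
print(spans)  # → [(12, 15, 'LOC'), (28, 39, 'PER')]
```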
TEI Conversion Component
Last but not least, the result is converted into a TEI XML file. The file is created from a simple template, which is filled with the data of the StructuredCorrectedArticle and the spacy.Doc. The template can be found here.
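The template-filling step can be illustrated with a minimal stdlib sketch (the element structure here is hypothetical and far simpler than the project's real TEI template):

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def to_tei(title: str, date: str, paragraphs: list[str]) -> str:
    """Serialize a corrected article into a (heavily simplified) TEI document."""
    ET.register_namespace("", TEI_NS)  # emit TEI as the default namespace
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    title_el = ET.SubElement(header, f"{{{TEI_NS}}}title")
    title_el.text = title
    body = ET.SubElement(ET.SubElement(tei, f"{{{TEI_NS}}}text"), f"{{{TEI_NS}}}body")
    date_el = ET.SubElement(body, f"{{{TEI_NS}}}date")
    date_el.text = date
    for para in paragraphs:
        p = ET.SubElement(body, f"{{{TEI_NS}}}p")
        p.text = para
    return ET.tostring(tei, encoding="unicode")


xml = to_tei("Ratsbeschluss", "1526 Mai 4.", ["Der Rat beschliesst ..."])
print(xml)
```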
Demo
The following video shows a demo of the complete process in a simple UI built with gradio. To speed up the demo, an article that was already processed is used; the results for this article are retrieved from a cache, which is implemented with the diskcache library.
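The caching idea can be sketched with an in-memory stand-in (the project itself uses diskcache for a persistent on-disk cache); serving an already-processed article means the expensive pipeline is never re-run:

```python
import functools


@functools.lru_cache(maxsize=None)
def process_article(article_number: int) -> str:
    # Stand-in for the expensive pipeline (PDF extraction, LLM calls, ...).
    print(f"running pipeline for article {article_number}")
    return f"<TEI>article {article_number}</TEI>"


process_article(42)  # first call runs the pipeline
process_article(42)  # second call is served from the cache
print(process_article.cache_info().hits)  # → 1
```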
To-Dos
This experiment is a first prototype; it is not ready for production use, and some things are left open:
- [ ] Implement better validation for all components
- [ ] Implement a 'Human in the Loop' for all steps in the pipeline
- [ ] Improve performance by bundling requests and/or using concurrent requests to external services (like the OpenAI API)
- [ ] Implement checks for the prompts sent to the LLM (e.g. check the length of the prompt)
- ...
Talks
The work done here will be presented in the context of the following talks:
- Bastian Politycki, Pascale Sutter, Christian Sonder: „Datenschätze heben. Ein Bericht zur Digitalisierung der Sammlung Schweizerischer Rechtsquellen". Editions als Transformation. Plenartagung der AG für germanistische Edition, 21–24 February 2024, Bergische Universität Wuppertal. Slides will be linked here after the talk.
- Bastian Politycki: „Anwendung generativer KI zur Digitalisierung gedruckter Editionen am Beispiel der Sammlung Schweizerischer Rechtsquellen". W8: Generative KI, LLMs und GPT bei digitalen Editionen, DHd2024 Passau, 26 February–1 March 2024. Slides will be linked here after the talk.
Authors
Bastian Politycki – University of St. Gallen / Swiss Law Sources
References
- Ekin, Sabit. "Prompt Engineering For ChatGPT: A Quick Guide To Techniques, Tips, And Best Practices". Preprint, 29 April 2023. https://doi.org/10.36227/techrxiv.22683919.v1.
- González-Gallardo, Carlos-Emiliano, Emanuela Boros, Nancy Girdhar, Ahmed Hamdi, Jose G. Moreno, and Antoine Doucet. "Yes but.. Can ChatGPT Identify Entities in Historical Documents?" arXiv, 30 March 2023. https://doi.org/10.48550/arXiv.2303.17322.
- Liu, Yuliang, Zhang Li, Hongliang Li, Wenwen Yu, Yang Liu, Biao Yang, Mingxin Huang, et al. "On the Hidden Mystery of OCR in Large Multimodal Models". arXiv, 18 June 2023. https://doi.org/10.48550/arXiv.2305.07895.
- Møller, Anders Giovanni, Jacob Aarup Dalsgaard, Arianna Pera, and Luca Maria Aiello. "Is a Prompt and a Few Samples All You Need? Using GPT-4 for Data Augmentation in Low-Resource Classification Tasks". arXiv, 26 April 2023. http://arxiv.org/abs/2304.13861.
- Pollin, Christopher. Workshop series "Angewandte Generative KI in den (digitalen) Geisteswissenschaften" (v1.0.0). Zenodo, 2023. https://doi.org/10.5281/zenodo.10065626.
- Rockenberger, Annika. "Automated Text Recognition with ChatGPT 4". Annika Rockenberger (blog), 19 October 2023. https://www.annikarockenberger.com/2023-10-19/automated-text-recognition-with-chatgpt-4/.
- Zhou, Wenxuan, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. "UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition". 2023. https://doi.org/10.48550/ARXIV.2308.03279.
Tools used
- Jinja2 - for creating programmable prompt templates
- OpenAI Python SDK
- parsel
- pytest
- spacy
For a complete list see pyproject.toml.
Owner
- Name: Rechtsquellenstiftung des Schweizerischen Juristenvereins
- Login: SSRQ-SDS-FDS
- Kind: organization
- Email: info@ssrq-sds-fds.ch
- Location: St. Gallen
- Website: www.ssrq-sds-fds.ch
- Repositories: 1
- Profile: https://github.com/SSRQ-SDS-FDS
Research institution that publishes sources of old law up to 1798 in the collection of Swiss law sources.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Politycki"
given-names: "Bastian"
orcid: "https://orcid.org/0000-0002-6308-2424"
title: "SSRQ-Retro-Lab: Exploring the Application of LLMs for the Digitization of Printed Scholarly Editions."
version: 2.0.0-alpha.1
url: "https://github.com/SSRQ-SDS-FDS/ssrq-schema"