llpro

Pipeline for Processing German Literary Texts. Work in Progress.

https://github.com/cophi-wue/llpro



Basic Info

Host: GitHub
Owner: cophi-wue
License: gpl-3.0
Language: Prolog
Default Branch: main
Size: 5.41 MB

Statistics

Stars: 11
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 1

Created about 4 years ago · Last pushed 10 months ago


LLpro – A Literary Language Processing Pipeline for German Narrative Texts

An NLP pipeline for German literary texts, implemented in Python and spaCy (v3.5.2). Work in progress.

This pipeline implements several custom pipeline components using the spaCy API. Currently, the components perform:

* Tokenization and sentence splitting via SoMaJo (Proisl, Uhrig 2016). Version 2.4.
* POS tagging via SoMeWeTa (Proisl 2018). Version 1.8.1.
* Lemmatization and morphological analysis via RNNTagger (Schmid 2019). Version 1.4.1.
* Dependency parsing via ParZu (Sennrich, Schneider, Volk, Warin 2009; Sennrich, Volk, Schneider 2013; Sennrich, Kunz 2014). Commit a15ae7f.
* Named entity recognition via FLERT (Schweter, Akbik 2021). Version 0.12.2.
* Recognition of references to literary characters (proper nouns and common nouns, i.e. “Appellative”, cf. Krug et al., 2017) via a custom fine-tuned FLERT model aehrm/droc-character-recognizer.
* Tagging of German speech, thought and writing representation (STWR) via custom fine-tuned BERT embeddings, similar to Brunner, Tu, Weimer, Jannidis (2020); model aehrm/modernbert-redewiedergabe.
* Segmentation into scenes with BERT embeddings, using a custom fine-tuned re-implementation of a model by Kurfalı and Wirén (2021); model aehrm/stss-scene-segmenter.
* Coreference resolution via BERT embeddings (Schröder, Hatzel, Biemann 2021). Commit f34a99e.
* Annotation of event types on verbal phrases via BERT embeddings (Vauth, Hatzel, Gius, Biemann 2021). Version 0.2, commit 25fdf7e.

See also the section about the Output Format for a description of the tabular output format.

Usage

```text
usage: llpro_cli.py [-h] [-v] [--version] [-X OPT] [--stdout | --writefiles DIR] --infiles FILE [FILE ...]

NLP Pipeline for literary texts written in German.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  --version             show program's version number and exit
  -X OPT, --component-config OPT
                        Component parameters of the form component_name.opt=value
  --stdout              Write all processed tokens to stdout.
  --writefiles DIR      For each input file, write processed tokens to a separate file in DIR.
  --infiles FILE [FILE ...]
                        Input files, or directories.
```

Note: you can specify the resources directory (containing ParZu etc.) with the environment variable LLPRO_RESOURCES_ROOT, and the temporary workdir with the environment variable LLPRO_TEMPDIR.
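For scripted runs, the command line and the two environment variables can be assembled from Python. The following is a minimal sketch, not part of LLpro: the helper `build_llpro_invocation` is hypothetical, the script path `./bin/llpro_cli.py` is taken from the local installation example, and the directory values passed at the bottom are placeholders. Only flags listed in the usage text are used.

```python
import os
import sys


def build_llpro_invocation(infiles, outdir, component_opts=None,
                           resources_root=None, tempdir=None):
    """Return (argv, env) for a run of llpro_cli.py; nothing is executed here."""
    argv = [sys.executable, "./bin/llpro_cli.py", "-v"]
    # Each component option becomes its own "-X component_name.opt=value" pair.
    for key, value in (component_opts or {}).items():
        argv += ["-X", f"{key}={value}"]
    argv += ["--writefiles", str(outdir), "--infiles", *map(str, infiles)]

    # Copy the current environment and add the two LLpro variables if given.
    env = dict(os.environ)
    if resources_root is not None:
        env["LLPRO_RESOURCES_ROOT"] = str(resources_root)
    if tempdir is not None:
        env["LLPRO_TEMPDIR"] = str(tempdir)
    return argv, env


# Placeholder paths; adjust to the actual checkout and working directory.
argv, env = build_llpro_invocation(
    ["files/in"], "files/out",
    component_opts={"somajo_tokenizer.is_pretokenized": True},
    resources_root="resources", tempdir="/tmp/llpro",
)
print(" ".join(argv[1:]))
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` from within the prepared poetry environment.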

Component options

Several components can be configured with the -X option. Notably:

* -X somajo_tokenizer.is_pretokenized=True skips tokenization, and assumes tokens separated by whitespace.
* -X somajo_tokenizer.is_presentencized=True skips sentence splitting, and assumes sentences separated by newlines.
* -X somajo_tokenizer.normalize_tokens=False does not normalize tokens. Incompatible with is_pretokenized=False.
* -X somajo_tokenizer.paragraph_separator='PAT' sets the paragraph separator pattern. Input text is split into paragraphs at pattern occurrences, and sentences always terminate on paragraph boundaries. Like Python's re.split, if capturing parentheses are used in PAT, then the text of each group in the pattern is also returned as a paragraph. Performed before tokenization/sentence splitting.
* -X somajo_tokenizer.section_pattern='PAT' sets the sectioning paragraph pattern. Paragraphs fully matching the pattern are removed, except for any group captured by parentheses in PAT. Performed before tokenization/sentence splitting.
* -X coref_uhhlt.split_method='section' performs coreference resolution only at section level.
* -X <component_name>.disable=True disables the specific component.
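The behaviour described for paragraph_separator and section_pattern can be illustrated with plain `re` calls. This is a sketch of the described semantics only, not LLpro's actual implementation: the function names, the sample text and both patterns are made up for illustration, and how LLpro itself treats empty fragments is an assumption here.

```python
import re


def split_paragraphs(text, paragraph_separator):
    """Split text at separator matches; captured groups are kept as paragraphs."""
    parts = re.split(paragraph_separator, text)
    # re.split yields None for groups that did not take part in a match and
    # may yield empty strings at the edges; both are dropped in this sketch.
    return [p for p in parts if p]


def apply_section_pattern(paragraphs, section_pattern):
    """Drop paragraphs that fully match the pattern, keeping captured groups."""
    kept = []
    for para in paragraphs:
        match = re.fullmatch(section_pattern, para)
        if match is None:
            kept.append(para)  # ordinary paragraph, left untouched
        else:
            kept.extend(g for g in match.groups() if g)
    return kept


text = "1. Ankunft\n\nEs war einmal.\n***\nUnd dann?"

# Blank lines separate paragraphs; a "***" line separates them and is itself
# returned as a paragraph because it sits inside capturing parentheses.
paragraphs = split_paragraphs(text, r"\n\n|\n(\*\*\*)\n")
print(paragraphs)

# "***" paragraphs vanish; numbered headings keep only their captured title.
print(apply_section_pattern(paragraphs, r"\*\*\*|\d+\. (.*)"))
```

With these inputs, the first call yields four paragraphs including the literal `***`, and the second call reduces them to `Ankunft`, `Es war einmal.` and `Und dann?`.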

Installation

The LLpro pipeline can be run either locally or as a Docker container. Running the pipeline using Docker is strongly recommended.

WINDOWS USERS: For building the Docker image, clone the repository using git clone https://github.com/aehrm/LLpro --config core.autocrlf=input to preserve line endings.

Building and running the Docker image

We strongly recommend using Docker to run the pipeline. With the provided Dockerfile, all dependencies and prerequisites are downloaded automatically.

```shell
cd LLpro
docker build --tag cophiwue/llpro .

# or, if you want experimental features enabled
docker build --build-arg LLPRO_EXPERIMENTAL=1 --tag cophiwue/llpro-experimental .
```

After building, the Docker image can be run like this:

```shell
mkdir -p files/in files/out
chmod a+w files/out  # make directory writeable from the Docker container

# copy files into ./files/in to be processed

# alternatively to --gpus all, use e.g. --gpus "device=0"
docker run \
    --rm \
    -e OMP_NUM_THREADS=4 \
    --gpus all \
    --interactive \
    --tty \
    -a stdout \
    -a stderr \
    -v "$(pwd)/files:/files" \
    cophiwue/llpro -v --writefiles /files/out --infiles /files/in

# processed files are located in ./files/out
```
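When the container is started from a script rather than a terminal, the same call can be built as an argument list. The sketch below is a hypothetical helper, not part of LLpro. It mirrors the flags of the shell example, with two assumptions: the environment variable is spelled `OMP_NUM_THREADS` (the standard OpenMP name), and `--interactive`/`--tty` are left out because `docker run --tty` typically fails when no terminal is attached.

```python
from pathlib import Path


def docker_run_command(files_dir, gpus="all", omp_threads=4,
                       image="cophiwue/llpro"):
    """Return the argv list for `docker run`; nothing is executed here."""
    return [
        "docker", "run", "--rm",
        "-e", f"OMP_NUM_THREADS={omp_threads}",
        "--gpus", gpus,  # "all", or e.g. "device=0"
        "-a", "stdout", "-a", "stderr",
        # Mount the host directory (absolute path) at /files in the container.
        "-v", f"{Path(files_dir).resolve()}:/files",
        image,
        # Everything after the image name is passed to llpro_cli.py.
        "-v", "--writefiles", "/files/out", "--infiles", "/files/in",
    ]


cmd = docker_run_command("files", gpus="device=0")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`; passing a list instead of a shell string avoids quoting problems with paths that contain spaces.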

Installing locally

Verify that the following dependencies are installed:

* Python (tested on version 3.7)
* For RNNTagger:
  - CUDA (tested on version 11.4)
* For ParZu:
  - SWI-Prolog >= 5.6
  - SFST >= 1.4
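A quick availability check for these prerequisites could look like the sketch below. It is only a rough aid under stated assumptions: the executable names `swipl` (SWI-Prolog), `fst-infl` (SFST) and `nvidia-smi` (NVIDIA driver tools) are guesses at what is on the PATH of a typical installation, the Python range is the constraint `>=3.7.1,<3.11` from pyproject.toml, and tool versions are not checked at all.

```python
import shutil
import sys


def check_prerequisites(executables=("swipl", "fst-infl", "nvidia-smi")):
    """Report whether Python is in range and which executables are on PATH."""
    report = {
        # pyproject.toml constraint: python >=3.7.1,<3.11
        "python": (3, 7, 1) <= sys.version_info[:3] < (3, 11),
    }
    for exe in executables:
        # shutil.which returns None if the executable cannot be found.
        report[exe] = shutil.which(exe) is not None
    return report


for name, ok in check_prerequisites().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

A missing entry does not necessarily mean the dependency is absent, only that it was not found under the assumed name.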

Execute poetry install and ./prepare.sh. The script downloads all remaining prerequisites. Example usage:

```shell
poetry install
./prepare.sh

# NOTICE: use the prepared poetry venv!
poetry run python ./bin/llpro_cli.py -v --writefiles files/out --infiles files/in

# if desired, run tests
poetry run pytest -vv
```

Developer Guide

See the separate Developer Guide about the implemented spaCy components and how to access the assigned attributes.

See also the separate document about the tabular Output Format for a description of the output format and a reference of the used tagsets.

See the folder ./contrib for scripts to reproduce the fine-tuning of the custom models.

Citing

If you use the LLpro software for academic research, please consider citing the accompanying publication:

Ehrmanntraut, Anton, Leonard Konle, and Fotis Jannidis. 2023. “LLpro: A Literary Language Processing Pipeline for German Narrative Texts.” In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023), ed. Munir Georges, Aaricia Herygers, Annemarie Friedrich, and Benjamin Roth, pp. 28–39. Ingolstadt, Germany: Association for Computational Linguistics. https://aclanthology.org/2023.konvens-main.3/

```bibtex
@inproceedings{ehrmanntraut-etal-2023-llpro,
    title = "{LL}pro: A Literary Language Processing Pipeline for {G}erman Narrative Texts",
    author = "Ehrmanntraut, Anton and Konle, Leonard and Jannidis, Fotis",
    editor = "Georges, Munir and Herygers, Aaricia and Friedrich, Annemarie and Roth, Benjamin",
    booktitle = "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)",
    date = "2023-09-18",
    address = "Ingolstadt, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.konvens-main.3/",
    pages = "28--39"
}
```

License

In accordance with the license terms of ParZu+Zmorge (GPL v2) and of SoMeWeTa (GPL v3), the LLpro pipeline is licensed under the terms of GPL v3. See LICENSE.

NOTICE: The code of the ParZu parser located in resources/ParZu has been modified to be compatible with LLpro. See git log -p df1e91a.. -- resources/ParZu for a summary of these changes.

NOTICE: Some subsystems and resources used by the LLpro pipeline have additional license terms:

* RNNTagger: see https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/Tagger-Licence
* SoMeWeTa model german_web_social_media_2020-05-28.model: derived from the TIGER corpus; see https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/license/htmlicense.html

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

References

Akbik, Alan, Duncan Blythe, and Roland Vollgraf. 2018. “Contextual String Embeddings for Sequence Labeling.” In COLING 2018, 27th International Conference on Computational Linguistics, 1638–49.

Brunner, Annelen, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2020. “To BERT or Not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of Four Types of Speech, Thought and Writing Representation.” In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2624:11. CEUR Workshop Proceedings. Zurich, Switzerland. http://ceur-ws.org/Vol-2624/paper5.pdf.

Krug, Markus, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, and Fotis Jannidis. 2017. “Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus].” https://resolver.sub.uni-goettingen.de/purl?gro-2/108301.

Kurfalı, Murathan, and Mats Wirén. 2021. “Breaking the Narrative: Scene Segmentation Through Sequential Sentence Classification.” In Proceedings of the Shared Task on Scene Segmentation, edited by Albin Zehe, Leonard Konle, Lea Dümpelmann, Evelyn Gius, Svenja Guhr, Andreas Hotho, Fotis Jannidis, et al., 3001:49–53. CEUR Workshop Proceedings. Düsseldorf, Germany. http://ceur-ws.org/Vol-3001/#paper6.

Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 665–70. Miyazaki, Japan: European Language Resources Association ELRA. http://www.lrec-conf.org/proceedings/lrec2018/pdf/49.pdf.

Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 57–62. Berlin, Germany: Association for Computational Linguistics (ACL). http://aclweb.org/anthology/W16-2607.

Schmid, Helmut. 2019. “Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts.” In DATeCH, Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 133–37. Brussels, Belgium: Association for Computing Machinery. https://www.cis.uni-muenchen.de/~schmid/papers/Datech2019.pdf.

Schröder, Fynn, Hans Ole Hatzel, and Chris Biemann. 2021. “Neural End-to-End Coreference Resolution for German in Different Domains.” In Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), 170–81. Düsseldorf, Germany: KONVENS 2021 Organizers. https://aclanthology.org/2021.konvens-1.15.

Schweter, Stefan, and Alan Akbik. 2021. “FLERT: Document-Level Features for Named Entity Recognition.” arXiv:2011.06993 [Cs], May. http://arxiv.org/abs/2011.06993.

Sennrich, Rico, and Beat Kunz. 2014. “Zmorge: A German Morphological Lexicon Extracted from Wiktionary.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1063–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf.

Sennrich, Rico, G. Schneider, M. Volk, M. Warin, C. Chiarcos, Richard Eckart de Castilho, and Manfred Stede. 2009. “A New Hybrid Dependency Parser for German.” In Proceedings of the GSCL Conference. Potsdam, Germany. https://doi.org/10.5167/UZH-25506.

Sennrich, Rico, Martin Volk, and Gerold Schneider. 2013. “Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-Tagging, and Morphological Analysis.” In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, 601–9. Hissar, Bulgaria: INCOMA Ltd. Shoumen, BULGARIA. https://www.aclweb.org/anthology/R13-1079.

Vauth, Michael, Hans Ole Hatzel, Evelyn Gius, and Chris Biemann. 2021. “Automated Event Annotation in Literary Texts.” In Proceedings of the Conference on Computational Humanities Research 2021, edited by Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, and Joris van Zundert, 2989:333–45. CEUR Workshop Proceedings. Amsterdam, the Netherlands. https://ceur-ws.org/Vol-2989/#short_paper18.

Owner

Name: Computerphilologie Uni Würzburg
Login: cophi-wue
Kind: organization
Location: Würzburg

Website: http://www.germanistik.uni-wuerzburg.de/lehrstuehle/computerphilologie/startseite/
Repositories: 14
Profile: https://github.com/cophi-wue

Lehrstuhl für Computerphilologie, Julius-Maximilians-Universität Würzburg

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Ehrmanntraut"
  given-names: "Anton"
- family-names: "Konle"
  given-names: "Leonard"
- family-names: "Jannidis"
  given-names: "Fotis"
title: "LLpro: A Literary Language Processing Pipeline for German Narrative Texts"
date-released: "2022-09-05"
license: GPL-3.0-or-later
url: "https://github.com/cophi-wue/LLpro"
version: 0.1.0
preferred-citation:
  type: conference-paper
  authors:
  - family-names: "Ehrmanntraut"
    given-names: "Anton"
  - family-names: "Konle"
    given-names: "Leonard"
  - family-names: "Jannidis"
    given-names: "Fotis"
  title: "LLpro: A Literary Language Processing Pipeline for German Narrative Texts"
  collection-title: "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)"
  year: 2023

GitHub Events

Total

Issues event: 2
Watch event: 1
Issue comment event: 1
Push event: 10
Pull request event: 2
Fork event: 1

Last Year

Issues event: 2
Watch event: 1
Issue comment event: 1
Push event: 10
Pull request event: 2
Fork event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 1
Average time to close issues: 25 days
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: 25 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

thvitt (1)

Pull Request Authors

thvitt (1)


Dependencies

resources/ParZu/Dockerfile docker

ubuntu 16.04 build

resources/uhh-lt-neural-coref/docker/Dockerfile docker

pytorch/torchserve ${TORCHSERVE_TAG} build

resources/uhh-lt-event-classify/poetry.lock pypi

121 dependencies

Dockerfile docker

nvidia/cuda 11.3.1-devel-ubuntu20.04 build

resources/uhh-lt-neural-coref/docker/requirements.txt pypi

graphviz *
numpy *
pyhocon *
scikit-learn ==0.22.1
tensorboard *
tqdm ==4.56.0
transformers ==4.2.1

resources/uhh-lt-neural-coref/requirements.txt pypi

graphviz *
numpy *
pyhocon *
scikit-learn ==0.22.1
tensorboard *
tqdm ==4.56.0
transformers ==4.9.1

poetry.lock pypi

128 dependencies

pyproject.toml pypi

datasets ^2.12.0 develop
de-dep-news-trf * develop
pytest >=5.2 develop
Cython ~=0.29
SoMaJo ^2.2
SoMeWeTa ~=1.8.1
cython ~=0.29
dill 0.3.6
flair ^0.12.2
more-itertools ^9.0.0
multiprocessing_on_dill ^3.5.0-alpha.4
omegaconf ^2.3.0
overrides ^7.3.1
pandas 1.3
pexpect ^4.8.0
pyhocon ^0.3.59
python >=3.7.1,<3.11
pytorch-transformers ^1.2.0
regex ^2022.10.31
spacy ~=3.5
spacy-transformers ^1.2.1
torch >=1.11

resources/uhh-lt-event-classify/pyproject.toml pypi