llpro
Pipeline for Processing German Literary Texts. Work in Progress.
Basic Info
- Host: GitHub
- Owner: cophi-wue
- License: gpl-3.0
- Language: Prolog
- Default Branch: main
- Size: 5.41 MB
Statistics
- Stars: 11
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
LLpro – A Literary Language Processing Pipeline for German Narrative Texts
An NLP pipeline for German literary texts, implemented in Python and spaCy (v3.5.2). Work in progress.
This pipeline implements several custom pipeline components using the spaCy API. Currently, the components perform:
* Tokenization and Sentence Splitting via SoMaJo (Proisl, Uhrig 2016). Version 2.4.
* POS tagging via SoMeWeTa (Proisl 2018). Version 1.8.1.
* Lemmatization and Morphological Analysis via RNNTagger (Schmid 2019). Version 1.4.1.
* Dependency Parsing via ParZu (Sennrich, Schneider, Volk, Warin 2009; Sennrich, Volk, Schneider 2013; Sennrich, Kunz 2014). Commit a15ae7f.
* Named Entity Recognition via FLERT (Schweter, Akbik 2021). Version 0.12.2.
* Recognition of References to literary Characters (proper nouns and common nouns, i.e. “Appellative”, cf. Krug et al., 2017) via a custom fine-tuned FLERT model aehrm/droc-character-recognizer.
* Tagging of German speech, thought and writing representation (STWR) via custom fine-tuned BERT embeddings, similar to Brunner, Tu, Weimer, Jannidis (2020); model aehrm/modernbert-redewiedergabe.
* Segmentation into Scenes via BERT Embeddings, using a custom fine-tuned re-implementation of a model by Kurfalı and Wirén (2021); model aehrm/stss-scene-segmenter.
* Coreference Resolution via BERT Embeddings (Schröder, Hatzel, Biemann 2021). Commit f34a99e.
* Annotating Event Types to verbal phrases via BERT Embeddings (Vauth, Hatzel, Gius, Biemann 2021). Version 0.2, Commit 25fdf7e.
See also the section about the Output Format for a description of the tabular output format.
Usage
```text
usage: llpro_cli.py [-h] [-v] [--version] [-X OPT] [--stdout | --writefiles DIR] --infiles FILE [FILE ...]

NLP Pipeline for literary texts written in German.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  --version             show program's version number and exit
  -X OPT, --component-config OPT
                        Component parameters of the form component_name.opt=value
  --stdout              Write all processed tokens to stdout.
  --writefiles DIR      For each input file, write processed tokens to a separate file in DIR.
  --infiles FILE [FILE ...]
                        Input files, or directories.
```
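To illustrate the `component_name.opt=value` form accepted by `-X`, the following is a minimal sketch of how such options could be collected into a per-component dictionary. It is not LLpro's actual implementation: the helper `parse_component_options` and the literal-evaluation of values are assumptions made for this example.

```python
import argparse
import ast
from collections import defaultdict


def parse_component_options(pairs):
    """Turn ['comp.opt=value', ...] into {'comp': {'opt': value}}.

    Hypothetical helper for illustration; not part of LLpro.
    """
    config = defaultdict(dict)
    for pair in pairs:
        key, sep, raw_value = pair.partition("=")
        component, dot, option = key.partition(".")
        if not sep or not dot or not component or not option:
            raise ValueError(f"expected component_name.opt=value, got {pair!r}")
        try:
            # Interpret Python literals such as True, False, 4 or 'PAT'.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything that is not a literal is kept as a plain string.
            value = raw_value
        config[component][option] = value
    return dict(config)


parser = argparse.ArgumentParser()
parser.add_argument("-X", "--component-config", action="append",
                    default=[], metavar="OPT")
args = parser.parse_args([
    "-X", "somajo_tokenizer.is_pretokenized=True",
    "-X", "coref_uhhlt.split_method=section",
])
options = parse_component_options(args.component_config)
print(options)
# {'somajo_tokenizer': {'is_pretokenized': True}, 'coref_uhhlt': {'split_method': 'section'}}
```

Using `action="append"` lets `-X` be repeated on the command line, once per option.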
Note: you can specify the resources directory (containing ParZu etc.) with the environment
variable LLPRO_RESOURCES_ROOT, and the temporary workdir with the environment variable LLPRO_TEMPDIR.
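As a rough sketch of how a script could honor these two variables, the snippet below reads them from a mapping and falls back to defaults. The fallback values (`./resources` and the system temporary directory) and the helper name `resolve_llpro_paths` are assumptions for this example, not LLpro's documented defaults.

```python
import os
import tempfile
from pathlib import Path


def resolve_llpro_paths(environ=os.environ):
    """Return (resources_root, temp_dir) based on the documented variables."""
    # The fallbacks below are assumptions for this sketch only.
    resources_root = Path(environ.get("LLPRO_RESOURCES_ROOT", "./resources"))
    temp_dir = Path(environ.get("LLPRO_TEMPDIR", tempfile.gettempdir()))
    return resources_root, temp_dir


# Pass an explicit mapping here so the example does not depend on the shell.
resources_root, temp_dir = resolve_llpro_paths(
    {"LLPRO_RESOURCES_ROOT": "/opt/llpro/resources"}
)
print(resources_root)
print(temp_dir)
```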
Component options
Several components can be configured with the -X key. Notably:
- `-X somajo_tokenizer.is_pretokenized=True` skips tokenization, and assumes tokens separated by whitespace.
- `-X somajo_tokenizer.is_presentencized=True` skips sentence splitting, and assumes sentences separated by newlines.
- `-X somajo_tokenizer.normalize_tokens=False` does not normalize tokens. Incompatible with `is_pretokenized=False`.
- `-X somajo_tokenizer.paragraph_separator='PAT'` sets the paragraph separator pattern. Input text is split into paragraphs at pattern occurrences, and sentences always terminate on paragraph boundaries. Like Python's `re.split`, if capturing parentheses are used in PAT, then the text of each group in the pattern is also returned as a paragraph. Performed before tokenization/sentence splitting.
- `-X somajo_tokenizer.section_pattern='PAT'` sets the sectioning paragraph pattern. Paragraphs fully matching the pattern are removed, except any group captured by parentheses used in PAT. Performed before tokenization/sentence splitting.
- `-X coref_uhhlt.split_method='section'` performs coreference only on section-level.
- `-X <component_name>.disable=True` disables the specific component.
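The effect of capturing parentheses in the paragraph separator and section patterns can be seen with plain `re` calls. The sketch below only mirrors the behavior described above; the sample text, the patterns, and the helper `apply_section_pattern` are made up for illustration and are not taken from LLpro's code.

```python
import re

text = "Es war einmal.\n***\nUnd dann?"

# Without a capturing group, the separator itself is dropped.
print(re.split(r"\n\*\*\*\n", text))
# ['Es war einmal.', 'Und dann?']

# With a capturing group, the captured text is returned as its own item.
print(re.split(r"\n(\*\*\*)\n", text))
# ['Es war einmal.', '***', 'Und dann?']


def apply_section_pattern(paragraphs, pattern):
    """Drop paragraphs that fully match `pattern`, keeping captured groups.

    Hypothetical helper illustrating the described semantics.
    """
    kept = []
    for paragraph in paragraphs:
        match = re.fullmatch(pattern, paragraph)
        if match is None:
            # Ordinary paragraph: keep unchanged.
            kept.append(paragraph)
        else:
            # Sectioning paragraph: keep only the captured groups, if any.
            kept.extend(group for group in match.groups() if group)
    return kept


paragraphs = ["Kapitel 1: Ankunft", "Es war einmal.",
              "Kapitel 2: Abschied", "Und dann?"]

print(apply_section_pattern(paragraphs, r"Kapitel \d+: .*"))
# ['Es war einmal.', 'Und dann?']

print(apply_section_pattern(paragraphs, r"Kapitel \d+: (.*)"))
# ['Ankunft', 'Es war einmal.', 'Abschied', 'Und dann?']
```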
Installation
The LLpro pipeline can be run either locally or as a Docker container. Running the pipeline using Docker is strongly recommended.
WINDOWS USERS: For building the Docker image, clone using
```shell
git clone https://github.com/aehrm/LLpro --config core.autocrlf=input
```
to preserve line endings.
Building and running the Docker image
We strongly recommend using Docker to run the pipeline. With the provided Dockerfile, all dependencies and prerequisites are downloaded automatically.
```shell
cd LLpro
docker build --tag cophiwue/llpro .
# or, if you want experimental features enabled
docker build --build-arg LLPRO_EXPERIMENTAL=1 --tag cophiwue/llpro-experimental .
```
After building, the Docker image can be run like this:
```shell
mkdir -p files/in files/out
chmod a+w files/out  # make directory writeable from the Docker container
# copy files into ./files/in to be processed
# instead of --gpus all, you can select a device, e.g., --gpus "device=0"
docker run \
    --rm \
    -e OMP_NUM_THREADS=4 \
    --gpus all \
    --interactive \
    --tty \
    -a stdout \
    -a stderr \
    -v "$(pwd)/files:/files" \
    cophiwue/llpro -v --writefiles /files/out --infiles /files/in
# processed files are located in ./files/out
```
Installing locally
Verify that the following dependencies are installed:
- Python (tested on version 3.7)
- For RNNTagger:
  - CUDA (tested on version 11.4)
- For ParZu:
  - SWI-Prolog >= 5.6
  - SFST >= 1.4
Execute poetry install and ./prepare.sh. The script downloads all remaining prerequisites.
Example usage:
```shell
poetry install
./prepare.sh
# NOTICE: use the prepared poetry venv!
poetry run python ./bin/llpro_cli.py -v --writefiles files/out --infiles files/in
# if desired, run tests
poetry run pytest -vv
```
Developer Guide
See the separate Developer Guide about the implemented Spacy components and how to access the assigned attributes.
See also the separate document about the tabular Output Format for a description of the output format and a reference of the used tagsets.
See the folder ./contrib for scripts to reproduce the fine-tuning of the custom models.
Citing
If you use the LLpro software for academic research, please consider citing the accompanying publication:
Ehrmanntraut, Anton, Leonard Konle, and Fotis Jannidis. 2023. “LLpro: A Literary Language Processing Pipeline for German Narrative Texts.” In Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023), ed. Munir Georges, Aaricia Herygers, Annemarie Friedrich and Benjamin Roth, pp. 28–39. Ingolstadt, Germany: Association for Computational Linguistics. https://aclanthology.org/2023.konvens-main.3/
```bibtex
@inproceedings{ehrmanntraut-etal-2023-llpro,
    title = "{LL}pro: A Literary Language Processing Pipeline for {G}erman Narrative Texts",
    author = "Ehrmanntraut, Anton and
      Konle, Leonard and
      Jannidis, Fotis",
    editor = "Georges, Munir and
      Herygers, Aaricia and
      Friedrich, Annemarie and
      Roth, Benjamin",
    booktitle = "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)",
    date = "2023-09-18",
    address = "Ingolstadt, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.konvens-main.3/",
    pages = "28--39"
}
```
License
In accordance with the license terms of ParZu+Zmorge (GPL v2) and of SoMeWeTa (GPL v3), the LLpro pipeline is licensed under the terms of GPL v3. See LICENSE.
NOTICE: The code of the ParZu parser located in resources/ParZu has been modified to be compatible with LLpro.
See git log -p df1e91a.. -- resources/ParZu for a summary of these changes.
NOTICE: Some subsystems and resources used by the LLpro pipeline have additional license terms:
- RNNTagger: see https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/Tagger-Licence
- SoMeWeTa model german_web_social_media_2020-05-28.model: derived from the TIGER corpus; see https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/license/htmlicense.html
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
References
Akbik, Alan, Duncan Blythe, and Roland Vollgraf. 2018. “Contextual String Embeddings for Sequence Labeling.” In COLING 2018, 27th International Conference on Computational Linguistics, 1638–49.
Brunner, Annelen, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2021. “To BERT or Not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of Four Types of Speech, Thought and Writing Representation.” In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2624:11. CEUR Workshop Proceedings. Zurich, Switzerland. http://ceur-ws.org/Vol-2624/paper5.pdf.
Krug, Markus, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, and Fotis Jannidis. 2017. “Description of a Corpus of Character References in German Novels - DROC [Deutsches ROman Corpus].” https://resolver.sub.uni-goettingen.de/purl?gro-2/108301.
Kurfalı, Murathan, and Mats Wirén. 2021. “Breaking the Narrative: Scene Segmentation Through Sequential Sentence Classification.” In Proceedings of the Shared Task on Scene Segmentation, edited by Albin Zehe, Leonard Konle, Lea Dümpelmann, Evelyn Gius, Svenja Guhr, Andreas Hotho, Fotis Jannidis, et al., 3001:49–53. CEUR Workshop Proceedings. Düsseldorf, Germany. http://ceur-ws.org/Vol-3001/#paper6.
Proisl, Thomas. 2018. “SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 665–70. Miyazaki, Japan: European Language Resources Association ELRA. http://www.lrec-conf.org/proceedings/lrec2018/pdf/49.pdf.
Proisl, Thomas, and Peter Uhrig. 2016. “SoMaJo: State-of-the-Art Tokenization for German Web and Social Media Texts.” In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, 57–62. Berlin, Germany: Association for Computational Linguistics (ACL). http://aclweb.org/anthology/W16-2607.
Schmid, Helmut. 2019. “Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts.” In DATeCH, Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, 133–37. Brussels, Belgium: Association for Computing Machinery. https://www.cis.uni-muenchen.de/~schmid/papers/Datech2019.pdf.
Schröder, Fynn, Hans Ole Hatzel, and Chris Biemann. 2021. “Neural End-to-End Coreference Resolution for German in Different Domains.” In Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), 170–81. Düsseldorf, Germany: KONVENS 2021 Organizers. https://aclanthology.org/2021.konvens-1.15.
Schweter, Stefan, and Alan Akbik. 2021. “FLERT: Document-Level Features for Named Entity Recognition.” arXiv:2011.06993 [Cs], May. http://arxiv.org/abs/2011.06993.
Sennrich, Rico, and Beat Kunz. 2014. “Zmorge: A German Morphological Lexicon Extracted from Wiktionary.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 1063–7. Reykjavik, Iceland: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2014/pdf/116_Paper.pdf.
Sennrich, Rico, G. Schneider, M. Volk, M. Warin, C. Chiarcos, Richard Eckart de Castilho, and Manfred Stede. 2009. “A New Hybrid Dependency Parser for German.” In Proceedings of the GSCL Conference. Potsdam, Germany. https://doi.org/10.5167/UZH-25506.
Sennrich, Rico, Martin Volk, and Gerold Schneider. 2013. “Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-Tagging, and Morphological Analysis.” In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, 601–9. Hissar, Bulgaria: INCOMA Ltd. Shoumen, BULGARIA. https://www.aclweb.org/anthology/R13-1079.
Vauth, Michael, Hans Ole Hatzel, Evelyn Gius, and Chris Biemann. 2021. “Automated Event Annotation in Literary Texts.” In Proceedings of the Conference on Computational Humanities Research 2021, edited by Maud Ehrmann, Folgert Karsdorp, Melvin Wevers, Tara Lee Andrews, Manuel Burghardt, Mike Kestemont, Enrique Manjavacas, Michael Piotrowski, and Joris van Zundert, 2989:333–45. CEUR Workshop Proceedings. Amsterdam, the Netherlands. https://ceur-ws.org/Vol-2989/#short_paper18.
Owner
- Name: Computerphilologie Uni Würzburg
- Login: cophi-wue
- Kind: organization
- Location: Würzburg
- Website: http://www.germanistik.uni-wuerzburg.de/lehrstuehle/computerphilologie/startseite/
- Repositories: 14
- Profile: https://github.com/cophi-wue
Lehrstuhl für Computerphilologie, Julius-Maximilians-Universität Würzburg
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Ehrmanntraut"
    given-names: "Anton"
  - family-names: "Konle"
    given-names: "Leonard"
  - family-names: "Jannidis"
    given-names: "Fotis"
title: "LLpro: A Literary Language Processing Pipeline for German Narrative Texts"
date-released: "2022-09-05"
license: GPL-3.0-or-later
url: "https://github.com/cophi-wue/LLpro"
version: 0.1.0
preferred-citation:
  type: conference-paper
  authors:
    - family-names: "Ehrmanntraut"
      given-names: "Anton"
    - family-names: "Konle"
      given-names: "Leonard"
    - family-names: "Jannidis"
      given-names: "Fotis"
  title: "LLpro: A Literary Language Processing Pipeline for German Narrative Texts"
  collection-title: "Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)"
  year: 2023
GitHub Events
Total
- Issues event: 2
- Watch event: 1
- Issue comment event: 1
- Push event: 10
- Pull request event: 2
- Fork event: 1
Last Year
- Issues event: 2
- Watch event: 1
- Issue comment event: 1
- Push event: 10
- Pull request event: 2
- Fork event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 1
- Total pull requests: 1
- Average time to close issues: 25 days
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: 25 days
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- thvitt (1)
Pull Request Authors
- thvitt (1)
Dependencies
- ubuntu 16.04 build
- pytorch/torchserve ${TORCHSERVE_TAG} build
- 121 dependencies
- nvidia/cuda 11.3.1-devel-ubuntu20.04 build
- graphviz *
- numpy *
- pyhocon *
- scikit-learn ==0.22.1
- tensorboard *
- tqdm ==4.56.0
- transformers ==4.2.1
- graphviz *
- numpy *
- pyhocon *
- scikit-learn ==0.22.1
- tensorboard *
- tqdm ==4.56.0
- transformers ==4.9.1
- 128 dependencies
- datasets ^2.12.0 develop
- de-dep-news-trf * develop
- pytest >=5.2 develop
- Cython ~=0.29
- SoMaJo ^2.2
- SoMeWeTa ~=1.8.1
- cython ~=0.29
- dill 0.3.6
- flair ^0.12.2
- more-itertools ^9.0.0
- multiprocessing_on_dill ^3.5.0-alpha.4
- omegaconf ^2.3.0
- overrides ^7.3.1
- pandas 1.3
- pexpect ^4.8.0
- pyhocon ^0.3.59
- python >=3.7.1,<3.11
- pytorch-transformers ^1.2.0
- regex ^2022.10.31
- spacy ~=3.5
- spacy-transformers ^1.2.1
- torch >=1.11