nlp-pipeline

A German NLP Pipeline for Lexicographic Use Cases

https://github.com/zentrum-lexikographie/nlp-pipeline

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.8%) to scientific vocabulary

Keywords

collocation-extraction german lemmatization morphological-analysis nlp sentence-scoring spacy
Last synced: 4 months ago

Repository

A German NLP Pipeline for Lexicographic Use Cases

Basic Info
  • Host: GitHub
  • Owner: zentrum-lexikographie
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 11.8 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 3
Topics
collocation-extraction german lemmatization morphological-analysis nlp sentence-scoring spacy
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation Zenodo

README.md

A German NLP Pipeline for Lexicographic Use Cases

Combining Off-the-Shelf and Custom Components

Installation

pip install -U pip setuptools
pip install git+https://github.com/zentrum-lexikographie/nlp-pipeline@v1.0.1
zdl-nlp-install-models

Add -f if you would like to install CPU-optimized models, and -d if you have access to the DWDS edition of DWDSmor. In the latter case, log in to your Hugging Face account beforehand, i.e.:

$ huggingface-cli login
[…]
$ zdl-nlp-install-models -d -f
2025-06-27 12:58:02,352 – INFO – Installed spaCy model (dist)
2025-06-27 12:58:11,420 – INFO – Installed spaCy model (lg)
2025-06-27 12:58:11,963 – INFO – Installed DWDSmor lemmatizer (open)
2025-06-27 12:58:12,309 – INFO – Installed DWDSmor lemmatizer (dwds)

Usage Example

Annotate a random sentence from a corpus of political speeches:

$ zdl-nlp-polspeech -s 0.01 -l 1 | zdl-nlp-annotate
# newdoc id = http://www.auswaertiges-amt.de/DE/Infoservice/Presse/Reden/2010/101014-Pieper-Dokkyo-Universität.html
# bibl = Cornelia Pieper. Rede Staatsministerin Pieper: "150 Jahre Wissenschaftsbeziehungen Deutschland-Japan – ein Schatz für die Zukunft". 2010-10-14. o.O.
# date = 2010-10-14
# entities = [["ORG", 13, 14]]
# gdex = 0.884521484375
# lang = de
# collocations = [["ADV", 2, 4], ["PP", 2, 8, 5], ["ATTR", 8, 7], ["SUBJA", 15, 13]]
1   Ich ich PRON    PPER    Case=Nom|Number=Sing|Person=1|PronType=Prs  2   nsubj   _   _
2   freue   freuen  VERB    VVFIN   Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin   0   root    _   _
3   mich    ich PRON    PRF Case=Acc|Number=Sing|Person=1|PronType=Prs|Reflex=Yes   2   expl:pv _   _
4   sehr    sehr    ADV ADV Degree=Pos  2   advmod  _   _
5   über   über   ADP APPR    AdpType=Prep|Case=Acc   8   case    _   _
6   den die DET ART Case=Acc|Definite=Def|Gender=Masc|Number=Sing|PronType=Art  8   det _   _
7   warmherzigen    warmherzig  ADJ ADJA    Case=Acc|Degree=Pos|Gender=Masc|Number=Sing 8   amod    _   _
8   Empfang Empfang NOUN    NN  Gender=Masc|Number=Sing 2   obl _   SpaceAfter=No
9   ,   ,   PUNCT   $,  PunctType=Comm  15  punct   _   _
10  den die PRON    PRELS   Case=Acc|Gender=Masc|Number=Sing|PronType=Dem,Rel   15  obj _   _
11  mir ich PRON    PPER    Case=Dat|Number=Sing|Person=1|PronType=Prs  15  obl:arg _   _
12  die die DET ART Case=Nom|Definite=Def|Gender=Fem|Number=Sing|PronType=Art   13  det _   _
13  Dokkyo  Dokkyo  NOUN    NN  _   15  nsubj   _   _
14  Universität    Universität    NOUN    NN  Gender=Fem|Number=Sing  13  appos   _   _
15  bereitet    bereiten    VERB    VVFIN   Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   8   acl _   SpaceAfter=No
16  .   .   PUNCT   $.  PunctType=Peri  2   punct   _   _
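
The annotated output above follows the CoNLL-U layout: comment lines of the form `# key = value`, followed by one token per line with ten tab-separated columns. The following is a minimal sketch of how such output could be post-processed with the Python standard library. It is not part of the pipeline; the function names `parse_conllu_sentence` and `resolve_collocations` are hypothetical, the sample is shortened to the first eight tokens with the morphological features replaced by `_`, and the reading of the `# collocations` value as a JSON list of `[label, token_id, ...]` entries (token IDs referring to the first column) is an assumption inferred from the example, not documented behavior.

```python
import json

# Shortened sample based on the output above (features elided as "_").
ROWS = [
    ("1", "Ich", "ich", "PRON", "PPER", "2", "nsubj"),
    ("2", "freue", "freuen", "VERB", "VVFIN", "0", "root"),
    ("3", "mich", "ich", "PRON", "PRF", "2", "expl:pv"),
    ("4", "sehr", "sehr", "ADV", "ADV", "2", "advmod"),
    ("5", "über", "über", "ADP", "APPR", "8", "case"),
    ("6", "den", "die", "DET", "ART", "8", "det"),
    ("7", "warmherzigen", "warmherzig", "ADJ", "ADJA", "8", "amod"),
    ("8", "Empfang", "Empfang", "NOUN", "NN", "2", "obl"),
]
SAMPLE = "\n".join(
    [
        "# gdex = 0.884521484375",
        '# collocations = [["ADV", 2, 4], ["PP", 2, 8, 5], ["ATTR", 8, 7]]',
    ]
    + [
        "\t".join([i, form, lemma, upos, xpos, "_", head, deprel, "_", "_"])
        for i, form, lemma, upos, xpos, head, deprel in ROWS
    ]
)


def parse_conllu_sentence(text):
    """Split one sentence block into comment metadata and tokens keyed by ID."""
    meta, tokens = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Comment lines look like "# key = value"; split at the first "=".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and malformed lines
        tokens[int(cols[0])] = {
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        }
    return meta, tokens


def resolve_collocations(meta, tokens):
    """Map each collocation entry to (label, lemmas); the layout is an assumption."""
    entries = json.loads(meta.get("collocations", "[]"))
    return [(label, [tokens[i]["lemma"] for i in ids]) for label, *ids in entries]


meta, tokens = parse_conllu_sentence(SAMPLE)
for label, lemmas in resolve_collocations(meta, tokens):
    print(label, " + ".join(lemmas))
```

Under the same assumption, the `# gdex` value can be read with `float(meta["gdex"])`, for example to keep only sentences above a chosen score threshold.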

Development Setup

pip install -U pip pip-tools setuptools
pip install -e .[dev]

Analyze TEI schema (element classes)

(cd tei-schema && clojure -X:extract) >zdl_nlp/tei_schema.json

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Owner

  • Name: Zentrum für digitale Lexikographie der deutschen Sprache
  • Login: zentrum-lexikographie
  • Kind: organization

Citation (CITATION.cff)

cff-version: "1.2.0"
title: "A German NLP Pipeline for Lexicographic Use Cases"
version: 1.0.1
license: "GPL-3.0"
type: software
abstract: "A pipeline adding linguistic annotations to texts, combining off-the-shelf as well as custom components and tailored to lexicographic use cases."
message: "If you use this software, please cite it as below."
authors:
  - given-names: Luise
    family-names: Köhler
    affiliation: Berlin-Brandenburg Academy of Sciences and Humanities
    orcid: "https://orcid.org/0000-0001-5144-1920"
  - given-names: Gregor
    family-names: Middell
    affiliation: Berlin-Brandenburg Academy of Sciences and Humanities
    orcid: "https://orcid.org/0009-0000-9256-4687"
keywords:
  - collocation extraction
  - computational linguistics
  - german
  - lexicography
  - morphological analysis
  - nlp
  - sentence-scoring
  - spacy

GitHub Events

Total
  • Release event: 1
  • Issue comment event: 2
  • Public event: 1
  • Push event: 28
  • Pull request event: 6
  • Create event: 3
Last Year
  • Release event: 1
  • Issue comment event: 2
  • Public event: 1
  • Push event: 28
  • Pull request event: 6
  • Create event: 3

Issues and Pull Requests

All Time
  • Total issues: 0
  • Total pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 minutes
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • gremid (4)
Top Labels
Issue Labels
Pull Request Labels
  • autorelease: pending (4)