gdex

GDEX – Good Dictionary Examples – Rule-based Sentence Scoring Algorithm

https://github.com/zentrum-lexikographie/gdex

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary

Keywords

german lexicography nlp sentence-scoring
Last synced: 6 months ago

Repository

GDEX – Good Dictionary Examples – Rule-based Sentence Scoring Algorithm

Basic Info
  • Host: GitHub
  • Owner: zentrum-lexikographie
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 712 KB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 12
Topics
german lexicography nlp sentence-scoring
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License Citation Zenodo

README.md

GDEX

Good Dictionary Examples – Rule-based Sentence Scoring Algorithm

This Python package provides a GDEX-based algorithm for evaluating how suitable sentences are as good examples in dictionaries. It assigns a numeric score between zero and one to sentences that have been preprocessed with the NLP tool spaCy. The score is computed from several configurable criteria of two kinds: knock-out criteria, which have to be fulfilled for a sentence to reach a score above 0.5, and gradual criteria, which factor into the score by degree.
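The interplay of the two kinds of criteria can be pictured with a minimal sketch. It is not the package's actual formula: the function name `combine_score`, the choice of returning 0.0 when a knock-out criterion fails, and the use of a plain mean over the gradual factors are all assumptions made for illustration. The only property taken from the description above is that a score above 0.5 requires the knock-out criteria to be fulfilled.

```python
from collections.abc import Sequence


def combine_score(knockouts_passed: bool, gradual_factors: Sequence[float]) -> float:
    """Hypothetical combination of knock-out and gradual criteria.

    Assumption: every gradual factor lies in [0, 1], where 1 means
    "no penalty". The real gdex weighting may differ.
    """
    if not knockouts_passed:
        # Assumed behaviour: a failed knock-out criterion yields the minimum score.
        return 0.0
    if not gradual_factors:
        # No gradual criteria configured: nothing to penalize.
        return 1.0
    mean = sum(gradual_factors) / len(gradual_factors)
    # Passing the knock-outs secures 0.5; gradual criteria fill the rest.
    return 0.5 + 0.5 * mean
```

With this sketch, a sentence that fails a knock-out criterion scores 0.0, and one that passes scores between 0.5 and 1.0 depending on the gradual factors.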

Among the knock-out criteria are

  • a character set free of invalid characters (i.e. control characters),
  • a properly parsed sentence that ends with punctuation, and
  • a finite verb and a subject that are annotated and related to each other in the sentence's dependency parse tree.
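The three knock-out checks listed above could be approximated as follows. This is a sketch under stated assumptions, not the package's implementation: the `Token` dataclass is a hypothetical stand-in for a parsed spaCy token, the subject labels `"sb"` and `"nsubj"` are assumed, and "invalid characters" is narrowed to the Unicode control-character category `Cc`.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical, simplified stand-in for a parsed token."""
    text: str
    pos: str              # coarse part of speech, e.g. "VERB", "AUX", "NOUN"
    dep: str              # dependency label, e.g. "sb" or "nsubj" for subjects
    head: int             # index of the syntactic head within the sentence
    finite: bool = False  # True if the token is a finite verb form


def has_invalid_chars(text: str) -> bool:
    # Unicode general category "Cc" covers control characters.
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def ends_with_punctuation(text: str) -> bool:
    stripped = text.rstrip()
    # All Unicode punctuation categories start with "P".
    return bool(stripped) and unicodedata.category(stripped[-1]).startswith("P")


def has_finite_verb_with_subject(tokens, subject_labels=("sb", "nsubj")) -> bool:
    # A subject must be attached directly to a finite verb or auxiliary.
    for tok in tokens:
        if tok.dep in subject_labels:
            head = tokens[tok.head]
            if head.finite and head.pos in ("VERB", "AUX"):
                return True
    return False


def passes_knockouts(text: str, tokens) -> bool:
    return (
        not has_invalid_chars(text)
        and ends_with_punctuation(text)
        and has_finite_verb_with_subject(tokens)
    )
```

Under these assumptions, a verbless utterance such as "Achtung!" fails the third check, while "Das ist ein toller Test." passes all three once its subject is attached to the finite verb.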

Among the gradual criteria are

  • the absence of blacklisted words (i.e. vulgar or obscene ones),
  • the absence of rare characters or those normally not available on a keyboard,
  • the absence of named entities,
  • the absence of deictic expressions,
  • an optimal length of the sentence,
  • a whitelist-based coverage test, e.g. for penalizing the use of rare lemmata, and
  • the absence of subordinate clauses, or the headword being part of a main clause.
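To illustrate what "gradual" means here, the sketch below turns three of the listed criteria into factors between 0 and 1. Everything in it is hypothetical: the function names, the optimal length range of 10 to 20 tokens, the hard maximum of 60 tokens, and the linear penalties are illustrative choices, not values taken from the package.

```python
def length_factor(n_tokens: int, optimal=(10, 20), hard_max: int = 60) -> float:
    """Return 1.0 inside the optimal range and fall off linearly outside it.

    The range and the hard maximum are assumed values for illustration.
    """
    low, high = optimal
    if low <= n_tokens <= high:
        return 1.0
    if n_tokens < low:
        return n_tokens / low
    if n_tokens >= hard_max:
        return 0.0
    return (hard_max - n_tokens) / (hard_max - high)


def whitelist_coverage(lemmas, whitelist) -> float:
    """Share of lemmas found in a whitelist of lowercased common lemmas."""
    if not lemmas:
        return 0.0
    hits = sum(1 for lemma in lemmas if lemma.lower() in whitelist)
    return hits / len(lemmas)


def blacklist_factor(lemmas, blacklist) -> float:
    """Return 0.0 if any lemma is in the lowercased blacklist, else 1.0."""
    return 0.0 if any(lemma.lower() in blacklist for lemma in lemmas) else 1.0
```

Each function yields 1.0 when a sentence is unproblematic for that criterion and a smaller value otherwise, so the factors can be averaged or multiplied into an overall penalty.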

Installation

gdex can be installed as a package from its GitHub source repository:

```sh
pip install git+https://github.com/zentrum-lexikographie/gdex.git@v1.5.1
```

For development, clone it from GitHub and install it locally, including optional dependencies:

```sh
pip install -e .[dev]
```

Usage

```python-console
>>> import zdlspacy
>>> import gdex
>>> nlp = zdlspacy.load()
>>> [s._.gdex for s in gdex.de_hdt(nlp("Achtung! Das ist ein toller Test.")).sents]
[0.0, 0.5968749999999999]
```

Testing

Run tests, including calculation of code coverage:

```sh
coverage run -m pytest
```

Acknowledgements

This package was initially developed as part of the EVIDENCE project and funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, GU 798/27-1; GE 1119/11-1). Between August 2023 and October 2024, it was maintained by Ulf Hamster.

This implementation makes use of VulGer, a lexicon covering words from the lower end of the German language register — terms typically considered rough, vulgar, or obscene. VulGer is used under the terms of the CC-BY-SA license.

Bibliography

  • Rychlý, Pavel, Miloš Husák, Adam Kilgarriff, Michael Rundell, and Katy McAdam. GDEX: Automatically Finding Good Dictionary Examples in a Corpus. Institut Universitari de Lingüística Aplicada, 2008. https://is.muni.cz/publication/772821/en/GDEX-Automatically-finding-good-dictionary-examples-in-a-corpus/Rychly-Husak-Kilgarriff-Rundell.
  • Didakowski, Jörg, Lothar Lemnitzer, and Alexander Geyken. "Automatic Example Sentence Extraction for a Contemporary German Dictionary", 343–49, 2012. https://euralex.org/publications/automatic-example-sentence-extraction-for-a-contemporary-german-dictionary/.
  • Eder, Elisabeth, Ulrike Krieg-Holz, and Udo Hahn. "At the Lower End of Language—Exploring the Vulgar and Obscene Side of German". In Proceedings of the Third Workshop on Abusive Language Online, edited by Sarah T. Roberts, Joel Tetreault, Vinodkumar Prabhakaran, and Zeerak Waseem, 119–28. Florence, Italy: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/W19-3513.

Owner

  • Name: Zentrum für digitale Lexikographie der deutschen Sprache
  • Login: zentrum-lexikographie
  • Kind: organization

Citation (CITATION.cff)

cff-version: "1.2.0"
title: "Good Dictionary Examples – A Rule-based Sentence Scoring Algorithm Implemented in Python"
version: 1.5.1
license: "Apache-2.0"
type: software
abstract: "Python package providing a GDEX-based algorithm for evaluating sentences with regard to their suitability as good examples in dictionaries"
message: "If you use this software, please cite it as below."
authors:
  - given-names: Ulf
    family-names: Hamster
    affiliation: University of Bremen
    orcid: "https://orcid.org/0000-0002-0440-4868"
  - given-names: Gregor
    family-names: Middell
    affiliation: Berlin-Brandenburg Academy of Sciences and Humanities
    orcid: "https://orcid.org/0009-0000-9256-4687"
  - given-names: Natalie
    family-names: Sürmeli
    affiliation: Berlin-Brandenburg Academy of Sciences and Humanities
    orcid: "https://orcid.org/0009-0001-6896-6165"
keywords:
  - computational linguistics
  - nlp
  - german
  - lexicography
  - sentence-scoring

GitHub Events

Total
  • Create event: 16
  • Issues event: 2
  • Release event: 11
  • Watch event: 4
  • Delete event: 5
  • Issue comment event: 9
  • Push event: 53
  • Pull request review event: 4
  • Pull request review comment event: 2
  • Pull request event: 26
Last Year
  • Create event: 16
  • Issues event: 2
  • Release event: 11
  • Watch event: 4
  • Delete event: 5
  • Issue comment event: 9
  • Push event: 53
  • Pull request review event: 4
  • Pull request review comment event: 2
  • Pull request event: 26

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 11
  • Average time to close issues: 28 days
  • Average time to close pull requests: 6 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 11
  • Average time to close issues: 28 days
  • Average time to close pull requests: 6 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Natalie-T-E (1)
Pull Request Authors
  • gremid (11)
  • Natalie-T-E (3)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
autorelease: pending (9)

Dependencies

.github/workflows/release-please.yml actions
  • googleapis/release-please-action v4 composite
.github/workflows/test.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
pyproject.toml pypi
  • spacy >=3.7