gdex

GDEX – Good Dictionary Examples – Rule-based Sentence Scoring Algorithm

https://github.com/zentrum-lexikographie/gdex

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
✓ DOI references: found 5 DOI reference(s) in README
✓ Academic publication links: links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.1%) to scientific vocabulary

Keywords

german lexicography nlp sentence-scoring

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: zentrum-lexikographie
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 712 KB

Statistics

Stars: 3
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 12

Created over 1 year ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License Citation Zenodo

GDEX

Good Dictionary Examples – Rule-based Sentence Scoring Algorithm

This Python package provides a GDEX-based algorithm for evaluating how suitable sentences are as dictionary examples. It assigns a numeric score between zero and one to sentences that have been preprocessed with the NLP library spaCy. The score is computed from several configurable criteria of two kinds: knock-out criteria, all of which must be fulfilled for a sentence to reach a score above 0.5, and gradual criteria, which factor into the score by degree.

Among the knock-out criteria are:

a character set free of invalid characters (i.e. control characters),
a properly parsed sentence with punctuation at the end, and
a finite verb and a subject, both annotated and related to each other in the sentence's dependency parse tree.
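The first two knock-out checks can be illustrated without any NLP tooling. The following is a minimal sketch, not the package's implementation: the function names and the set of sentence-final punctuation marks are assumptions made for illustration, and the third check (finite verb and subject) is left out because it needs a dependency parse.

```python
import unicodedata

# Assumed set of sentence-final punctuation marks (illustrative only).
SENTENCE_FINAL = (".", "!", "?")


def has_invalid_chars(text: str) -> bool:
    """True if the text contains a Unicode control character (category Cc)."""
    return any(unicodedata.category(ch) == "Cc" for ch in text)


def ends_with_punctuation(text: str) -> bool:
    """True if the text, ignoring trailing whitespace, ends with final punctuation."""
    return text.rstrip().endswith(SENTENCE_FINAL)


def passes_knockout(text: str) -> bool:
    """A sentence passes only if every knock-out check succeeds."""
    return not has_invalid_chars(text) and ends_with_punctuation(text)
```

Because every check must succeed, a single failure is enough to rule a sentence out, regardless of how well it does on the gradual criteria.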

Among the gradual criteria are:

the absence of blacklisted words (i.e. vulgar or obscene ones),
the absence of rare characters or those not normally available on a keyboard,
the absence of named entities,
the absence of deictic expressions,
an optimal sentence length,
a whitelist-based coverage test that penalizes the use of rare lemmata, and
the absence of subordinate clauses, or the headword being part of a main clause.
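To make the interplay of the two kinds of criteria concrete, here is a hedged sketch of how gradual scores might be combined with a knock-out result. Everything in it is an assumption for illustration: the function names, the length thresholds, the unweighted mean, and the choice to map a failed knock-out check to 0.0. The package's actual formula and weights are configurable and may differ.

```python
def length_score(n_tokens: int, lo: int = 10, hi: int = 20, max_len: int = 40) -> float:
    """1.0 inside the assumed optimal range [lo, hi], falling linearly to 0.0 outside."""
    if lo <= n_tokens <= hi:
        return 1.0
    if n_tokens < lo:
        return n_tokens / lo
    return max(0.0, (max_len - n_tokens) / (max_len - hi))


def blacklist_score(tokens: list[str], blacklist: set[str]) -> float:
    """Share of tokens that are not on the (lower-cased) blacklist."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in blacklist)
    return 1.0 - hits / len(tokens)


def combine(knockout_ok: bool, gradual: list[float]) -> float:
    """Assumed combination: 0.0 on knock-out failure, else 0.5 plus half the mean."""
    if not knockout_ok:
        return 0.0
    mean = sum(gradual) / len(gradual) if gradual else 1.0
    return 0.5 + 0.5 * mean
```

With this scheme, only sentences that pass all knock-out checks can exceed 0.5, and the gradual criteria decide where in the upper half of the range a sentence lands.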

Installation

gdex can be installed as a package from its GitHub source repository:

```sh
pip install git+https://github.com/zentrum-lexikographie/gdex.git@v1.5.1
```

For development, clone it from GitHub and install it locally, including optional dependencies:

```sh
pip install -e .[dev]
```

Usage

```python-console
>>> import zdl_spacy
>>> import gdex
>>> nlp = zdl_spacy.load()
>>> [s._.gdex for s in gdex.de_hdt(nlp("Achtung! Das ist ein toller Test.")).sents]
[0.0, 0.5968749999999999]
```
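Once sentences carry a score, a typical next step is to keep only the candidates above the knock-out threshold and rank them. The sketch below works on plain `(sentence, score)` pairs so that it runs without spaCy; `best_examples` and its parameters are hypothetical helpers, not part of the gdex API.

```python
def best_examples(scored, threshold=0.5, k=3):
    """Return up to k (sentence, score) pairs scoring above threshold, best first."""
    kept = [(sentence, score) for sentence, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


# Scores taken from the usage example above.
scored = [
    ("Achtung!", 0.0),
    ("Das ist ein toller Test.", 0.5968749999999999),
]
print(best_examples(scored))
# [('Das ist ein toller Test.', 0.5968749999999999)]
```

The default threshold of 0.5 mirrors the boundary described above: sentences that fail a knock-out criterion cannot score higher than that.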

Testing

Run tests, including calculation of code coverage:

```sh
coverage run -m pytest
```

Acknowledgements

This package was initially developed as part of the EVIDENCE project and funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation, GU 798/27-1; GE 1119/11-1). Between August 2023 and October 2024, it was maintained by Ulf Hamster.

This implementation makes use of VulGer, a lexicon covering words from the lower end of the German language register, i.e. terms typically considered rough, vulgar, or obscene. VulGer is used under the terms of the CC-BY-SA license.

Bibliography

Rychlý, Pavel, Miloš Husák, Adam Kilgarriff, Michael Rundell, and Katy McAdam. GDEX: Automatically Finding Good Dictionary Examples in a Corpus. Institut Universitari de Lingüística Aplicada, 2008. https://is.muni.cz/publication/772821/en/GDEX-Automatically-finding-good-dictionary-examples-in-a-corpus/Rychly-Husak-Kilgarriff-Rundell.
Didakowski, Jörg, Lothar Lemnitzer, and Alexander Geyken. "Automatic Example Sentence Extraction for a Contemporary German Dictionary", 343–49, 2012. https://euralex.org/publications/automatic-example-sentence-extraction-for-a-contemporary-german-dictionary/.
Eder, Elisabeth, Ulrike Krieg-Holz, and Udo Hahn. "At the Lower End of Language—Exploring the Vulgar and Obscene Side of German". In Proceedings of the Third Workshop on Abusive Language Online, edited by Sarah T. Roberts, Joel Tetreault, Vinodkumar Prabhakaran, and Zeerak Waseem, 119–28. Florence, Italy: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/W19-3513.

Owner

Name: Zentrum für digitale Lexikographie der deutschen Sprache
Login: zentrum-lexikographie
Kind: organization

Repositories: 1
Profile: https://github.com/zentrum-lexikographie

Citation (CITATION.cff)

cff-version: "1.2.0"
title: "Good Dictionary Examples – A Rule-based Sentence Scoring Algorithm Implemented in Python"
version: 1.5.1
license: "Apache-2.0"
type: software
abstract: "Python package providing a GDEX-based algorithm for evaluating sentences with regard to their suitability as good examples in dictionaries"
message: "If you use this software, please cite it as below."
authors:
  - given-names: Ulf
    family-names: Hamster
    affiliation: University of Bremen
    orcid: "https://orcid.org/0000-0002-0440-4868"
  - given-names: Gregor
    family-names: Middell
    affiliation: Berlin-Brandenburg Academy of Sciences and Humanities
    orcid: "https://orcid.org/0009-0000-9256-4687"
  - given-names: Natalie
    family-names: Sürmeli
    affiliation: Berlin-Brandenburg Academy of Sciences and Humanities
    orcid: "https://orcid.org/0009-0001-6896-6165"
keywords:
  - computational linguistics
  - nlp
  - german
  - lexicography
  - sentence-scoring

GitHub Events

Total

Create event: 16
Issues event: 2
Release event: 11
Watch event: 4
Delete event: 5
Issue comment event: 9
Push event: 53
Pull request review event: 4
Pull request review comment event: 2
Pull request event: 26

Last Year

Create event: 16
Issues event: 2
Release event: 11
Watch event: 4
Delete event: 5
Issue comment event: 9
Push event: 53
Pull request review event: 4
Pull request review comment event: 2
Pull request event: 26

Issues and Pull Requests

All Time

Total issues: 1
Total pull requests: 11
Average time to close issues: 28 days
Average time to close pull requests: 6 days
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 11
Average time to close issues: 28 days
Average time to close pull requests: 6 days
Issue authors: 1
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Natalie-T-E (1)

Pull Request Authors

gremid (11)
Natalie-T-E (3)

Top Labels

Issue Labels

enhancement (1)

Pull Request Labels

autorelease: pending (9)

Dependencies

.github/workflows/release-please.yml actions

googleapis/release-please-action v4 composite

.github/workflows/test.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

pyproject.toml pypi

spacy >=3.7
