simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Keywords

corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist

Last synced: 11 months ago · JSON representation ·

Repository

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Basic Info

Host: GitHub
Owner: adbar
License: mit
Language: Python
Default Branch: main
Homepage: https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html
Size: 729 MB

Statistics

Stars: 166
Watchers: 5
Forks: 14
Open Issues: 13
Releases: 18

Topics

corpus-tools language-detection language-identification lemmatiser lemmatization lemmatizer low-resource-nlp morphological-analysis nlp tokenization tokenizer wordlist

Created over 5 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Contributing License Citation Support

Simplemma: a simple multilingual lemmatizer for Python

Purpose

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word\'s lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often indirectly tackled by more complex systems encompassing a whole processing pipeline. However, it appears that there is no straightforward way to address lemmatization in Python although this task can be crucial in fields such as information retrieval and NLP.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it does not need morphosyntactic information and can process a raw series of tokens or even a text with its built-in tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, in low-resource contexts, for educational purposes, or as a baseline system for lemmatization and morphological analysis.

Currently, 49 languages are partly or fully supported (see table below).

Installation

The current library is written in pure Python with no dependencies: pip install simplemma

pip3 where applicable
pip install -U simplemma for updates
pip install git+https://github.com/adbar/simplemma for the cutting-edge version

The last version supporting Python 3.6 and 3.7 is simplemma==1.0.0.

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying the data on a list of words.

``` python

import simplemma

get a word

myword = 'masks'

decide which language to use and apply it on a word form

simplemma.lemmatize(myword, lang='en') 'mask'

grab a list of tokens

mytokens = ['Hier', 'sind', 'Vaccines'] for token in mytokens: simplemma.lemmatize(token, lang='de') 'hier' 'sein' 'Vaccines'

list comprehensions can be faster

[simplemma.lemmatize(t, lang='de') for t in mytokens] ['hier', 'sein', 'Vaccines'] ```

Chaining languages

Chaining several languages can improve coverage, they are used in sequence:

``` python

from simplemma import lemmatize lemmatize('Vaccines', lang=('de', 'en')) 'vaccine' lemmatize('spaghettis', lang='it') 'spaghettis' lemmatize('spaghettis', lang=('it', 'fr')) 'spaghetti' lemmatize('spaghetti', lang=('it', 'fr')) 'spaghetto' ```

Greedier decomposition

For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the greedy parameter to True.

This option also triggers a stronger reduction through an additional iteration of the search algorithm, e.g. \"angekündigten\" → \"angekündigt\" (standard) → \"ankündigen\" (greedy). In some cases it may be closer to stemming than to lemmatization.

``` python

same example as before, comes to this result in one step

simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True) 'spaghetto'

German case described above

simplemma.lemmatize('angekündigten', lang='de', greedy=True) 'ankündigen' # 2 steps: reduction to infinitive verb simplemma.lemmatize('angekündigten', lang='de', greedy=False) 'angekündigt' # 1 step: reduction to past participle ```

is_known()

The additional function is_known() checks if a given word is present in the language data:

``` python

from simplemma import isknown isknown('spaghetti', lang='it') True ```

Tokenization

A simple tokenization function is provided for convenience:

``` python

from simplemma import simpletokenizer simpletokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.') ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']

use iterator instead

simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True) ```

The functions text_lemmatizer() and lemma_iterator() chain tokenization and lemmatization. They can take greedy (affecting lemmatization) and silent (affecting errors and logging) as arguments:

``` python

from simplemma import textlemmatizer sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.' textlemmatizer(sentence, lang='pt')

caveat: desejo is also a noun, should be desejar here

['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']

same principle, returns a generator and not a list

from simplemma import lemmaiterator lemmaiterator(sentence, lang='pt') ```

Caveats

``` python

don't expect too much though

this diminutive form isn't in the model data

simplemma.lemmatize('spaghettini', lang='it') 'spaghettini' # should read 'spaghettino'

the algorithm cannot choose between valid alternatives yet

simplemma.lemmatize('son', lang='es') 'son' # valid common name, but what about the verb form? ```

As the focus lies on overall coverage, some short frequent words (typically: pronouns and conjunctions) may need post-processing, this generally concerns a few dozens of tokens per language.

The current absence of morphosyntactic information is an advantage in terms of simplicity. However, it is also an impassable frontier regarding lemmatization accuracy, for example when it comes to disambiguating between past participles and adjectives derived from verbs in Germanic and Romance languages. In most cases, simplemma often does not change such input words.

The greedy algorithm seldom produces invalid forms. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages, where it can also act as a linguistically motivated stemmer.

Bug reports over the issues page are welcome.

Language detection

Language detection works by providing a text and tuple lang consisting of a series of languages of interest. Scores between 0 and 1 are returned.

The lang_detector() function returns a list of language codes along with their corresponding scores, appending \"unk\" for unknown or out-of-vocabulary words. The latter can also be calculated by using the function in_target_language() which returns a ratio.

``` python

import necessary functions

from simplemma import intargetlanguage, lang_detector

language detection

lang_detector('"Exoplaneta, též extrasolární planeta, je planeta obíhající kolem jiné hvězdy než kolem Slunce."', lang=("cs", "sk")) [("cs", 0.75), ("sk", 0.125), ("unk", 0.25)]

proportion of known words

intargetlanguage("opera post physica posita (τὰ μετὰ τὰ φυσικά)", lang="la") 0.5 ```

The greedy argument (extensive in past software versions) triggers use of the greedier decomposition algorithm described above, thus extending word coverage and recall of detection at the potential cost of a lesser accuracy.

Advanced usage via classes

The functions described above are suitable for simple usage, but you can have more control by instantiating Simplemma classes and calling their methods instead. Lemmatization is handled by the Lemmatizer class, while language detection is handled by the LanguageDetector class. These in turn rely on different lemmatization strategies, which are implementations of the LemmatizationStrategy protocol. The DefaultStrategy implementation uses a combination of different strategies, one of which is DictionaryLookupStrategy. It looks up tokens in a dictionary created by a DictionaryFactory.

For example, it is possible to conserve RAM by limiting the number of cached language dictionaries (default: 8) by creating a custom DefaultDictionaryFactory with a specific cache_max_size setting, creating a DefaultStrategy using that factory, and then creating a Lemmatizer and/or a LanguageDetector using that strategy:

``` python

import necessary classes

from simplemma import LanguageDetector, Lemmatizer from simplemma.strategies import DefaultStrategy from simplemma.strategies.dictionaries import DefaultDictionaryFactory

LANGCACHESIZE = 5 # How many language dictionaries to keep in memory at once (max)

dictionaryfactory = DefaultDictionaryFactory(cachemaxsize=LANGCACHESIZE) lemmatizationstrategy = DefaultStrategy(dictionaryfactory=dictionaryfactory)

lemmatize using the above customized strategy

lemmatizer = Lemmatizer(lemmatizationstrategy=lemmatizationstrategy) lemmatizer.lemmatize('doughnuts', lang='en') 'doughnut'

detect languages using the above customized strategy

languagedetector = LanguageDetector('la', lemmatizationstrategy=lemmatizationstrategy) languagedetector.proportionintarget_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)") 0.5 ```

For more information see the extended documentation.

Reducing memory usage

Simplemma provides an alternative solution for situations where low memory usage and fast initialization time are more important than lemmatization and language detection performance. This solution uses a DictionaryFactory that employs a trie as its underlying data structure, rather than a Python dict.

The TrieDictionaryFactory reduces memory usage by an average of 20x and initialization time by 100x, but this comes at the cost of potentially reducing performance by 50% or more, depending on the specific usage.

To use the TrieDictionaryFactory you have to install Simplemma with the marisa-trie extra dependency (available from version 1.1.0):

pip install simplemma[marisa-trie]

Then you have to create a custom strategy using the TrieDictionaryFactory and use that for Lemmatizer and LanguageDetector instances:

``` python

from simplemma import LanguageDetector, Lemmatizer from simplemma.strategies import DefaultStrategy from simplemma.strategies.dictionaries import TrieDictionaryFactory

lemmatizationstrategy = DefaultStrategy(dictionaryfactory=TrieDictionaryFactory())

lemmatizer = Lemmatizer(lemmatizationstrategy=lemmatizationstrategy) lemmatizer.lemmatize('doughnuts', lang='en') 'doughnut'

languagedetector = LanguageDetector('la', lemmatizationstrategy=lemmatizationstrategy) languagedetector.proportionintarget_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)") 0.5 ```

While memory usage and initialization time when using the TrieDictionaryFactory are significantly lower compared to the DefaultDictionaryFactory, that's only true if the trie dictionaries are available on disk. That's not the case when using the TrieDictionaryFactory for the first time, as Simplemma only ships the dictionaries as Python dicts. The trie dictionaries have to be generated once from the Python dicts. That happens on-the-fly when using the TrieDictionaryFactory for the first time for a language and will take a few seconds and use as much memory as loading the Python dicts for the language requires. For further invocations the trie dictionaries get cached on disk.

If the computer supposed to run Simplemma doesn't have enough memory to generate the trie dictionaries, they can also be generated on another computer with the same CPU architecture and copied over to the cache directory.

Supported languages

The following languages are available, identified by their BCP 47 language tag, which typically corresponds to the ISO 639-1 code. If no such code exists, a ISO 639-3 code is used instead.

Available languages (2022-01-20):

| Code | Language | Forms (10³) | Acc. | Comments | | ---- | -------- | ----------- | ---- | -------- | | ast | Asturian | 124 | | bg | Bulgarian | 204 | | ca | Catalan | 579 | | cs | Czech | 187 | 0.89 | on UD CS-PDT | cy | Welsh | 360 | | da | Danish | 554 | 0.92 | on UD DA-DDT, alternative: lemmy | de | German | 675 | 0.95 | on UD DE-GSD, see also German-NLP list | el | Greek | 181 | 0.88 | on UD EL-GDT | en | English | 131 | 0.94 | on UD EN-GUM, alternative: LemmInflect | enm | Middle English | 38 | es | Spanish | 665 | 0.95 | on UD ES-GSD | et | Estonian | 119 | | low coverage | fa | Persian | 12 | | experimental | fi | Finnish | 3,199 | | see this benchmark | fr | French | 217 | 0.94 | on UD FR-GSD | ga | Irish | 372 | gd | Gaelic | 48 | gl | Galician | 384 | gv | Manx | 62 | hbs | Serbo-Croatian | 656 | | Croatian and Serbian lists to be added later | hi | Hindi | 58 | | experimental | hu | Hungarian | 458 | hy | Armenian | 246 | id | Indonesian | 17 | 0.91 | on UD ID-CSUI | is | Icelandic | 174 | it | Italian | 333 | 0.93 | on UD IT-ISDT | ka | Georgian | 65 | la | Latin | 843 | lb | Luxembourgish | 305 | lt | Lithuanian | 247 | lv | Latvian | 164 | mk | Macedonian | 56 | ms | Malay | 14 | nb | Norwegian (Bokmål) | 617 | nl | Dutch | 250 | 0.92 | on UD-NL-Alpino | nn | Norwegian (Nynorsk) | 56 | pl | Polish | 3,211 | 0.91 | on UD-PL-PDB | pt | Portuguese | 924 | 0.92 | on UD-PT-GSD | ro | Romanian | 311 | ru | Russian | 595 | | alternative: pymorphy2 | se | Northern Sámi | 113 | sk | Slovak | 818 | 0.92 | on UD SK-SNK | sl | Slovene | 136 | sq | Albanian | 35 | sv | Swedish | 658 | | alternative: lemmy | sw | Swahili | 10 | | experimental | tl | Tagalog | 32 | | experimental | tr | Turkish | 1,232 | 0.89 | on UD-TR-Boun | uk | Ukrainian | 370 | | alternative: pymorphy2

Languages marked as having low coverage may be better suited to language-specific libraries, but Simplemma can still provide limited functionality. Where possible, open-source Python alternatives are referenced.

Experimental mentions indicate that the language remains untested or that there could be issues with the underlying data or lemmatization process.

The scores are calculated on Universal Dependencies treebanks on single word tokens (including some contractions but not merged prepositions), they describe to what extent simplemma can accurately map tokens to their lemma form. See eval/ folder of the code repository for more information.

This library is particularly relevant as regards the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above. In some languages, a fixed number of words such as pronouns can be further mapped by hand to enhance performance.

Speed

The following orders of magnitude are provided for reference only and were measured on an old laptop to establish a lower bound:

Tokenization: > 1 million tokens/sec
Lemmatization: > 250,000 words/sec

Using the most recent Python version (i.e. with pyenv) can make the package run faster.

Roadmap

[x] Add further lemmatization lists
[ ] Grammatical categories as option
[ ] Function as a meta-package?
[ ] Integrate optional, more complex models?

Credits and licenses

The software is licensed under the MIT license. For information on the licenses of the linguistic information databases, see the licenses folder.

The surface lookups (non-greedy mode) rely on lemmatization lists derived from the following sources, listed in order of relative importance:

Lemmatization lists by Michal Měchura (Open Database License)
Wiktionary entries packaged by the Kaikki project
FreeLing project
spaCy lookups data
Unimorph Project
Wikinflection corpus by Eleni Metheniti (CC BY 4.0 License)

Contributions

This package has been first created and published by Adrien Barbaresi. It has then benefited from extensive refactoring by Juanjo Diaz (especially the new classes). See the full list of contributors to the repository.

Feel free to contribute, notably by filing issues for feedback, bug reports, or links to further lemmatization lists, rules and tests.

Contributions by pull requests ought to follow the following conventions: code style with black, type hinting with mypy, included tests with pytest.

References

To cite this software:

Barbaresi A. (year). Simplemma: a simple multilingual lemmatizer for Python [Computer software] (Version version number). Berlin, Germany: Berlin-Brandenburg Academy of Sciences. Available from https://github.com/adbar/simplemma DOI: 10.5281/zenodo.4673264

This work draws from lexical analysis algorithms used in:

Barbaresi, A., & Hein, K. (2017). Data-driven identification of German phrasal compounds. In International Conference on Text, Speech, and Dialogue Springer, pp. 192-200.
Barbaresi, A. (2016). An unsupervised morphological criterion for discriminating similar languages. In 3rd Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2016), Association for Computational Linguistics, pp. 212-220.
Barbaresi, A. (2016). Bootstrapped OCR error detection for a less-resourced language variant. In 13th Conference on Natural Language Processing (KONVENS 2016), pp. 21-26.

Owner

Name: Adrien Barbaresi
Login: adbar
Kind: user
Location: Berlin
Company: Berlin-Brg. Academy of Sciences (BBAW)

Website: adrien.barbaresi.eu
Twitter: adbarbaresi
Repositories: 37
Profile: https://github.com/adbar

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

Citation (CITATION.cff)

authors:
  - family-names: Barbaresi
    given-names: Adrien
cff-version: 1.2.0
identifiers:
  - description: "This is the collection of archived snapshots of all versions of Simplemma"
    type: doi
    value: 10.5281/zenodo.4673264
message: "If you use this software, please cite it using these metadata."
title: "Simplemma"

GitHub Events

Total

Issues event: 8
Watch event: 24
Delete event: 8
Issue comment event: 20
Push event: 21
Pull request review event: 1
Pull request review comment event: 1
Pull request event: 18
Fork event: 3
Create event: 9

Last Year

Issues event: 8
Watch event: 24
Delete event: 8
Issue comment event: 20
Push event: 21
Pull request review event: 1
Pull request review comment event: 1
Pull request event: 18
Fork event: 3
Create event: 9

Committers

Last synced: about 2 years ago

All Time

Total Commits: 217
Total Committers: 6
Avg Commits per committer: 36.167
Development Distribution Score (DDS): 0.203

Past Year

Commits: 18
Committers: 4
Avg Commits per committer: 4.5
Development Distribution Score (DDS): 0.5

Top Committers

Name	Email	Commits
Adrien Barbaresi	b**i@b**e	173
Juanjo Diaz	j**o@g**m	38
Osma Suominen	o**n@h**i	2
sourcery-ai[bot]	5****]	2
Daniel Roschka	d**n@p**e	1
1over137	2****7	1

Committer Domains (Top 20 + Academic)

phoenitydawn.de: 1 helsinki.fi: 1 bbaw.de: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 70
Total pull requests: 97
Average time to close issues: about 2 months
Average time to close pull requests: 14 days
Total issue authors: 18
Total pull request authors: 8
Average comments per issue: 3.33
Average comments per pull request: 3.62
Merged pull requests: 81
Bot issues: 0
Bot pull requests: 7

Past Year

Issues: 10
Pull requests: 20
Average time to close issues: 12 days
Average time to close pull requests: 2 days
Issue authors: 4
Pull request authors: 5
Average comments per issue: 2.6
Average comments per pull request: 1.4
Merged pull requests: 18
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

adbar (22)
juanjoDiaz (14)
osma (10)
1over137 (4)
GrazingScientist (2)
bartdpt (1)
zeeyado (1)
erip (1)
joprice (1)
BLKSerene (1)
FrogInDizzy (1)
axel584 (1)
aartbastiaan (1)
martaaliu (1)
hivaze (1)

Pull Request Authors

adbar (53)
juanjoDiaz (52)
sourcery-ai[bot] (7)
Dunedan (4)
osma (3)
shrijayan (2)
1over137 (2)
axel584 (1)

Top Labels

Issue Labels

question (21) enhancement (15) bug (11) documentation (7) maintenance (4)

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- pypi 17,913 last-month

Total dependent packages: 6
(may contain duplicates)
Total dependent repositories: 25
(may contain duplicates)
Total versions: 36
Total maintainers: 1

pypi.org: simplemma

A lightweight toolkit for multilingual lemmatization and language detection.

Homepage: https://github.com/adbar/simplemma
Documentation: https://simplemma.readthedocs.io/
License: MIT License
Latest release: 1.1.2
published over 1 year ago

Versions: 18
Dependent Packages: 6
Dependent Repositories: 25
Downloads: 17,913 Last month

Rankings

Dependent packages count: 1.4%

Downloads: 2.9%

Dependent repos count: 2.9%

Average: 5.5%

Stargazers count: 7.1%

Forks count: 13.3%

Maintainers (1)

adbar

Last synced: 11 months ago

proxy.golang.org: github.com/adbar/simplemma

Documentation: https://pkg.go.dev/github.com/adbar/simplemma#section-documentation
License: mit
Latest release: v1.1.2
published over 1 year ago

Versions: 18
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 11 months ago

simplemma

Science Score: 57.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Simplemma: a simple multilingual lemmatizer for Python

Purpose

Installation

Usage

Word-by-word

get a word

decide which language to use and apply it on a word form

grab a list of tokens

list comprehensions can be faster

Chaining languages

Greedier decomposition

same example as before, comes to this result in one step

German case described above

is_known()

Tokenization

use iterator instead

caveat: desejo is also a noun, should be desejar here

same principle, returns a generator and not a list

Caveats

don't expect too much though

this diminutive form isn't in the model data

the algorithm cannot choose between valid alternatives yet

Language detection

import necessary functions

language detection

proportion of known words

Advanced usage via classes

import necessary classes

lemmatize using the above customized strategy

detect languages using the above customized strategy

Reducing memory usage

Supported languages

Speed

Roadmap

Credits and licenses

Contributions

Other solutions

References

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: simplemma

Rankings

Maintainers (1)

proxy.golang.org: github.com/adbar/simplemma

Rankings

Dependencies