UralicNLP
UralicNLP: An NLP Library for Uralic Languages - Published in JOSS (2019)
Repository
An NLP library for Uralic languages such as Finnish, Skolt Sami, Moksha and so on. Also supporting some non-Uralic languages such as Spanish, French, Arabic, Swedish, Norwegian, Russian and English. LLMs, FSTs and More!
Basic Info
- Host: GitHub
- Owner: mikahama
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: http://uralicnlp.com/
- Size: 437 KB
Statistics
- Stars: 82
- Watchers: 5
- Forks: 7
- Open Issues: 0
- Releases: 14
Topics
Metadata Files
README.md
UralicNLP
Natural language processing for many languages
UralicNLP can produce morphological analyses, generate morphological forms, lemmatize words and give lexical information about words in Uralic and other languages. Supported languages include Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow & Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami. Currently, UralicNLP uses stable builds for the supported languages.
See the catalog of supported languages
Some of the supported languages: 🇸🇦 🇪🇸 🇮🇹 🇵🇹 🇩🇪 🇫🇷 🇳🇱 🇬🇧 🇷🇺 🇫🇮 🇸🇪 🇳🇴 🇩🇰 🇱🇻 🇪🇪
Check out UralicGUI - a graphical user interface for UralicNLP.
☕ Check out UralicNLP official Java version
♯ Check out UralicNLP official C# version
Installation
The library can be installed from PyPI.
pip install uralicNLP
If you want to use the Constraint Grammar features (from uralicNLP.cg3 import Cg3), you will also need to install VISL CG-3.
Large language models (LLMs)
UralicNLP supports a wide range of LLMs, and it can even embed text in some endangered languages. Check out LLMs.
UralicNLP can cluster texts into semantically similar categories. Learn more about clustering.
List supported languages
The API is under constant development, and new languages will be added to the nightly builds system. For this reason, UralicNLP provides a function for looking up the list of currently supported languages. It returns three-letter ISO codes for the languages.
from uralicNLP import uralicApi
uralicApi.supported_languages()
>>{'cg': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'ron', 'olo', 'bxr', 'hun', 'crk', 'chr', 'vep', 'deu', 'mrj', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'tat', 'smj'], 'dictionary': ['vot', 'lav', 'rus', 'est', 'nob', 'ron', 'olo', 'hun', 'koi', 'chr', 'deu', 'mrj', 'sjd', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'fkv', 'mhr', 'kpv', 'sme', 'sje', 'hdn', 'fin', 'mns', 'mdf', 'vro', 'udm', 'smj'], 'morph': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'swe', 'ron', 'eng', 'olo', 'bxr', 'hun', 'koi', 'crk', 'chr', 'vep', 'deu', 'mrj', 'ara', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'mhr', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'vro', 'udm', 'tat', 'smj']}
The dictionary key lists the languages that are supported by the lexical lookup, whereas morph lists the languages that have morphological FSTs and cg lists the languages that have a CG.
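The returned dictionary maps each feature to a list of languages, so it is often handy to invert the question and ask which features a given language has. The following is a minimal sketch: `SAMPLE_SUPPORTED` is an abbreviated copy of the output shown above, and `features_for` is a hypothetical helper, not part of UralicNLP. In real use you would pass the return value of `uralicApi.supported_languages()` instead of the sample.

```python
# Abbreviated sample of the output shown above. In real use, replace it with
# the return value of uralicApi.supported_languages().
SAMPLE_SUPPORTED = {
    "cg": ["fin", "myv", "sms"],
    "dictionary": ["fin", "myv", "sms", "udm"],
    "morph": ["fin", "myv", "sms", "udm", "eng", "swe", "ara"],
}


def features_for(language, supported):
    """Return the sorted feature keys whose language list contains `language`.

    Hypothetical helper for illustration; not part of the UralicNLP API.
    """
    return sorted(key for key, languages in supported.items() if language in languages)


print(features_for("fin", SAMPLE_SUPPORTED))  # ['cg', 'dictionary', 'morph']
print(features_for("eng", SAMPLE_SUPPORTED))  # ['morph']
```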
Download models
On the command line:
python -m uralicNLP.download --languages fin eng
From Python code:
from uralicNLP import uralicApi
uralicApi.download("fin")
When models are installed, the generate(), analyze() and lemmatize() methods will automatically use them instead of the server-side API. More information about the models.
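If you script your environment setup, the command-line form above can be assembled programmatically. This is only a sketch: `build_download_command` is a hypothetical helper that builds the same `python -m uralicNLP.download --languages ...` invocation as an argument list, and actually running it assumes UralicNLP is installed and network access is available.

```python
import sys


def build_download_command(languages):
    """Build the argv list for `python -m uralicNLP.download --languages ...`.

    Hypothetical helper for illustration; it only constructs the command.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return [sys.executable, "-m", "uralicNLP.download", "--languages", *languages]


command = build_download_command(["fin", "eng"])
print(command)

# To actually run it (requires uralicNLP and network access):
# import subprocess
# subprocess.run(command, check=True)
```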
Lemmatize words
A word form can be lemmatized with UralicNLP. This does not do any disambiguation but rather returns a list of all the possible lemmas.
from uralicNLP import uralicApi
uralicApi.lemmatize("вирев", "myv")
>>['вирев', 'вирь']
uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
>>['luuta|piiri', 'luu|tapiiri']
An example of lemmatizing the word вирев in Erzya (myv). By default, a descriptive analyzer is used. Use uralicApi.lemmatize("вирев", "myv", descriptive=False) for a non-descriptive analyzer. If word_boundaries is set to True, the lemmatizer will mark word boundaries with a |.
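When `word_boundaries=True` is used, each returned lemma is a single string with `|` between compound parts. A small sketch of turning that into lists of parts follows. `split_compound_lemmas` is a hypothetical helper, and the input is the sample output shown above rather than a live call.

```python
def split_compound_lemmas(lemmas, boundary="|"):
    """Split each lemma string into its compound parts at the boundary marker.

    Hypothetical helper for illustration; not part of the UralicNLP API.
    """
    return [lemma.split(boundary) for lemma in lemmas]


# Sample output of uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
sample_lemmas = ["luuta|piiri", "luu|tapiiri"]

for parts in split_compound_lemmas(sample_lemmas):
    print(parts)
```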
Morphological analysis
Apart from just getting the lemmas, it's also possible to perform a complete morphological analysis.
from uralicNLP import uralicApi
uralicApi.analyze("voita", "fin")
>>[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]
An example of analyzing the word voita in Finnish (fin). The default analyzer is descriptive. To use a normative analyzer instead, use uralicApi.analyze("voita", "fin", descriptive=False).
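Each analysis is returned as a pair of an analysis string and a weight, where the string is the lemma followed by `+`-separated tags. The sketch below groups analyses by lemma. It assumes the lemma is everything before the first `+`, which holds for the sample above but may be too simple for compound analyses; `group_by_lemma` is a hypothetical helper and the input is a shortened copy of the output shown above.

```python
from collections import defaultdict

# Shortened copy of the output of uralicApi.analyze("voita", "fin") shown above.
SAMPLE_ANALYSES = [
    ["voi+N+Sg+Par", 0.0],
    ["voi+N+Pl+Par", 0.0],
    ["voittaa+V+Act+Imprt+Sg2", 0.0],
    ["vuo+N+Pl+Par", 0.0],
]


def group_by_lemma(analyses):
    """Map each lemma to the list of tag sequences found for it.

    Assumption: the lemma is the text before the first '+'.
    """
    grouped = defaultdict(list)
    for analysis, _weight in analyses:
        lemma, _, tags = analysis.partition("+")
        grouped[lemma].append(tags.split("+") if tags else [])
    return dict(grouped)


for lemma, tag_lists in group_by_lemma(SAMPLE_ANALYSES).items():
    print(lemma, tag_lists)
```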
Morphological generation
From a lemma and a morphological analysis, it's possible to generate the desired word form.
from uralicNLP import uralicApi
uralicApi.generate("käsi+N+Sg+Par", "fin")
>>[['kättä', 0.0]]
An example of generating the singular partitive form for the Finnish noun käsi. The result is kättä. The default generator is a regular normative generator. uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True) uses a normative dictionary generator and uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True) a descriptive generator.
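The generation query is just the lemma and tags joined with `+`, so it can be built from separate pieces. The sketch below shows that, plus pulling the surface forms out of a result. Both helpers are hypothetical, and the result is the sample output shown above rather than a live call.

```python
def build_generation_query(lemma, tags):
    """Join a lemma and its tags into a query such as 'käsi+N+Sg+Par'."""
    return "+".join([lemma, *tags])


def surface_forms(result):
    """Extract the word forms from a list of [form, weight] pairs."""
    return [form for form, _weight in result]


query = build_generation_query("käsi", ["N", "Sg", "Par"])
print(query)

# Sample output of uralicApi.generate(query, "fin") as shown above.
sample_result = [["kättä", 0.0]]
print(surface_forms(sample_result))
```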
Morphological segmentation
UralicNLP makes it possible to split a word form into morphemes. (Note: this does not work with all languages)
from uralicNLP import uralicApi
uralicApi.segment("luutapiirinikin", "fin")
>>[['luu', 'tapiiri', 'ni', 'kin'], ['luuta', 'piiri', 'ni', 'kin']]
In the example, the word luutapiirinikin has two possible interpretations, luu|tapiiri and luuta|piiri, and the segmentation is done for both.
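The result is a list of morpheme lists, one per interpretation. A short sketch of formatting them for display follows. `format_segmentation` is a hypothetical helper, and the data is the sample output shown above. In this sample the morphemes concatenate back to the original word, which is not guaranteed for every language or word.

```python
WORD = "luutapiirinikin"

# Sample output of uralicApi.segment(WORD, "fin") as shown above.
SEGMENTATIONS = [
    ["luu", "tapiiri", "ni", "kin"],
    ["luuta", "piiri", "ni", "kin"],
]


def format_segmentation(morphemes, separator="-"):
    """Join morphemes with a visible separator for display."""
    return separator.join(morphemes)


for morphemes in SEGMENTATIONS:
    # True for this sample; other words may not round-trip this way.
    round_trips = "".join(morphemes) == WORD
    print(format_segmentation(morphemes), round_trips)
```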
Disambiguation
This section has been moved to the UralicNLP wiki page on disambiguation.
Dictionaries
Learn more about dictionaries in the wiki page on dictionaries.
Parsing UD CoNLL-U annotated TreeBank data
UralicNLP comes with tools for parsing and searching CoNLL-U formatted data. Please refer to the Wiki for the UD parser documentation.
Other functionalities
Cite
If you use UralicNLP in an academic publication, please cite it as follows:
Hämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), 1345. https://doi.org/10.21105/joss.01345
@article{uralicnlp_2019,
title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},
DOI={10.21105/joss.01345},
journal={Journal of Open Source Software},
author={Mika Hämäläinen},
year={2019},
volume={4},
number={37},
pages={1345}
}
For citing the FSTs and CGs, see uralicApi.model_info(language).
The FST and CG tools and dictionaries come mostly from the GiellaLT repositories and Apertium.
Owner
- Name: Mika Hämäläinen
- Login: mikahama
- Kind: user
- Location: Helsinki
- Company: Fly for Points
- Website: http://mikakalevi.com
- Repositories: 35
- Profile: https://github.com/mikahama
PhD in NLP. Currently working at Metropolia University of Applied Sciences as an AI manager.
JOSS Publication
UralicNLP: An NLP Library for Uralic Languages
Tags
morphology, syntax, semantics, constraint grammar, finite state morphology, Uralic languages, natural language processing, endangered languages
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'UralicNLP: An NLP Library for Uralic Languages'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Mika
    family-names: Hämäläinen
identifiers:
  - type: doi
    value: 10.5281/zenodo.2668061
    description: Zenodo
repository-code: 'https://github.com/mikahama/uralicNLP'
date-released: '2019-05-06'
preferred-citation:
  type: article
  authors:
    - family-names: "Hämäläinen"
      given-names: "Mika"
  doi: "10.21105/joss.01345"
  journal: "Journal of Open Source Software"
  title: "UralicNLP: An NLP Library for Uralic Languages"
  issue: 37
  volume: 4
  year: 2019
GitHub Events
Total
- Release event: 4
- Watch event: 9
- Push event: 23
- Gollum event: 29
- Create event: 4
Last Year
- Release event: 4
- Watch event: 9
- Push event: 23
- Gollum event: 29
- Create event: 4
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Hämäläinen, Mika K | m****n@h****i | 84 |
| Mika Hämäläinen | m****a | 62 |
| Mika | m****a@r****m | 33 |
| Mika | m****a@f****m | 9 |
| Hämäläinen Mika K | m****a@d****i | 3 |
| rueter | r****k@g****m | 1 |
| Sjur Moshagen | s****m@m****m | 1 |
| Khalid Alnajjar | d****a@g****m | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 17
- Total pull requests: 3
- Average time to close issues: 17 days
- Average time to close pull requests: about 13 hours
- Total issue authors: 13
- Total pull request authors: 3
- Average comments per issue: 1.71
- Average comments per pull request: 1.33
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- nikopartanen (3)
- petri (2)
- ogencoglu (2)
- unhammer (1)
- back2analogue (1)
- ulf1 (1)
- kauttoj (1)
- pyup-bot (1)
- yoge1 (1)
- teemukolehmainen-howspace (1)
- doctorcolossus (1)
- robaki (1)
- reynoldsnlp (1)
Pull Request Authors
- snomos (1)
- albbas (1)
- rueter (1)
Packages
- Total packages: 3
- Total downloads: 572 last month (pypi)
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 29
- Total maintainers: 1
proxy.golang.org: github.com/mikahama/uralicnlp
- Documentation: https://pkg.go.dev/github.com/mikahama/uralicnlp#section-documentation
- License: apache-2.0
- Latest release: v1.0.2 (published almost 8 years ago)
proxy.golang.org: github.com/mikahama/uralicNLP
- Documentation: https://pkg.go.dev/github.com/mikahama/uralicNLP#section-documentation
- License: apache-2.0
- Latest release: v1.0.2 (published almost 8 years ago)
pypi.org: uralicnlp
An NLP library for Uralic languages such as Finnish and Sami. Also supports Spanish, Arabic, Russian etc.
- Homepage: https://github.com/mikahama/uralicNLP
- Documentation: https://uralicnlp.readthedocs.io/
- License: Apache-2.0 license
- Latest release: 2.0.3 (published about 1 year ago)
Maintainers (1)
Dependencies
- actions/checkout v2 composite
- github/codeql-action/analyze v1 composite
- github/codeql-action/autobuild v1 composite
- github/codeql-action/init v1 composite
