com.rootroo
Multilingual Natural Language Processing for Java
Statistics
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
README.md
UralicNLP - Multilingual Natural Language Processing for Java
UralicNLP can produce morphological analyses, generate morphological forms, lemmatize words and give lexical information about words in Uralic and other languages. Supported languages include Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow & Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami.
See the list of supported languages
Check out UralicNLP for Python
Installation
UralicNLP is available through Maven. Add the following to your pom.xml:
<dependencies>
    <dependency>
        <groupId>com.rootroo</groupId>
        <artifactId>uralicnlp</artifactId>
        <version>1.0</version>
    </dependency>
</dependencies>
You can also download the JAR file from the GitHub releases page, but then you may need to download UralicNLP's dependencies by hand.
If you want to use the Constraint Grammar features (com.rootroo.uralicnlp.Cg3), you will also need to install VISL CG-3.
Download Models
To use any of the language-specific features, you need to download the models for each language by passing the ISO code of the language to the download method:
import com.rootroo.uralicnlp.UralicApi;
UralicApi api = new UralicApi();
// "fin" is the ISO code for Finnish
api.download("fin");
The models will be downloaded to the .uralicnlp folder in your home directory.
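To check whether any models have been downloaded yet, you can look for that folder directly. The following is a minimal sketch using only the Java standard library; it assumes that "home directory" corresponds to the `user.home` system property, which the README does not confirm, and `ModelDirCheck` is a hypothetical helper rather than part of UralicNLP.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of UralicNLP.
final class ModelDirCheck {

    // Builds <home>/.uralicnlp, assuming the library resolves the home
    // directory the same way the "user.home" system property does.
    static Path modelDir() {
        return Paths.get(System.getProperty("user.home"), ".uralicnlp");
    }

    public static void main(String[] args) {
        Path dir = modelDir();
        if (Files.isDirectory(dir)) {
            System.out.println("Model folder found: " + dir);
        } else {
            System.out.println("No model folder at " + dir + "; call download first.");
        }
    }
}
```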
Tokenization
You can tokenize a text into sentences and words. This method supports abbreviations in languages that have appeared in a Universal Dependencies Treebank.
import com.rootroo.uralicnlp.Tokenizer;
Tokenizer tokenizer = new Tokenizer();
String sentence = "Mr. Burns talks with Dr. Hibbert. But why?";
System.out.println(tokenizer.tokenize(sentence));
>>[[Mr., Burns, talks, with, Dr., Hibbert, .], [But, why, ?]]
The output is a List of tokenized sentences that are Lists of strings, where each string represents a tokenized word.
It is also possible to tokenize text only on a sentence level:
import com.rootroo.uralicnlp.Tokenizer;
Tokenizer tokenizer = new Tokenizer();
String sentence = "Mr. Burns talks with Dr. Hibbert. But why?";
System.out.println(tokenizer.sentences(sentence));
>>[Mr. Burns talks with Dr. Hibbert., But why?]
Or on a word level:
import com.rootroo.uralicnlp.Tokenizer;
Tokenizer tokenizer = new Tokenizer();
String sentence = "Mr. Burns talks with Dr. Hibbert. But why?";
System.out.println(tokenizer.words(sentence));
>>[Mr., Burns, talks, with, Dr., Hibbert, ., But, why, ?]
Lemmatization
To lemmatize a single word, use the lemmatize method. This will produce a list of all the possible lemmas.
import com.rootroo.uralicnlp.UralicApi;
UralicApi api = new UralicApi();
System.out.println(api.lemmatize("voin", "fin"));
>> [voi, vuo, voida]
To mark word boundaries in compound words, pass an additional true to the lemmatize method:
import com.rootroo.uralicnlp.UralicApi;
UralicApi api = new UralicApi();
// The third argument turns on word boundary marking in compounds
System.out.println(api.lemmatize("luutapiiri", "fin", true));
>> [luu|tapiiri, luuta|piiri]
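In the output above, the `|` character marks the boundary between the parts of a compound. If you need the parts separately, a small helper can split on that character. This is a sketch with plain Java strings; `CompoundSplit` is a hypothetical name and not part of UralicNLP.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of UralicNLP.
final class CompoundSplit {

    // Splits a lemma such as "luu|tapiiri" at each boundary marker.
    // The pipe is escaped because String.split takes a regular expression.
    static List<String> parts(String lemma) {
        return Arrays.asList(lemma.split("\\|"));
    }

    public static void main(String[] args) {
        System.out.println(parts("luu|tapiiri")); // prints [luu, tapiiri]
        System.out.println(parts("luuta|piiri")); // prints [luuta, piiri]
    }
}
```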
Morphology
To analyze the morphology, including the part of speech, of a given word, use the analyze method. This will return all the possible morphological interpretations for the input word:
import com.rootroo.uralicnlp.UralicApi;
import java.util.HashMap;
UralicApi api = new UralicApi();
HashMap<String, Float> results = api.analyze("voin", "fin");
// Each key is one morphological reading of the word
for (String s : results.keySet()) {
    System.out.println(s);
}
>>voi+N+Sg+Gen
>>vuo+N+Pl+Ins
>>voida+V+Act+Ind+Prt+Sg1
>>voi+N+Pl+Ins
>>voida+V+Act+Ind+Prs+Sg1
The result is a HashMap where the keys are morphological readings and the values are the weights (NB most of the models do not have weights).
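Each key in the output above has the shape lemma+tag+tag+..., so it can be taken apart with ordinary string operations. The sketch below assumes that the lemma itself contains no `+` character, which holds for the readings shown here but is not guaranteed for every model. `ReadingParser` is a hypothetical helper, not part of UralicNLP.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of UralicNLP.
final class ReadingParser {

    // Everything before the first '+' is treated as the lemma.
    static String lemma(String reading) {
        int plus = reading.indexOf('+');
        return plus < 0 ? reading : reading.substring(0, plus);
    }

    // Everything after the first '+' is treated as a list of tags.
    static List<String> tags(String reading) {
        int plus = reading.indexOf('+');
        if (plus < 0 || plus == reading.length() - 1) {
            return Collections.emptyList();
        }
        return Arrays.asList(reading.substring(plus + 1).split("\\+"));
    }

    public static void main(String[] args) {
        String reading = "voida+V+Act+Ind+Prs+Sg1";
        System.out.println(lemma(reading)); // prints voida
        System.out.println(tags(reading));  // prints [V, Act, Ind, Prs, Sg1]
    }
}
```

With the tags in a list, checks such as `tags(reading).contains("V")` can select only the verb readings.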
You can also inflect words by using the generate method:
import com.rootroo.uralicnlp.UralicApi;
import java.util.HashMap;
UralicApi api = new UralicApi();
HashMap<String, Float> results = api.generate("voida+V+Act+Ind+Prt+Sg1", "fin");
// Each key is one generated word form
for (String s : results.keySet()) {
    System.out.println(s);
}
>>voin
The output is a HashMap in the same format as the one returned by analyze.
Disambiguation
The UralicNLP method analyze produces all the possible morphological readings of a word. It is more practical to parse full sentences, because the context can then be used to disambiguate the actual morphological reading. Note: you will need to install VISL CG-3 and ensure it is on the PATH environment variable that your IDE uses.
import com.rootroo.uralicnlp.Cg3;
import com.rootroo.uralicnlp.Cg3Word;
import com.rootroo.uralicnlp.Tokenizer;
import java.util.List;
Cg3 cg = new Cg3("fin");
Tokenizer tokenizer = new Tokenizer();
String sentence = "Kissa voi nauraa";
List<String> tokens = tokenizer.words(sentence);
System.out.println(cg.disambiguate(tokens));
>>[[<Kissa - N, <fin>, Prop, Sem/Geo, Sg, Nom, <W:0.000000>, @SUBJ>>, <kissa - N, <fin>, Sg, Nom, <W:0.000000>, @SUBJ>>, <Kissa - N, <fin>, Prop, Sg, Nom, <W:0.000000>, @SUBJ>>], [<voida - V, <fin>, Act, Ind, Prs, Sg3, <W:0.000000>, @+FAUXV>], [<nauraa - V, <fin>, Act, InfA, Sg, Lat, <W:0.000000>, @-FMAINV>]]
The result is a List of Cg3Word Lists. Because the disambiguator only narrows down the possible morphological readings, each word may still have more than one reading left. You can iterate over the results like so:
import com.rootroo.uralicnlp.Cg3;
import com.rootroo.uralicnlp.Cg3Word;
import com.rootroo.uralicnlp.Tokenizer;
import java.util.ArrayList;
import java.util.List;
Cg3 cg = new Cg3("fin");
Tokenizer tokenizer = new Tokenizer();
String sentence = "Kissa voi nauraa";
List<String> tokens = tokenizer.words(sentence);
ArrayList<ArrayList<Cg3Word>> disambiguatedSentence = cg.disambiguate(tokens);
// Outer list: one entry per word; inner list: the readings left for that word
for (ArrayList<Cg3Word> wordReadings : disambiguatedSentence) {
    for (Cg3Word wordReading : wordReadings) {
        System.out.println("Form: " + wordReading.form + " lemma: " + wordReading.lemma + " morphology: " + String.join(", ", wordReading.morphology));
    }
    System.out.println("---");
}
>>Form: Kissa lemma: Kissa morphology: N, <fin>, Prop, Sem/Geo, Sg, Nom, <W:0.000000>, @SUBJ>
>>Form: Kissa lemma: kissa morphology: N, <fin>, Sg, Nom, <W:0.000000>, @SUBJ>
>>Form: Kissa lemma: Kissa morphology: N, <fin>, Prop, Sg, Nom, <W:0.000000>, @SUBJ>
>>---
>>Form: voi lemma: voida morphology: V, <fin>, Act, Ind, Prs, Sg3, <W:0.000000>, @+FAUXV
>>---
>>Form: nauraa lemma: nauraa morphology: V, <fin>, Act, InfA, Sg, Lat, <W:0.000000>, @-FMAINV
>>---
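When a word still has several readings, as "Kissa" does above, you may want to narrow them down yourself, for example by dropping proper-noun readings. The sketch below shows one way to do that. To stay runnable without the library and VISL CG-3, it uses a stand-in `Reading` class whose form, lemma and morphology fields mirror the ones used in the loop above. The class, the `withoutTag` helper and the choice of the `Prop` tag are illustrative assumptions, and the data is copied from the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of UralicNLP.
final class ReadingFilter {

    // Stand-in for Cg3Word so that the sketch runs without the library.
    static final class Reading {
        final String form;
        final String lemma;
        final List<String> morphology;

        Reading(String form, String lemma, String... morphology) {
            this.form = form;
            this.lemma = lemma;
            this.morphology = Arrays.asList(morphology);
        }
    }

    // Keeps only the readings whose morphology does not contain the tag.
    static List<Reading> withoutTag(List<Reading> readings, String tag) {
        List<Reading> kept = new ArrayList<>();
        for (Reading reading : readings) {
            if (!reading.morphology.contains(tag)) {
                kept.add(reading);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Reading> kissa = Arrays.asList(
            new Reading("Kissa", "Kissa", "N", "<fin>", "Prop", "Sem/Geo", "Sg", "Nom", "<W:0.000000>", "@SUBJ>"),
            new Reading("Kissa", "kissa", "N", "<fin>", "Sg", "Nom", "<W:0.000000>", "@SUBJ>"),
            new Reading("Kissa", "Kissa", "N", "<fin>", "Prop", "Sg", "Nom", "<W:0.000000>", "@SUBJ>"));
        for (Reading reading : withoutTag(kissa, "Prop")) {
            System.out.println(reading.lemma); // prints kissa
        }
    }
}
```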
Universal Dependencies Parser
You can load a CoNLL-U formatted file and parse it by running:
import com.rootroo.uralicnlp.UDCollection;
import com.rootroo.uralicnlp.UDNode;
import com.rootroo.uralicnlp.UDSentence;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
// Read the CoNLL-U file as UTF-8
FileInputStream fis = new FileInputStream("sms_giellagas-ud-test.conllu");
InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
BufferedReader reader = new BufferedReader(isr);
UDCollection udCollection = new UDCollection(reader);
for (UDSentence sentence : udCollection) {
    for (UDNode word : sentence) {
        System.out.println(word.lemma + " " + word.pos + " " + word.deprelName());
    }
    System.out.println("---");
}
>>son PRON nsubj
>>tte ADV advmod:tmod
>>pi ADV advmod
>>... PUNCT punct
>>---
>>tt PRON nsubj
>>vuejjled VERB root
>>. PUNCT punct
>>---
UDCollection can be initialized either with a BufferedReader or String that contains CoNLL-U formatted data. The UDCollection consists of UDSentence objects which contain UDNode objects. Each UDNode corresponds to a word of a Universal Dependencies sentence and it has information such as lemma and part of speech. More about Universal Dependencies tags.
To parse an individual Universal Dependencies (CoNLL-U) formatted sentence, you can run the following:
import com.rootroo.uralicnlp.UDNode;
import com.rootroo.uralicnlp.UDSentence;
import com.rootroo.uralicnlp.UDTools;
// Columns are tab-separated; token lines are separated by \n
String conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_";
UDSentence sentence = UDTools.parseSentence(conl);
for (UDNode word : sentence) {
    System.out.println(word.lemma + " " + word.pos + " " + word.deprelName());
}
>>toinen ADJ nummod
>>palkinto NOUN root
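For readers unfamiliar with CoNLL-U, each token line in the string above has ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC. The following standalone sketch reads those columns with plain Java to make the layout visible. It is not how UralicNLP parses the format; `ConlluSketch` and its `Token` class are hypothetical, and multiword ranges and empty nodes are simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the CoNLL-U layout, not part of UralicNLP.
final class ConlluSketch {

    // Holds the columns used in this sketch; not UralicNLP's UDNode.
    static final class Token {
        final String form;
        final String lemma;
        final String upos;
        final int head;       // 0 means the token is the root
        final String deprel;

        Token(String form, String lemma, String upos, int head, String deprel) {
            this.form = form;
            this.lemma = lemma;
            this.upos = upos;
            this.head = head;
            this.deprel = deprel;
        }
    }

    static List<Token> parse(String conllu) {
        List<Token> tokens = new ArrayList<>();
        for (String line : conllu.split("\\R")) {
            // Skip blank lines and "# ..." comment lines.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] cols = line.split("\t");
            // Keep plain integer IDs only; skip ranges (1-2) and empty nodes (1.1).
            if (cols.length != 10 || !cols[0].matches("\\d+")) {
                continue;
            }
            // An unspecified HEAD ("_") is stored as -1.
            int head = cols[6].matches("\\d+") ? Integer.parseInt(cols[6]) : -1;
            tokens.add(new Token(cols[1], cols[2], cols[3], head, cols[7]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String conl = "# text = Toinen palkinto\n"
            + "1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n"
            + "2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_";
        for (Token token : parse(conl)) {
            System.out.println(token.lemma + " " + token.upos + " " + token.deprel);
        }
        // prints:
        // toinen ADJ nummod
        // palkinto NOUN root
    }
}
```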
Cite
If you use UralicNLP in an academic publication, please cite it as follows:
Hämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), [1345]. https://doi.org/10.21105/joss.01345
@article{uralicnlp_2019,
title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},
DOI={10.21105/joss.01345},
journal={Journal of Open Source Software},
author={Mika Hämäläinen},
year={2019},
volume={4},
number={37},
pages={1345}
}
The FST and CG tools and dictionaries come mostly from the GiellaLT repositories and Apertium.
Owner
- Name: Mika Hämäläinen
- Login: mikahama
- Kind: user
- Location: Helsinki
- Company: Fly for Points
- Website: http://mikakalevi.com
- Repositories: 35
- Profile: https://github.com/mikahama
PhD in NLP. Currently working at Metropolia University of Applied Sciences as an AI manager.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'UralicNLP: An NLP Library for Uralic Languages'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Mika
    family-names: Hämäläinen
identifiers:
  - type: doi
    value: 10.5281/zenodo.5816221
    description: Zenodo
repository-code: 'https://github.com/mikahama/uralicNLP-Java'
date-released: '2019-05-06'
preferred-citation:
  type: article
  authors:
    - family-names: "Hämäläinen"
      given-names: "Mika"
  doi: "10.21105/joss.01345"
  journal: "Journal of Open Source Software"
  title: "UralicNLP: An NLP Library for Uralic Languages"
  issue: 37
  volume: 4
  year: 2019
Issues and Pull Requests
Last synced: 5 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 1
repo1.maven.org: com.rootroo:uralicnlp
NLP tools (generation, analysis, lemmatization) for multiple languages: Finnish, Russian, German...
- Homepage: https://github.com/mikahama/uralicNLP-java
- Documentation: https://appdoc.app/artifact/com.rootroo/uralicnlp/
- License: CC BY-NC-ND 4.0
Latest release: 1.0
published almost 4 years ago
Dependencies
- com.googlecode.json-simple:json-simple 1.1.1
- fi.seco:hfst 1.1.5
- me.tongfei:progressbar 0.5.5
- junit:junit 4.11 test