com.rootroo

Multilingual Natural Language Processing for Java

https://github.com/mikahama/uralicnlp-java

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

java maven natural-language-processing nlg nlp tokenization
Last synced: 4 months ago

Repository

Multilingual Natural Language Processing for Java

Basic Info
  • Host: GitHub
  • Owner: mikahama
  • License: other
  • Language: Java
  • Default Branch: main
  • Homepage:
  • Size: 227 KB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
java maven natural-language-processing nlg nlp tokenization
Created almost 4 years ago · Last pushed about 2 years ago
Metadata Files
Readme · Contributing · Funding · Code of conduct · Citation · Security

README.md

UralicNLP - Multilingual Natural Language Processing for Java


UralicNLP can produce morphological analyses, generate morphological forms, lemmatize words and give lexical information about words in Uralic and other languages. Supported languages include Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow & Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami.

See the list of supported languages

Check out UralicNLP for Python

Installation

UralicNLP is available through Maven; all you need to do is add the following to your pom.xml:

<dependencies>
    <dependency>
        <groupId>com.rootroo</groupId>
        <artifactId>uralicnlp</artifactId>
        <version>1.0</version>
    </dependency>
</dependencies>

You can also download the JAR file from the GitHub releases page, but then you may need to download UralicNLP's dependencies by hand.

If you want to use the Constraint Grammar features (com.rootroo.uralicnlp.Cg3), you will also need to install VISL CG-3.

Download Models

In order to use any of the language-specific features, you will need to download the models for each language by passing the language's ISO code to the download method:

import com.rootroo.uralicnlp.UralicApi;

UralicApi api = new UralicApi();
api.download("fin");

The models will be downloaded to the .uralicnlp folder in your home directory.
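As a plain-Java sketch (not part of the UralicNLP API), the download location described above can be computed like this; the `modelDir` helper is hypothetical, introduced here only for illustration:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModelDir {
    // Resolve the default model directory described above:
    // a ".uralicnlp" folder inside the user's home directory.
    static Path modelDir() {
        return Paths.get(System.getProperty("user.home"), ".uralicnlp");
    }

    public static void main(String[] args) {
        System.out.println("Models are stored under: " + modelDir());
    }
}
```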

Tokenization

You can tokenize a text into sentences and words. This method supports abbreviations in languages that appear in a Universal Dependencies treebank.

import com.rootroo.uralicnlp.Tokenizer;

Tokenizer tokenizer = new Tokenizer();
String sentence = "Mr. Burns talks with Dr. Hibbert. But why?";
System.out.println(tokenizer.tokenize(sentence));
>>[[Mr., Burns, talks, with, Dr., Hibbert, .], [But, why, ?]]

The output is a List of tokenized sentences that are Lists of strings, where each string represents a tokenized word.
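Assuming tokenize returns a List of Lists of Strings as described, the nested result can be walked with ordinary loops. The sketch below is plain Java that uses a hardcoded copy of the output above in place of a real Tokenizer, so it runs without the library; `joinSentences` is a hypothetical helper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WalkTokens {
    // Join each tokenized sentence back into a space-separated string.
    static List<String> joinSentences(List<List<String>> sentences) {
        List<String> joined = new ArrayList<>();
        for (List<String> sentence : sentences) {
            joined.add(String.join(" ", sentence));
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hardcoded stand-in for tokenizer.tokenize(sentence):
        // a list of sentences, each a list of word tokens.
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("Mr.", "Burns", "talks", "with", "Dr.", "Hibbert", "."),
                Arrays.asList("But", "why", "?"));
        for (String s : joinSentences(sentences)) {
            System.out.println(s);
        }
    }
}
```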

It is also possible to tokenize text only on a sentence level:

import com.rootroo.uralicnlp.Tokenizer;

Tokenizer tokenizer = new Tokenizer();
String sentence = "Mr. Burns talks with Dr. Hibbert. But why?";
System.out.println(tokenizer.sentences(sentence));
>>[Mr. Burns talks with Dr. Hibbert., But why?]

Or on a word level:

import com.rootroo.uralicnlp.Tokenizer;

Tokenizer tokenizer = new Tokenizer();
String sentence = "Mr. Burns talks with Dr. Hibbert. But why?";
System.out.println(tokenizer.words(sentence));
>>[Mr., Burns, talks, with, Dr., Hibbert, ., But, why, ?]

Lemmatization

To lemmatize a single word, use the lemmatize method. This will produce a list of all the possible lemmas.

import com.rootroo.uralicnlp.UralicApi;

UralicApi api = new UralicApi();
System.out.println(api.lemmatize("voin", "fin"));
>> [voi, vuo, voida]

To mark word boundaries in compound words, pass an additional true to the lemmatize method:

import com.rootroo.uralicnlp.UralicApi;

UralicApi api = new UralicApi();
System.out.println(api.lemmatize("luutapiiri", "fin", true));
>> [luu|tapiiri, luuta|piiri]

Morphology

To analyze the morphology, including the part of speech, of a given word, use the analyze method. This will return all the possible morphological interpretations of the input word:

import com.rootroo.uralicnlp.UralicApi;

UralicApi api = new UralicApi();
HashMap<String, Float> results = api.analyze("voin", "fin");
for(String s : results.keySet()){
    System.out.println(s);
}

>>voi+N+Sg+Gen
>>vuo+N+Pl+Ins
>>voida+V+Act+Ind+Prt+Sg1
>>voi+N+Pl+Ins
>>voida+V+Act+Ind+Prs+Sg1

The result is a HashMap where the keys are morphological readings and the values are their weights (note that most of the models do not have weights).
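Because the readings come back as a HashMap from String to Float, they can be ordered by weight with standard collections code. The sketch below is plain Java with hardcoded readings and invented weights (illustrative values only, not real model output); `byWeight` is a hypothetical helper, and it assumes the weighted-FST convention that a lower weight means a more likely reading:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortReadings {
    // Return the readings ordered by ascending weight.
    static List<String> byWeight(Map<String, Float> readings) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(readings.entrySet());
        entries.sort(Map.Entry.comparingByValue());
        List<String> ordered = new ArrayList<>();
        for (Map.Entry<String, Float> e : entries) {
            ordered.add(e.getKey());
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Float> readings = new HashMap<>();
        // Hypothetical weights for illustration only.
        readings.put("voi+N+Sg+Gen", 1.5f);
        readings.put("voida+V+Act+Ind+Prs+Sg1", 0.2f);
        System.out.println(byWeight(readings));
    }
}
```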

You can also inflect words by using the generate method:

import com.rootroo.uralicnlp.UralicApi;

UralicApi api = new UralicApi();
HashMap<String, Float> results = api.generate("voida+V+Act+Ind+Prt+Sg1", "fin");
for(String s : results.keySet()){
    System.out.println(s);
}
>>voin

The output is a HashMap of the same form as the one returned by analyze.

Disambiguation

The UralicNLP method analyze produces a list of all the possible morphological readings of a word. It is often more practical to parse full sentences, because the context can then be used to disambiguate the actual morphological reading. Note: you will need to install VISL CG-3 and ensure it is on the PATH environment variable in your IDE.

import com.rootroo.uralicnlp.Cg3;
import com.rootroo.uralicnlp.Tokenizer;
import com.rootroo.uralicnlp.Cg3Word;

Cg3 cg = new Cg3("fin");
Tokenizer tokenizer = new Tokenizer();
String sentence = "Kissa voi nauraa";
List<String> tokens = tokenizer.words(sentence);
System.out.println(cg.disambiguate(tokens));
>>[[<Kissa - N, <fin>, Prop, Sem/Geo, Sg, Nom, <W:0.000000>, @SUBJ>>, <kissa - N, <fin>, Sg, Nom, <W:0.000000>, @SUBJ>>, <Kissa - N, <fin>, Prop, Sg, Nom, <W:0.000000>, @SUBJ>>], [<voida - V, <fin>, Act, Ind, Prs, Sg3, <W:0.000000>, @+FAUXV>], [<nauraa - V, <fin>, Act, InfA, Sg, Lat, <W:0.000000>, @-FMAINV>]]

The result is a List of Cg3Word Lists. Because the disambiguator only narrows down the possible morphological readings, each word may still have more than one reading left. You can iterate over the results like so:

import com.rootroo.uralicnlp.Cg3;
import com.rootroo.uralicnlp.Tokenizer;
import com.rootroo.uralicnlp.Cg3Word;

Cg3 cg = new Cg3("fin");
Tokenizer tokenizer = new Tokenizer();
String sentence = "Kissa voi nauraa";
List<String> tokens = tokenizer.words(sentence);
ArrayList<ArrayList<Cg3Word>> disambiguatedSentence = cg.disambiguate(tokens);
for(ArrayList<Cg3Word> wordReadings : disambiguatedSentence){
    for(Cg3Word wordReading : wordReadings){
        System.out.println("Form: " + wordReading.form + " lemma: " + wordReading.lemma + " morphology: " + String.join(", ", wordReading.morphology));
    }
    System.out.println("---");
}

>>Form: Kissa lemma: Kissa morphology: N, <fin>, Prop, Sem/Geo, Sg, Nom, <W:0.000000>, @SUBJ>
>>Form: Kissa lemma: kissa morphology: N, <fin>, Sg, Nom, <W:0.000000>, @SUBJ>
>>Form: Kissa lemma: Kissa morphology: N, <fin>, Prop, Sg, Nom, <W:0.000000>, @SUBJ>
>>---
>>Form: voi lemma: voida morphology: V, <fin>, Act, Ind, Prs, Sg3, <W:0.000000>, @+FAUXV
>>---
>>Form: nauraa lemma: nauraa morphology: V, <fin>, Act, InfA, Sg, Lat, <W:0.000000>, @-FMAINV
>>---

Universal Dependencies Parser

You can load a CoNLL-U formatted file and parse it by running:

import com.rootroo.uralicnlp.UDSentence;
import com.rootroo.uralicnlp.UDCollection;
import com.rootroo.uralicnlp.UDNode;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

FileInputStream fis = new FileInputStream("sms_giellagas-ud-test.conllu");
InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
BufferedReader reader = new BufferedReader(isr);

UDCollection udCollection = new UDCollection(reader);
for(UDSentence sentence : udCollection){
    for(UDNode word : sentence){
        System.out.println(word.lemma + " " + word.pos + " " + word.deprelName());
    }
    System.out.println("---");
}
>>son PRON nsubj
>>tte ADV advmod:tmod
>>pi ADV advmod
>>... PUNCT punct
>>---
>>tt PRON nsubj
>>vuejjled VERB root
>>. PUNCT punct
>>---

UDCollection can be initialized with either a BufferedReader or a String containing CoNLL-U formatted data. A UDCollection consists of UDSentence objects, which contain UDNode objects. Each UDNode corresponds to a word of a Universal Dependencies sentence and carries information such as the lemma and part of speech. More about Universal Dependencies tags.

To parse an individual Universal Dependencies (CoNLL-U) formatted sentence, you can run the following:

import com.rootroo.uralicnlp.UDSentence;
import com.rootroo.uralicnlp.UDTools;
import com.rootroo.uralicnlp.UDNode;

String conl = "# text = Toinen palkinto\n1\tToinen\ttoinen\tADJ\tNum\tCase=Nom\t2\tnummod\t_\t_\n2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_";
UDSentence sentence = UDTools.parseSentence(conl);
for(UDNode word : sentence){
    System.out.println(word.lemma + " " + word.pos + " " + word.deprelName());
}

>>toinen ADJ nummod
>>palkinto NOUN root
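For reference, each non-comment line of a CoNLL-U sentence carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A plain-Java split illustrates the format independently of UralicNLP; `fields` is a hypothetical helper, not a library method:

```java
public class ConlluFields {
    // Split one CoNLL-U token line into its ten tab-separated fields.
    static String[] fields(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        String line = "2\tpalkinto\tpalkinto\tNOUN\tN\tCase=Nom\t0\troot\t_\t_";
        String[] f = fields(line);
        // Index 2 is LEMMA, index 3 is UPOS, index 7 is DEPREL.
        System.out.println("lemma=" + f[2] + " upos=" + f[3] + " deprel=" + f[7]);
    }
}
```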

Cite

If you use UralicNLP in an academic publication, please cite it as follows:

Hämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), 1345. https://doi.org/10.21105/joss.01345

@article{uralicnlp_2019, 
    title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},
    DOI={10.21105/joss.01345}, 
    journal={Journal of Open Source Software}, 
    author={Mika Hämäläinen}, 
    year={2019}, 
    volume={4},
    number={37},
    pages={1345}
}

The FST and CG tools and dictionaries come mostly from the GiellaLT repositories and Apertium.

Owner

  • Name: Mika Hämäläinen
  • Login: mikahama
  • Kind: user
  • Location: Helsinki
  • Company: Fly for Points

PhD in NLP. Currently working at Metropolia University of Applied Sciences as an AI manager.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'UralicNLP: An NLP Library for Uralic Languages'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Mika
    family-names: Hämäläinen
identifiers:
  - type: doi
    value: 10.5281/zenodo.5816221
    description: Zenodo
repository-code: 'https://github.com/mikahama/uralicNLP-Java'
date-released: '2019-05-06'
preferred-citation:
  type: article
  authors:
  - family-names: "Hämäläinen"
    given-names: "Mika"
  doi: "10.21105/joss.01345"
  journal: "Journal of Open Source Software"
  title: "UralicNLP: An NLP Library for Uralic Languages"
  issue: 37
  volume: 4
  year: 2019


Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
repo1.maven.org: com.rootroo:uralicnlp

NLP tools (generation, analysis, lemmatization) for multiple languages: Finnish, Russian, German...

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 32.0%
Stargazers count: 35.0%
Average: 38.9%
Forks count: 39.8%
Dependent packages count: 48.9%
Last synced: 4 months ago

Dependencies

pom.xml maven
  • com.googlecode.json-simple:json-simple 1.1.1
  • fi.seco:hfst 1.1.5
  • me.tongfei:progressbar 0.5.5
  • junit:junit 4.11 test