LAS
LAS: an integrated language analysis tool for multiple languages - Published in JOSS (2016)
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 4 DOI reference(s) in README
- ✓ Academic publication links: links to joss.theoj.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.2%) to scientific vocabulary
Scientific Fields
Repository
Linguistic Analysis Command-Line Tool
Basic Info
Statistics
- Stars: 14
- Watchers: 5
- Forks: 1
- Open Issues: 1
- Releases: 15
Metadata Files
README.md
Language Analysis Tool
Language Analysis Command-Line Tool for lemmatizing, morphological analysis, inflected form generation, hyphenation and language identification of multiple languages.
These functionalities are of use as part of many workflows requiring natural language processing. Indeed, LAS has been used for example as part of a pipeline for entity recognition, in creating a contextual reader for texts in English, Finnish and Latin, and for processing a Finnish historical newspaper collection in preparation for data publication.
The tools backing these services are mostly not originally our own, but we've wrapped them for your convenience.
Program help:
```
las 1.5.13
Usage: las [lemmatize|analyze|inflect|recognize|identify|hyphenate] [options] [
Command: lemmatize
(locales: pt, mhr, fr, ru, myv, dk, it, mrj, liv, fi, de, es, tr, la, en, sv, udm, nl, mdf, sme, no)
Command: analyze
(locales: mhr, fr, myv, it, mrj, liv, fi, de, tr, la, en, sv, udm, mdf, sme)
Command: inflect
(locales: mhr, fr, myv, it, mrj, liv, fi, de, tr, en, sv, udm, mdf, sme)
Command: recognize
report word recognition rate (locales: mhr, fr, myv, it, mrj, liv, fi, de, tr, la, en, sv, udm, mdf, sme)
Command: identify
identify language (locales: zh-TW, fi, no, hr, ta, ar, fr, is, lv, eu, mt, bn, dk, uk, pa, ga, br, so, pt, cs, fr, gl, sr, zh-CN, mrj, el, it, ca, vi, tl, nl, bg, ko, liv, it, mk, oc, et, af, de, ru, yi, cy, en, udm, ur, mdf, myv, sme, ru, ht, ml, th, id, sq, sv, de, sv, tr, da, en, gu, he, es, kn, sk, es, hi, te, mr, an, sw, be, pt, nl, ja, ast, fi, ro, mhr, ne, lt, no, km, sl, fa, ms, hu, pl, la, tr)
Command: hyphenate
hyphenate (locales: nn, cop, in, sl, mhr, bg, sh, it, sr, uk, mn, mrj, da, liv, fi, hsb, es, eu, tr, hr, ia, ro, udm, mdf, pl, cy, pt, fr, ru, gl, myv, is, sk, ga, sa, zh, et, la, nb, cs, sv, el, ca, hu, nl, sme)
--locale
```
Installation and running
The LAS binaries at https://github.com/jiemakel/las/releases are actually Java JAR files with a tiny shell script prepended that runs the JAR. Thus, on a UNIX system, the downloaded file should be directly runnable, though it may first need to be made executable (e.g. chmod 0755). You can also run the JAR directly with your own JVM parameters, e.g. java -Xmx2G -jar las --help.
Recent versions of LAS build multiple binaries, where you can trade functionality for smaller file sizes.
The options are:

* las: complete package including all support for all languages, but weighing in at almost 600 megabytes
* las-fi: complete functionality for (only) Finnish, including edit distance fuzzy analysis for noisy (e.g. OCR errored) data as well as guessed word segmentation for words not in the lexicon (rarely needed)
* las-fi-small: basic functionality for (only) Finnish without fuzzy analysis or segmentation for guessed words, but a much smaller file size
* las-small: supports all languages, but provides only the basic functionality for Finnish
* las-non-fi: supports all languages apart from Finnish
Optimal mode of running
Some of the transducers used by LAS are really quite huge (the biggest two some ~760 megabytes). This is also why the executable package is a whopping 400-900 megabytes (depending on release). This size also means that each time the program is run, initial startup takes a significant time (which you can test by running las --help). After that, however, processing is fast. To use the tool optimally, you should therefore pass LAS as much data in a single run as possible. LAS should be able to efficiently process both large files and a large number of files. Another option is to not give LAS a filename at all, whereby the tool enters a streaming mode, processing input line by line.
When running on files, one should also select the appropriate --process-by mode. The default is to process by file, which is suitable for small files. However, if you have larger files, you should process either by paragraph (if you have such paragraphs, separated by two newlines) or by line, if you know sentences won't cross lines.
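Because startup dominates the cost, one invocation over many files is preferable to one invocation per file. The following is a minimal Python sketch of that pattern, not an official wrapper. It assumes a `las` executable is on the PATH; `build_command` and `run_batch` are hypothetical helper names, and only the subcommands and the `--locale` option shown in the program help are used.

```python
import subprocess
from typing import List, Optional


def build_command(command: str, files: List[str], locale: Optional[str] = None) -> List[str]:
    """Build one LAS invocation that covers all files (hypothetical helper)."""
    argv = ["las", command]
    if locale is not None:
        # --locale forces a particular locale, as described in the program help.
        argv += ["--locale", locale]
    return argv + list(files)


def run_batch(command: str, files: List[str], locale: Optional[str] = None) -> None:
    """Run LAS once over every file, so the startup cost is paid a single time.

    Assumption: a `las` executable is available on the PATH.
    """
    subprocess.run(build_command(command, files, locale), check=True)
```

For example, `run_batch("lemmatize", ["a.txt", "b.txt"], locale="fi")` would start LAS once for both files instead of twice.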
Functionalities
The library is also exposed as a web service at http://demo.seco.tkk.fi/las/ . The documentation that follows is mostly equivalent to the one there, with the exception that http://demo.seco.tkk.fi/las/ has live examples where you can experiment with the different functionalities and inputs.
Language detection
Run by las identify <files> to operate on files, or las identify for stream operation. If run on files, the output will be saved to files with the suffix of .language added to the filename.
Tries to recognize the language of an input. In total, the language detection supports 78 locales, combining results from three sources:
- The language-detector library (locales af, an, ar, ast, be, bg, bn, br, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, ga, gl, gu, he, hi, hr, ht, hu, id, is, it, ja, km, kn, ko, lt, lv, mk, ml, mr, ms, mt, ne, nl, no, oc, pa, pl, pt, ro, ru, sk, sl, so, sq, sr, sv, sw, ta, te, th, tl, tr, uk, ur, vi, yi, zh-CN, zh-TW),
- custom code based on the list of cues at the Wikipedia language recognition chart (locales cs, de, en, es, et, fi, fr, hu, it, pl, pt, ro, ru, sk, sv), and
- finite state transducers provided by the HFST, Omorfi and Giellatekno projects (locales de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm)
Example:
Input: "The quick brown fox jumps over the lazy dog"
Output: {
"locale" : "en",
"certainty" : 0.6803500000000001,
"details" : {
"languageRecognizerResults" : { "en" : 0.1973 },
"languageDetectorResults" : [ { "en" : 1.0 } ],
"hfstAcceptorResults" : [
{ "en" : 0.84375 },
{ "fi" : 0.09375 },
{ "la" : 0.010416666666666666 },
{ "tr" : 0.010416666666666666 },
{ "sv" : 0.010416666666666666 },
{ "sme" : 0.010416666666666666 },
{ "it" : 0.010416666666666666 },
{ "de" : 0.010416666666666666 }
]
}
}
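In the example above, the reported certainty happens to equal the arithmetic mean of the scores the three sources gave to the winning locale: (0.1973 + 1.0 + 0.84375) / 3 = 0.68035. The Python sketch below checks that observation on the example output (with the HFST list abbreviated). Whether LAS always combines scores this way is an assumption drawn from this single example, and `score_for` and `mean_source_score` are hypothetical helper names.

```python
import json

# The example output from above, with hfstAcceptorResults abbreviated.
identify_output = json.loads("""
{
  "locale": "en",
  "certainty": 0.6803500000000001,
  "details": {
    "languageRecognizerResults": {"en": 0.1973},
    "languageDetectorResults": [{"en": 1.0}],
    "hfstAcceptorResults": [{"en": 0.84375}, {"fi": 0.09375}]
  }
}
""")


def score_for(results, locale):
    """Return the score one source gave to `locale`.

    A source reports either a single dict or a list of one-entry dicts.
    Treating a missing locale as 0.0 is an assumption of this sketch.
    """
    entries = results if isinstance(results, list) else [results]
    for entry in entries:
        if locale in entry:
            return entry[locale]
    return 0.0


def mean_source_score(output):
    """Average the per-source scores for the winning locale."""
    locale = output["locale"]
    scores = [score_for(r, locale) for r in output["details"].values()]
    return sum(scores) / len(scores)
```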
Lemmatization
Run by las lemmatize <files> to operate on files, or las lemmatize for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .lemmatized added to the filename.
Lemmatizes the input into its base form. Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects where available (locales de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm).
Snowball stemmers are used for locales dk, es, nl, no, pt, ru (not used: de, en, fi, fr, it, sv)
Note that the quality and scope of the lemmatization varies wildly between languages.
Examples:
Input: "Bobs letters about the missing money from the bank had created a huge kerfuffle"
Output: "bob letter about the miss money from the bank have create a huge kerfuffle"
Input: "Albert osti fagotin ja töräytti puhkuvan melodian maakunnanvoudinvirastossa."
Output: "Albert ostaa fagotti ja töräyttää puhkua melodia maakuntavoutivirasto ."
Morphological analysis
Run by las analyze <files> to operate on files, or las analyze for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .analysis added to the filename.
Gives a morphological analysis of the text. Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects.
Supported locales: de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm
Note that the quality and scope of analysis as well as tags returned vary wildly between languages (and see below for Finnish specifically, which has the most support).
Example:
Input: "Bobs letters"
Output:
[ {
"word": "Bobs",
"analysis": [ {
"weight": 1,
"wordParts": [ {
"lemma": "bob",
"tags": {
"NN2-VVZ": [ "NN2-VVZ" ]
}
} ],
"globalTags": {
"BEST_MATCH": [ "TRUE" ]
}
} ]
}, {
"word": "letters",
"analysis": [ {
"weight": 1,
"wordParts": [ {
"lemma": "letter",
"tags": {
"NN2": [ "NN2" ]
}
} ],
"globalTags": {
"BEST_MATCH": [ "TRUE" ]
}
} ]
} ]
Input: "Albert osti"
Output:
[ {
"word" : "Albert",
"analysis" : [ {
"weight" : 0.099609375,
"wordParts" : [ {
"lemma" : "Albert",
"tags" : {
"SEGMENT" : [ "Albert" ],
"KTN" : [ "5" ],
"UPOS" : [ "PROPN" ],
"NUM" : [ "SG" ],
"PROPER" : [ "LAST" ],
"CASE" : [ "NOM" ]
}
} ],
"globalTags" : {
"HEAD" : [ "2" ],
"DEPREL" : [ "punct" ],
"POS_MATCH" : [ "TRUE" ],
"BEST_MATCH" : [ "TRUE" ]
}
}, {
"weight" : 0.099609375,
"wordParts" : [ {
"lemma" : "Albert",
"tags" : {
"SEGMENT" : [ "Albert" ],
"KTN" : [ "5" ],
"UPOS" : [ "PROPN" ],
"NUM" : [ "SG" ],
"SEM" : [ "MALE" ],
"PROPER" : [ "FIRST" ],
"CASE" : [ "NOM" ]
}
} ],
"globalTags" : {
"HEAD" : [ "2" ],
"DEPREL" : [ "punct" ],
"POS_MATCH" : [ "TRUE" ],
"BEST_MATCH" : [ "TRUE" ]
}
} ]
}, {
"word" : "osti",
"analysis" : [ {
"weight" : 0.099609375,
"wordParts" : [ {
"lemma" : "ostaa",
"tags" : {
"TENSE" : [ "PAST" ],
"SEGMENT" : [ "ost", "{MB}i" ],
"KTN" : [ "53" ],
"UPOS" : [ "VERB" ],
"MOOD" : [ "INDV" ],
"PERS" : [ "SG3" ],
"INFLECTED_FORM" : [ "V N Nom Sg" ],
"VOICE" : [ "ACT" ],
"INFLECTED" : [ "ostaminen" ]
}
} ],
"globalTags" : {
"HEAD" : [ "0" ],
"DEPREL" : [ "punct" ],
"POS_MATCH" : [ "TRUE" ],
"BEST_MATCH" : [ "TRUE" ]
}
} ]
} ]
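The analysis output can be post-processed with any JSON library. As a minimal Python sketch, the hypothetical helper `best_lemmas` below picks, for each word, the first analysis whose globalTags mark it as BEST_MATCH and returns its lemma. Concatenating the lemmas of multiple wordParts and falling back to the surface form when no analysis exists are assumptions of this sketch, not documented LAS behaviour.

```python
from typing import Any, Dict, List


def best_lemmas(analysis_output: List[Dict[str, Any]]) -> List[str]:
    """Return one lemma per word, preferring analyses tagged BEST_MATCH."""
    lemmas = []
    for token in analysis_output:
        analyses = token.get("analysis", [])
        best = [
            a for a in analyses
            if "TRUE" in a.get("globalTags", {}).get("BEST_MATCH", [])
        ]
        chosen = (best or analyses)[:1]
        if not chosen:
            # Assumption: keep the surface form when there is no analysis.
            lemmas.append(token["word"])
            continue
        # Assumption: join the lemmas of all word parts into one string.
        lemmas.append("".join(part["lemma"] for part in chosen[0]["wordParts"]))
    return lemmas


# A trimmed version of the "Albert osti" output shown above.
sample = [
    {"word": "Albert", "analysis": [
        {"weight": 0.099609375,
         "wordParts": [{"lemma": "Albert", "tags": {"UPOS": ["PROPN"]}}],
         "globalTags": {"BEST_MATCH": ["TRUE"]}}]},
    {"word": "osti", "analysis": [
        {"weight": 0.099609375,
         "wordParts": [{"lemma": "ostaa", "tags": {"UPOS": ["VERB"]}}],
         "globalTags": {"BEST_MATCH": ["TRUE"]}}]},
]
```

Calling `best_lemmas(sample)` yields the same lemmas the lemmatize command reports for these two words.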
Inflected form generation
Run by las inflect <files> --forms <forms> to operate on files, or las inflect --forms <forms> for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .inflected added to the filename.
Transforms the text given a set of inflection forms (e.g. V N Nom Sg, N Nom Pl, A Pos Nom Pl), by default also converting words not matching the inflection forms to their base form. This may be useful, for example, as a pre-processing step when matching text against a vocabulary whose entries are in e.g. plural form.
Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Note that the inflection form syntaxes differ wildly between languages (in practice, it's often easiest to run analysis on an inflected form to discover how to recreate that form).
Supported locales: de, en, fi, fr, it, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm
Examples:
Input: "Bobs letter about the missing money from the bank creates a large kerfuffle", "NN2,VVN,AJS"
Output: "bobs letters about thes misses moneys from thes banks CREATED As largest kerfuffle"
Input: "Albert osti fagotin ja töräytti puhkuvan melodian.", "V N Nom Sg, N Nom Pl, A Pos Nom Pl"
Output: "Albert ostaminen fagotit ja töräyttäminen puhkuminen melodiat ."
Word recognition rate reporting
Run by las recognize <files> to operate on files, or las recognize for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .recognition added to the filename.
Report the number of words a particular language processor recognizes. This may be useful for e.g. estimating the number of OCR errors in automatically scanned historical newspapers.
Supported locales: de, en, fi, fr, it, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm, la
Examples:
Input: "?l»vatcssaan Satakunnan maanwiljelystotoukscu, joka pidettiin Kautuau tehtaalla Euran pitäjässä"
Output:
{
"locale" : "fi",
"recognized" : 7,
"unrecognized" : 3,
"rate" : 0.7
}
Input: "B»bs letters about the missiing money from the bank had created a huge kerfussle"
Output:
{
"locale" : "en",
"recognized" : 11,
"unrecognized" : 3,
"rate" : 0.7857142857142857
}
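In both examples, the rate field equals recognized / (recognized + unrecognized): 7 / 10 = 0.7 and 11 / 14 ≈ 0.7857. A minimal Python sketch of that arithmetic follows; `recognition_rate` is a hypothetical helper, and returning 0.0 for empty input is an assumption of the sketch.

```python
def recognition_rate(recognized: int, unrecognized: int) -> float:
    """Share of words recognized, as reported in the "rate" field."""
    total = recognized + unrecognized
    # Assumption: report 0.0 rather than fail when there are no words.
    return recognized / total if total else 0.0
```

A low rate on text in a known language can then serve as a rough indicator of OCR noise.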
Hyphenation
Run by las hyphenate <files> to operate on files, or las hyphenate for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .hyphenated added to the filename.
Hyphenates the given text. Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Those provided by HFST have been automatically translated from the TeX CTAN distribution's hyphenation rulesets.
Supported locales: bg, ca, cop, cs, cy, da, el, es, et, eu, fi, fr, ga, gl, hr, hsb, hu, ia, in, is, it, la, liv, mdf, mhr, mn, mrj, myv, nb, nl, nn, pl, pt, ro, ru, sa, sh, sk, sl, sme, sr, sv, tr, udm, uk, zh
Examples:
Input: "Albert osti fagotin ja töräytti puhkuvan melodian."
Output: "al-bert os-ti fa-go-tin ja tö-räyt-ti puh-ku-van me-lo-dian"
Input: "Månens yta består i stora drag av två olika typer av landskap"
Output: "må-nens y-ta be-står i sto-ra d-rag av två o-li-ka ty-per av lan-d-skap"
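When run on files, each command described above writes its result to a file named after the input with a command-specific suffix. The Python sketch below collects those suffixes in one place; `output_path` is a hypothetical helper, and it assumes the suffix is appended to the full input filename, including any existing extension.

```python
# Suffixes as stated in the per-command descriptions above.
OUTPUT_SUFFIXES = {
    "identify": ".language",
    "lemmatize": ".lemmatized",
    "analyze": ".analysis",
    "inflect": ".inflected",
    "recognize": ".recognition",
    "hyphenate": ".hyphenated",
}


def output_path(input_path: str, command: str) -> str:
    """Guess where LAS writes the output for `input_path`.

    Assumption: the suffix is appended to the whole filename.
    """
    return input_path + OUTPUT_SUFFIXES[command]
```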
Things to know when using LAS for analyzing Finnish
While LAS supports many languages, the most complete support it has is for Finnish. However, this also makes the functionality complex. Thus, it is useful to delve deeper into what is actually happening.
First, the Finnish analysis is based on a fork of the Omorfi morphological analyzer for Finnish. What the user needs to know about this is that Omorfi normally provides 1) all possible morphological analyses of a word and 2) only works for words that are included in its lexicon and rules.
To this baseline, the functionality in LAS (or the modified Omorfi) adds:

1. support for better sentence splitting and tokenization from Turku NLP
1. support for guessing the most probable of multiple analyses
    1. by using case matching of the initial letter (if not the first word in a sentence)
    1. by using machine learned disambiguation from Turku NLP
    1. by using word class and inflection-based rules
    1. by using word frequency information from the Finnish Wikipedia
1. lemma guessing for words outside the lexicon
1. support for Early Modern Finnish inflection
1. support for edit-distance error correction (by up to 2 steps) in a guessed analysis
1. automatic dehyphenation
Final note: In analysis, Omorfi supports initial capitalization of words, which is necessary for analyzing the first word of a sentence without fuzzy matching. However, nothing else is done. So, pariisi will return only pari as the lemma, and not Pariisi. (As a sidenote, if you actually do want case insensitive matching, you can convert every word into initial uppercase, but that will interfere with the disambiguation.)
Examples of the various rules in action in lemmatization:
* Pariisi -> pari (initial case is ignored for first word in a sentence)
* Pariisissa -> Pariisi (cannot be an inflected form of pari)
* Pariisi on -> Pariisi olla (machine learned disambiguation guesses correctly)
* pariisi on -> pari olla (uppercasing not allowed)
* oli Pariisi -> olla Pariisi (case change not allowed after first word in a sentence)
* oli pariisi -> olla pari (case change not allowed after first word in a sentence)
* kuin -> kuin (instead of kuu, based on word class and inflection rules)
* twiittasin -> tviitata (guessed; twiittasin with --no-guess)
* Leh>tim»ehen -> Lehtimies with --max-edit-distance 2
* Helsingin -> Helsinki (instead of the last name Helsing, based on Wikipedia frequency)
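To illustrate what an edit distance of 2 means for the noisy example above, the following Python sketch computes the plain Levenshtein distance between the OCR-damaged token and a clean form. This is a generic textbook algorithm, not LAS's own implementation (which is not shown here), and the clean form "Lehtimiehen" is an assumption about the intended word.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


# Assumed clean form; removing ">" and replacing "»" with "i" gives 2 edits.
distance = levenshtein("Leh>tim»ehen", "Lehtimiehen")
```

Under that assumption the token is within reach of --max-edit-distance 2, but would not be with a limit of 1.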
Evaluation
Below, LAS lemmatisation accuracy on Finnish is compared to the neural network version of the TurkuNLP parser (see https://universaldependencies.org/conll18/results-lemmas.html).
| Dataset | LAS | TurkuNLP-NN |
|---------|-----|-------------|
| FI-FTB | 93.44% | 97.02% |
| FI-PUD | 93.34% | 95.07% |
| FI-TDT | 92.00% | 95.32% |
Contributing
If you encounter problems, open an issue on GitHub. Pull requests are also welcome. If you wish to delve deeper into how the tool works, be aware that this repository contains just one of two front ends. Many more lines of code are contained in the seco-lexicalanalysis repository, which holds the code common to this command line version and the web service version (seco-lexicalanalysis-play). They in turn refer to seco-hfst. In addition, the in-depth work on integrating and expanding the Finnish pipeline included in the tool builds heavily on our omorfi fork.
Owner
- Name: Human Sciences – Computing Interaction Research Group
- Login: hsci-r
- Kind: organization
- Location: University of Helsinki
- Website: http://heldig.fi/hsci/
- Twitter: hsci_research
- Repositories: 61
- Profile: https://github.com/hsci-r
CodeMeta (codemeta.json)
{
"@context": "https://raw.githubusercontent.com/mbjones/codemeta/master/codemeta.jsonld",
"@type": "Code",
"author": [
{
"@id": "http://orcid.org/0000-0002-8366-8414",
"@type": "Person",
"email": "eetu.makela@iki.fi",
"name": "Eetu Mäkelä",
"affiliation": "Aalto University"
}
],
"identifier": "http://dx.doi.org/10.5281/zenodo.160256",
"codeRepository": "https://github.com/jiemakel/las",
"datePublished": "2016-10-12",
"dateModified": "2016-10-12",
"dateCreated": "2016-06-28",
"description": "Lexical Analysis Command-Line Tool for lemmatizing, lexical analysis, inflected form generation and language identification of multiple languages.",
"keywords": "NLP, lexical analysis, morphological analysis, lemmatization, language identification",
"license": "MIT",
"title": "Lexical Analysis Tool",
"version": "v1.4.9"
}
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Eetu Mäkelä | e****a@i****i | 62 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 4
- Total pull requests: 1
- Average time to close issues: 12 months
- Average time to close pull requests: about 9 hours
- Total issue authors: 3
- Total pull request authors: 1
- Average comments per issue: 1.75
- Average comments per pull request: 1.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jiemakel (2)
- jukkahuhtamaki (1)
- mjlassila (1)
Pull Request Authors
- sasuode (1)