LAS

LAS: an integrated language analysis tool for multiple languages - Published in JOSS (2016)

https://github.com/hsci-r/las

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 40% confidence

Last synced: 4 months ago

Repository

Linguistic Analysis Command-Line Tool

Basic Info

Host: GitHub
Owner: hsci-r
License: mit
Language: Scala
Default Branch: master
Homepage:
Size: 48.8 KB

Statistics

Stars: 14
Watchers: 5
Forks: 1
Open Issues: 1
Releases: 15

Created almost 11 years ago · Last pushed over 6 years ago

Metadata Files

Readme License Codemeta

Language Analysis Tool

Language Analysis Command-Line Tool for lemmatizing, morphological analysis, inflected form generation, hyphenation and language identification of multiple languages.

These functionalities are useful in many workflows that require natural language processing. LAS has been used, for example, as part of a pipeline for entity recognition, in creating a contextual reader for texts in English, Finnish and Latin, and for processing a Finnish historical newspaper collection in preparation for data publication.

The tools backing these services are mostly not originally our own, but we've wrapped them for your convenience.

```
Command: lemmatize
  (locales: pt, mhr, fr, ru, myv, dk, it, mrj, liv, fi, de, es, tr, la, en, sv, udm, nl, mdf, sme, no)
Command: analyze
  (locales: mhr, fr, myv, it, mrj, liv, fi, de, tr, la, en, sv, udm, mdf, sme)
Command: inflect
  (locales: mhr, fr, myv, it, mrj, liv, fi, de, tr, en, sv, udm, mdf, sme)
Command: recognize
  report word recognition rate (locales: mhr, fr, myv, it, mrj, liv, fi, de, tr, la, en, sv, udm, mdf, sme)
Command: identify
  identify language (locales: zh-TW, fi, no, hr, ta, ar, fr, is, lv, eu, mt, bn, dk, uk, pa, ga, br, so, pt, cs, fr, gl, sr, zh-CN, mrj, el, it, ca, vi, tl, nl, bg, ko, liv, it, mk, oc, et, af, de, ru, yi, cy, en, udm, ur, mdf, myv, sme, ru, ht, ml, th, id, sq, sv, de, sv, tr, da, en, gu, he, es, kn, sk, es, hi, te, mr, an, sw, be, pt, nl, ja, ast, fi, ro, mhr, ne, lt, no, km, sl, fa, ms, hu, pl, la, tr)
Command: hyphenate
  hyphenate (locales: nn, cop, in, sl, mhr, bg, sh, it, sr, uk, mn, mrj, da, liv, fi, hsb, es, eu, tr, hr, ia, ro, udm, mdf, pl, cy, pt, fr, ru, gl, myv, is, sk, ga, sa, zh, et, la, nb, cs, sv, el, ca, hu, nl, sme)

  --locale
        possible locales
  --forms
        inflection forms for inflect/analyze
  --segment
        segment baseforms?
  --no-guess
        Don't guess baseforms for unknown words?
  --no-segment-guessed
        Don't guess segmentation information for guessed words (speeds up processing significantly)?
  --process-by
        Analysis unit when processing files (file, paragraph, line) (default=paragraph)?
  --depth
        Analysis depth (0-2, 1=apply machine learned best analysis guessing, 2=include dependency analysis in output) (default 1)?
  --max-edit-distance
        Maximum edit distance for error-correcting unidentified words (default 0)?
  --no-pretty
        Don't pretty print json?
  ...
        files to process (stdin if not given. Will process directories recursively)
  --help
        prints this usage text
```

Installation and running

The LAS binaries at https://github.com/jiemakel/las/releases are actually Java JAR files with a tiny shell script prepended that runs the JAR. Thus, on a UNIX system, the downloaded file should be runnable as is, although it may first need to be made executable (e.g. chmod 0755). You can of course also run the JAR directly with JVM parameters of your own, e.g. java -Xmx2G -jar las --help.
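As a rough illustration of the two ways of starting the tool described above, the following Python sketch only assembles an argument list; it does not run anything. The helper name build_las_command, the ./las path and the ordering of options before file names are assumptions made for this example. The --locale flag and the java -Xmx2G -jar las form are the ones quoted in this README.

```python
import shlex


def build_las_command(command, files=(), locale=None, jar="las", max_heap=None):
    """Build an argument list for one LAS invocation (nothing is executed).

    Hypothetical helper: `jar` is assumed to be the downloaded LAS binary
    sitting in the current directory.
    """
    if max_heap:
        # Run the JAR through java explicitly, e.g. java -Xmx2G -jar las ...
        argv = ["java", f"-Xmx{max_heap}", "-jar", jar]
    else:
        # Rely on the prepended shell script (the file must be executable).
        argv = [f"./{jar}"]
    argv.append(command)
    if locale is not None:
        argv += ["--locale", locale]
    # With no files given, LAS reads from stdin instead.
    argv += list(files)
    return argv


cmd = build_las_command("lemmatize", ["a.txt", "b.txt"], locale="fi", max_heap="2G")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run if LAS and Java are actually installed.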

Recent versions of LAS build multiple binaries, where you can trade functionality for smaller file sizes.

The options are:

* las: complete package including all support for all languages, but weighing in at almost 600 megabytes
* las-fi: complete functionality for (only) Finnish, including edit distance fuzzy analysis for noisy (e.g. OCR errored) data as well as guessed word segmentation for words not in the lexicon (rarely needed)
* las-fi-small: basic functionality for (only) Finnish without fuzzy analysis or segmentation for guessed words, but a much smaller file size
* las-small: supports all languages, but provides only the basic functionality for Finnish
* las-non-fi: supports all languages apart from Finnish

Optimal mode of running

Some of the transducers used by LAS are really quite huge (the biggest two some ~760 megabytes). This is also why the executable package is a whopping 400-900 megabytes (depending on release). This size also means that each time you run the program, initial startup takes a significant time (which you can test by running las --help). After that, however, processing is fast. To use the tool optimally, you should therefore pass LAS as much data in a single run as possible. LAS should be able to efficiently process both large files and a large number of files. Another option is to not give LAS a filename at all, whereby the tool enters a streaming mode, processing input line by line.

When running on files, one should also select the appropriate --process-by mode. The default is to process by file, which is suitable for small files. However, if you have larger files, you should process either by paragraph (if you have such paragraphs, separated by two newlines) or by line, if you know sentences won't cross lines.
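To make the three analysis units concrete, here is a small Python sketch of how a text could be cut up for each --process-by value. It is not LAS's own code; the function name split_units is made up for this illustration, and the paragraph rule simply follows the "separated by two newlines" description above.

```python
def split_units(text, process_by="paragraph"):
    """Split text into analysis units: the whole file, paragraphs, or lines."""
    if process_by == "file":
        return [text]
    if process_by == "paragraph":
        # Paragraphs are taken to be separated by two newlines (a blank line).
        return [p for p in text.split("\n\n") if p.strip()]
    if process_by == "line":
        return [line for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown unit: {process_by}")


sample = "First paragraph,\nsecond line.\n\nSecond paragraph."
for mode in ("file", "paragraph", "line"):
    print(mode, len(split_units(sample, mode)))
```

Note that in line mode the first sentence of the sample is cut in two, which is why that mode is only appropriate when sentences do not cross lines.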

Functionalities

The library is also exposed as a web service at http://demo.seco.tkk.fi/las/ . The documentation that follows is mostly equivalent to the one there, with the exception that http://demo.seco.tkk.fi/las/ has live examples where you can experiment with the different functionalities and inputs.

Language detection

Run by las identify <files> to operate on files, or las identify for stream operation. If run on files, the output will be saved to files with the suffix of .language added to the filename.

Tries to recognize the language of an input. In total, the language detection supports 78 locales, combining results from three sources:

* the language-detector library (locales af, an, ar, ast, be, bg, bn, br, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, ga, gl, gu, he, hi, hr, ht, hu, id, is, it, ja, km, kn, ko, lt, lv, mk, ml, mr, ms, mt, ne, nl, no, oc, pa, pl, pt, ro, ru, sk, sl, so, sq, sr, sv, sw, ta, te, th, tl, tr, uk, ur, vi, yi, zh-CN, zh-TW),
* custom code based on the list of cues at the Wikipedia language recognition chart (locales cs, de, en, es, et, fi, fr, hu, it, pl, pt, ro, ru, sk, sv), and
* finite state transducers provided by the HFST, Omorfi and Giellatekno projects (locales de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm)

Example:

Input: "The quick brown fox jumps over the lazy dog"

Output:

```
{
  "locale" : "en",
  "certainty" : 0.6803500000000001,
  "details" : {
    "languageRecognizerResults" : { "en" : 0.1973 },
    "languageDetectorResults" : [ { "en" : 1.0 } ],
    "hfstAcceptorResults" : [
      { "en" : 0.84375 },
      { "fi" : 0.09375 },
      { "la" : 0.010416666666666666 },
      { "tr" : 0.010416666666666666 },
      { "sv" : 0.010416666666666666 },
      { "sme" : 0.010416666666666666 },
      { "it" : 0.010416666666666666 },
      { "de" : 0.010416666666666666 }
    ]
  }
}
```
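In the example output, the reported certainty coincides with the plain average of the three per-source scores for the winning locale: (0.1973 + 1.0 + 0.84375) / 3. Whether LAS always combines the sources this way is an assumption drawn from this single example, not something the README states. The Python sketch below checks the arithmetic on an abridged copy of that output; the helper score_for is hypothetical.

```python
import json

# Abridged copy of the example output (only two hfstAcceptorResults kept).
output = json.loads("""
{ "locale": "en", "certainty": 0.6803500000000001,
  "details": {
    "languageRecognizerResults": { "en": 0.1973 },
    "languageDetectorResults": [ { "en": 1.0 } ],
    "hfstAcceptorResults": [ { "en": 0.84375 }, { "fi": 0.09375 } ] } }
""")


def score_for(results, locale):
    """Look up a locale's score in a dict or in a list of one-entry dicts."""
    if isinstance(results, dict):
        return results.get(locale, 0.0)
    return next((r[locale] for r in results if locale in r), 0.0)


locale = output["locale"]
scores = [score_for(r, locale) for r in output["details"].values()]
mean = sum(scores) / len(scores)
# True if the mean of the three sources reproduces the reported certainty.
print(locale, scores, abs(mean - output["certainty"]) < 1e-9)
```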

Lemmatization

Run by las lemmatize <files> to operate on files, or las lemmatize for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .lemmatized added to the filename.

Lemmatizes the input into its base form. Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects where available (locales de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm). Snowball stemmers are used for locales dk, es, nl, no, pt and ru (they are not used for de, en, fi, fr, it and sv, which are covered by the transducers).

Note that the quality and scope of the lemmatization varies wildly between languages.

Examples:

Input: "Bobs letters about the missing money from the bank had created a huge kerfuffle"
Output: "bob letter about the miss money from the bank have create a huge kerfuffle"

Input: "Albert osti fagotin ja töräytti puhkuvan melodian maakunnanvoudinvirastossa."
Output: "Albert ostaa fagotti ja töräyttää puhkua melodia maakuntavoutivirasto ."
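The division of labour between transducers and Snowball stemmers described in this section can be summarised as a lookup. The Python sketch below is only a restatement of the locale lists given above; the function name lemmatizer_backend and the backend labels are invented for the illustration and do not correspond to anything in the LAS code base.

```python
# Locale lists copied from the lemmatization description above.
FST_LOCALES = set("de en fi fr it la liv mdf mhr mrj myv sme sv tr udm".split())
SNOWBALL_LOCALES = set("dk es nl no pt ru".split())


def lemmatizer_backend(locale):
    """Return which kind of lemmatizer handles a locale, or None if unsupported."""
    if locale in FST_LOCALES:
        return "fst"        # full lemmatization via finite state transducers
    if locale in SNOWBALL_LOCALES:
        return "snowball"   # stemming only, so output is a stem, not a lemma
    return None


for loc in ("fi", "pt", "ja"):
    print(loc, lemmatizer_backend(loc))
```

Together the two sets cover the 21 locales listed for the lemmatize command in the usage text.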

Morphological analysis

Run by las analyze <files> to operate on files, or las analyze for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .analysis added to the filename.

Gives a morphological analysis of the text. Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Supported locales: de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm

Note that the quality and scope of analysis as well as tags returned vary wildly between languages (and see below for Finnish specifically, which has the most support).

Examples:

Input: "Bobs letters"

Output:

```
[ {
  "word": "Bobs",
  "analysis": [ {
    "weight": 1,
    "wordParts": [ { "lemma": "bob", "tags": { "NN2-VVZ": [ "NN2-VVZ" ] } } ],
    "globalTags": { "BEST_MATCH": [ "TRUE" ] }
  } ]
}, {
  "word": "letters",
  "analysis": [ {
    "weight": 1,
    "wordParts": [ { "lemma": "letter", "tags": { "NN2": [ "NN2" ] } } ],
    "globalTags": { "BEST_MATCH": [ "TRUE" ] }
  } ]
} ]
```

Input: "Albert osti"

Output:

```
[ {
  "word" : "Albert",
  "analysis" : [ {
    "weight" : 0.099609375,
    "wordParts" : [ {
      "lemma" : "Albert",
      "tags" : { "SEGMENT" : [ "Albert" ], "KTN" : [ "5" ], "UPOS" : [ "PROPN" ], "NUM" : [ "SG" ], "PROPER" : [ "LAST" ], "CASE" : [ "NOM" ] }
    } ],
    "globalTags" : { "HEAD" : [ "2" ], "DEPREL" : [ "punct" ], "POS_MATCH" : [ "TRUE" ], "BEST_MATCH" : [ "TRUE" ] }
  }, {
    "weight" : 0.099609375,
    "wordParts" : [ {
      "lemma" : "Albert",
      "tags" : { "SEGMENT" : [ "Albert" ], "KTN" : [ "5" ], "UPOS" : [ "PROPN" ], "NUM" : [ "SG" ], "SEM" : [ "MALE" ], "PROPER" : [ "FIRST" ], "CASE" : [ "NOM" ] }
    } ],
    "globalTags" : { "HEAD" : [ "2" ], "DEPREL" : [ "punct" ], "POS_MATCH" : [ "TRUE" ], "BEST_MATCH" : [ "TRUE" ] }
  } ]
}, {
  "word" : "osti",
  "analysis" : [ {
    "weight" : 0.099609375,
    "wordParts" : [ {
      "lemma" : "ostaa",
      "tags" : { "TENSE" : [ "PAST" ], "SEGMENT" : [ "ost", "{MB}i" ], "KTN" : [ "53" ], "UPOS" : [ "VERB" ], "MOOD" : [ "INDV" ], "PERS" : [ "SG3" ], "INFLECTED_FORM" : [ "V N Nom Sg" ], "VOICE" : [ "ACT" ], "INFLECTED" : [ "ostaminen" ] }
    } ],
    "globalTags" : { "HEAD" : [ "0" ], "DEPREL" : [ "punct" ], "POS_MATCH" : [ "TRUE" ], "BEST_MATCH" : [ "TRUE" ] }
  } ]
} ]
```
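Downstream code typically only needs the preferred lemma of each word from this JSON. The Python sketch below shows one way to pull it out, assuming that the preferred analysis is the one whose globalTags contain BEST_MATCH = TRUE and that the lemma of a multi-part word is the concatenation of its wordParts lemmas. Both are assumptions read off the examples, and best_lemmas is a hypothetical helper name.

```python
import json

# The "Bobs letters" analysis from the example above.
analysis = json.loads("""
[ { "word": "Bobs", "analysis": [ { "weight": 1,
      "wordParts": [ { "lemma": "bob", "tags": { "NN2-VVZ": [ "NN2-VVZ" ] } } ],
      "globalTags": { "BEST_MATCH": [ "TRUE" ] } } ] },
  { "word": "letters", "analysis": [ { "weight": 1,
      "wordParts": [ { "lemma": "letter", "tags": { "NN2": [ "NN2" ] } } ],
      "globalTags": { "BEST_MATCH": [ "TRUE" ] } } ] } ]
""")


def best_lemmas(words):
    """Return one lemma per word, preferring analyses flagged BEST_MATCH."""
    lemmas = []
    for word in words:
        best = [a for a in word["analysis"]
                if "TRUE" in a.get("globalTags", {}).get("BEST_MATCH", [])]
        if best:
            # Several analyses may be flagged; this sketch takes the first.
            lemmas.append("".join(p["lemma"] for p in best[0]["wordParts"]))
        else:
            # No flagged analysis: fall back to the surface form.
            lemmas.append(word["word"])
    return lemmas


print(best_lemmas(analysis))  # ['bob', 'letter']
```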

Inflected form generation

Run by las inflect <files> --forms <forms> to operate on files, or las inflect --forms <forms> for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .inflected added to the filename.

Transforms the text given a set of inflection forms (e.g. V N Nom Sg, N Nom Pl, A Pos Nom Pl), by default also converting words not matching the inflection forms to their base form. This may be useful for example as a pre-processing step when matching text against a vocabulary that has words in it in e.g. plural form.

Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Note that the inflection form syntaxes differ wildly between languages (in practice, it's often easiest to run analysis on an inflected form to discover how to recreate that form).

Supported locales: de, en, fi, fr, it, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm

Examples:

Input: "Bobs letter about the missing money from the bank creates a large kerfuffle", "NN2,VVN,AJS"
Output: "bobs letters about thes misses moneys from thes banks CREATED As largest kerfuffle"

Input: "Albert osti fagotin ja töräytti puhkuvan melodian.", "V N Nom Sg, N Nom Pl, A Pos Nom Pl"
Output: "Albert ostaminen fagotit ja töräyttäminen puhkuminen melodiat ."

Word recognition rate reporting

Run by las recognize <files> to operate on files, or las recognize for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .recognition added to the filename.

Reports the number of words a particular language processor recognizes. This may be useful for e.g. estimating the number of OCR errors in automatically scanned historical newspapers.

Supported locales: de, en, fi, fr, it, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm, la

Examples:

Input: "?l»vatcssaan Satakunnan maanwiljelystotoukscu, joka pidettiin Kautuau tehtaalla Euran pitäjässä"
Output: { "locale" : "fi", "recognized" : 7, "unrecognized" : 3, "rate" : 0.7 }

Input: "B»bs letters about the missiing money from the bank had created a huge kerfussle"
Output: { "locale" : "en", "recognized" : 11, "unrecognized" : 3, "rate" : 0.7857142857142857 }
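In both examples the rate equals recognized / (recognized + unrecognized). When estimating OCR quality over a whole collection, it is usually more meaningful to pool the counts than to average the per-file rates. The Python sketch below does that for the two example outputs; the function name recognition_rate is made up for the illustration, and pooling across different locales is done here only to show the arithmetic.

```python
# The two example outputs from above.
reports = [
    {"locale": "fi", "recognized": 7, "unrecognized": 3, "rate": 0.7},
    {"locale": "en", "recognized": 11, "unrecognized": 3, "rate": 0.7857142857142857},
]


def recognition_rate(recognized, unrecognized):
    """Share of recognized words; 0.0 for empty input."""
    total = recognized + unrecognized
    return recognized / total if total else 0.0


# Each reported rate matches the ratio of its own counts.
for r in reports:
    assert abs(recognition_rate(r["recognized"], r["unrecognized"]) - r["rate"]) < 1e-12

# Pooled rate over all reports: 18 recognized out of 24 words.
pooled = recognition_rate(
    sum(r["recognized"] for r in reports),
    sum(r["unrecognized"] for r in reports),
)
print(pooled)  # 0.75
```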

Hyphenation

Run by las hyphenate <files> to operate on files, or las hyphenate for stream operation. Add --locale [locale] to force a particular locale. If run on files, the output will be saved to files with the suffix of .hyphenated added to the filename.

Hyphenates the given text. Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Those provided by HFST have been automatically translated from the TeX CTAN distribution's hyphenation rulesets.

Supported locales: bg, ca, cop, cs, cy, da, el, es, et, eu, fi, fr, ga, gl, hr, hsb, hu, ia, in, is, it, la, liv, mdf, mhr, mn, mrj, myv, nb, nl, nn, pl, pt, ro, ru, sa, sh, sk, sl, sme, sr, sv, tr, udm, uk, zh

Examples:

Input: "Albert osti fagotin ja töräytti puhkuvan melodian."
Output: "al-bert os-ti fa-go-tin ja tö-räyt-ti puh-ku-van me-lo-dian"

Input: "Månens yta består i stora drag av två olika typer av landskap"
Output: "må-nens y-ta be-står i sto-ra d-rag av två o-li-ka ty-per av lan-d-skap"
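Each command section above names the suffix that is appended to an input file name when LAS runs on files. For scripts that post-process the results, those suffixes can be collected in one place. The following Python sketch does so; the helper output_path is hypothetical, and the assumption that the output is written next to the input file is not stated in this README.

```python
from pathlib import Path

# Suffixes as documented in the command sections above.
OUTPUT_SUFFIX = {
    "identify": ".language",
    "lemmatize": ".lemmatized",
    "analyze": ".analysis",
    "inflect": ".inflected",
    "recognize": ".recognition",
    "hyphenate": ".hyphenated",
}


def output_path(input_path, command):
    """Return the expected output file for an input file and a LAS command."""
    path = Path(input_path)
    # The suffix is added to the full file name; the old extension is kept.
    return path.with_name(path.name + OUTPUT_SUFFIX[command])


print(output_path("corpus/article.txt", "lemmatize").name)  # article.txt.lemmatized
```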

Things to know when using LAS for analyzing Finnish

While LAS supports many languages, the most complete support it has is for Finnish. However, this also makes the functionality complex. Thus, it is useful to delve deeper into what is actually happening.

First, the Finnish analysis is based on a fork of the Omorfi morphological analyzer for Finnish. What the user needs to know about this is that Omorfi normally 1) provides all possible morphological analyses of a word and 2) only works for words that are included in its lexicon and rules.

To this baseline, the functionality in LAS (or the modified Omorfi) adds:

1. support for better sentence splitting and tokenization from Turku NLP
2. support for guessing the most probable of multiple analyses
   1. by using case matching of the initial letter (if not the first word in a sentence)
   2. by using machine learned disambiguation from Turku NLP
   3. by using word class and inflection-based rules
   4. by using word frequency information from the Finnish Wikipedia
3. lemma guessing for words outside the lexicon
4. support for Early Modern Finnish inflection
5. support for edit-distance error correction (by up to 2 steps) in a guessed analysis
6. automatic dehyphenation

Final note: In analysis, Omorfi supports initial capitalization of words, necessitated by needing to analyze first words in a sentence without fuzz. However, nothing else is done. So, pariisi will return only pari as the lemma, and not Pariisi. (As a sidenote, if you actually do want case insensitive matching, you can thus convert every word into initial uppercase, but that will mess with the disambiguation)

Examples of the various rules in action in lemmatization:

* Pariisi -> pari (initial case is ignored for first word in a sentence)
* Pariisissa -> Pariisi (cannot be an inflected form of pari)
* Pariisi on -> Pariisi olla (machine learned disambiguation guesses correctly)
* pariisi on -> pari olla (uppercasing not allowed)
* oli Pariisi -> olla Pariisi (case change not allowed after first word in a sentence)
* oli pariisi -> olla pari (case change not allowed after first word in a sentence)
* kuin -> kuin (instead of kuu, based on word class and inflection rules)
* twiittasin -> tviitata, (guessed, twiittasin for --no-guess)
* Leh>tim»ehen -> Lehtimies for --max-edit-distance 2
* Helsingin -> Helsinki (instead of the last name Helsing, based on Wikipedia frequency)
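To see why the Leh>tim»ehen example needs --max-edit-distance 2, one can count the edits separating it from a clean inflected form. The Python sketch below uses the classic Levenshtein distance and assumes that the intended word is Lehtimiehen. Both are assumptions for the illustration: LAS performs its error correction inside the transducer lookup, and its exact error model is not described here.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# One deletion (">") plus one substitution ("»" -> "i").
print(edit_distance("Leh>tim»ehen", "Lehtimiehen"))  # 2
```

With the default --max-edit-distance 0, no such correction is attempted.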

Evaluation

Below, LAS lemmatisation accuracy on Finnish is compared to the neural network version of the TurkuNLP parser (see https://universaldependencies.org/conll18/results-lemmas.html).

| Dataset | LAS | TurkuNLP-NN |
|---------|-----|-------------|
| FI-FTB | 93.44% | 97.02% |
| FI-PUD | 93.34% | 95.07% |
| FI-TDT | 92.00% | 95.32% |

Contributing

If you encounter problems, open an issue on GitHub. Pull requests are also welcome. If you wish to delve deeper into how the tool works, be aware that this repository contains just one of two front ends. Many more lines of code are contained in the seco-lexicalanalysis repository, which contains the code common to this command line version and the web service version (seco-lexicalanalysis-play). They in turn refer to seco-hfst. In addition, the in-depth work on integrating and expanding the Finnish pipeline included in the tool builds heavily on our omorfi fork.

Owner

Name: Human Sciences – Computing Interaction Research Group
Login: hsci-r
Kind: organization
Location: University of Helsinki

Website: http://heldig.fi/hsci/
Twitter: hsci_research
Repositories: 61
Profile: https://github.com/hsci-r

CodeMeta (codemeta.json)

{
  "@context": "https://raw.githubusercontent.com/mbjones/codemeta/master/codemeta.jsonld",
  "@type": "Code",
  "author": [
    {
      "@id": "http://orcid.org/0000-0002-8366-8414",
      "@type": "Person",
      "email": "eetu.makela@iki.fi",
      "name": "Eetu Mäkelä",
      "affiliation": "Aalto University"
    }
  ],
  "identifier": "http://dx.doi.org/10.5281/zenodo.160256",
  "codeRepository": "https://github.com/jiemakel/las",
  "datePublished": "2016-10-12",
  "dateModified": "2016-10-12",
  "dateCreated": "2016-06-28",
  "description": "Lexical Analysis Command-Line Tool for lemmatizing, lexical analysis, inflected form generation and language identification of multiple languages.",
  "keywords": "NLP, lexical analysis, morphological analysis, lemmatization, language identification",
  "license": "MIT",
  "title": "Lexical Analysis Tool",
  "version": "v1.4.9"
}

Committers

Last synced: 5 months ago

All Time

Total Commits: 62
Total Committers: 1
Avg Commits per committer: 62.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Eetu Mäkelä	e**a@i**i	62

Committer Domains (Top 20 + Academic)

iki.fi: 1

Issues and Pull Requests

Last synced: 4 months ago

All Time

Total issues: 4
Total pull requests: 1
Average time to close issues: 12 months
Average time to close pull requests: about 9 hours
Total issue authors: 3
Total pull request authors: 1
Average comments per issue: 1.75
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

LAS

Science Score: 49.0%
