idiomata_cognitor

Language identifier for Romance languages

https://github.com/transducens/idiomata_cognitor

Repository

Basic Info
  • Host: GitHub
  • Owner: transducens
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 67.5 MB
Statistics
  • Stars: 0
  • Watchers: 8
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 1 year ago

README.md

Idiomata Cognitor

Idiomata Cognitor is a highly accurate multilingual language classifier for a set of Romance languages, trained with Bayesian methods. It complements general-purpose language detectors by offering finer-grained classification within the Romance family.

Description

The classifier is able to identify the following languages and language variants: Aragonese, Occitan, Aranese (variant of Occitan spoken in the Aran Valley), Asturian, Catalan, French, Galician, Italian, Portuguese and Spanish.

The model was trained on fragments from the Wikimedia and WikiMatrix corpora, except for Aranese, for which the literary corpus from PILAR was used.
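The README states only that the model was trained with Bayesian methods; it does not describe the features or the estimator. As a rough illustration of how a Bayesian language identifier can work, the sketch below implements a multinomial naive Bayes classifier over character trigrams with add-one smoothing. Everything in it (the `char_ngrams` helper, the `NaiveBayesLangId` class, the toy sentences and the `es`/`fr` labels) is hypothetical and is not taken from this repository.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLangId:
    """Illustrative multinomial naive Bayes over character n-grams.

    This is a generic textbook model, NOT the project's actual classifier.
    """

    def fit(self, sentences, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.label_counts = Counter(labels)       # label -> number of sentences
        for sentence, label in zip(sentences, labels):
            self.ngram_counts[label].update(char_ngrams(sentence))
        # Shared vocabulary, needed for add-one (Laplace) smoothing.
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        scores = {}
        for label, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)  # log prior
            for gram in char_ngrams(sentence):
                # Smoothed log likelihood; unseen n-grams count as zero.
                score += math.log((counts[gram] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data with made-up labels, only to show the call pattern.
toy_sentences = [
    "el gato come pescado",
    "la casa es grande",
    "le chat mange du poisson",
    "la maison est grande",
]
toy_labels = ["es", "es", "fr", "fr"]
toy_model = NaiveBayesLangId().fit(toy_sentences, toy_labels)
print(toy_model.predict("el gato es grande"))
```

Character n-grams are a common choice for telling closely related languages apart because they capture spelling patterns without needing a word list, but whether this project uses them is not stated in the README.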

The classification report emitted by the classifier on a multilingual joint corpus of FLORES+ dev sets is as follows:

```
Accuracy: 0.9763289869608827

              precision    recall  f1-score  sentences

     Spanish       0.95      0.98      0.96        997
     Catalan       1.00      0.99      0.99        997
   Aragonese       0.96      0.99      0.97        997
     Aranese       0.96      0.94      0.95        997
     Occitan       0.94      0.96      0.95        997
    Asturian       0.99      0.92      0.95        997
    Galician       0.98      0.99      0.98        997
     Italian       1.00      1.00      1.00        997
      French       1.00      1.00      1.00        997
  Portuguese       1.00      0.98      0.99        997

    accuracy                           0.98       9970
   macro avg       0.98      0.98      0.98       9970
weighted avg       0.98      0.98      0.98       9970
```

As of 08/02/2024, the FLORES+ versions of Aragonese and Aranese are not published. They will be released soon as a result of the EMNLP 2024 Shared Task "Translation into Low-Resource Languages of Spain".

Note that the median sentence length in the FLORES+ dev set is 22 words, so results may differ on shorter inputs.
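For readers less familiar with the columns of the report above, the following sketch recomputes the F1 score of each language from its precision and recall, plus the macro averages. The precision and recall values are copied from the report, where they are already rounded to two decimals, so the recomputed figures are approximations. The names `REPORT` and `f1_score` are illustrative only.

```python
# (precision, recall) pairs copied from the report, rounded to two decimals.
REPORT = {
    "Spanish": (0.95, 0.98),
    "Catalan": (1.00, 0.99),
    "Aragonese": (0.96, 0.99),
    "Aranese": (0.96, 0.94),
    "Occitan": (0.94, 0.96),
    "Asturian": (0.99, 0.92),
    "Galician": (0.98, 0.99),
    "Italian": (1.00, 1.00),
    "French": (1.00, 1.00),
    "Portuguese": (1.00, 0.98),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_language = {lang: f1_score(p, r) for lang, (p, r) in REPORT.items()}

# Macro averages weight every language equally, regardless of corpus size.
macro_f1 = sum(f1_by_language.values()) / len(f1_by_language)
macro_recall = sum(r for _, r in REPORT.values()) / len(REPORT)

for lang, f1 in f1_by_language.items():
    print(f"{lang:<12}{f1:.2f}")
print(f"{'macro F1':<12}{macro_f1:.2f}")
print(f"{'macro rec.':<12}{macro_recall:.3f}")
```

Because every language contributes the same number of sentences (997), overall accuracy equals macro-averaged recall. The value recomputed from the rounded recalls is therefore close to the reported accuracy, though not identical to it.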

Install

Clone the repository and install the dependencies:

    git clone https://github.com/transducens/idiomata_cognitor.git
    cd idiomata_cognitor
    pip install -r requirements.txt

If you would like to use our trained model, you will need to unzip it.

Usage

To use the classification script, provide the sentences to be identified on standard input and pass the model to use as an argument. The output consists of each input sentence followed by its language identifier, separated by a tab.

For example, if you have a list of sentences in the file input.txt, you can use the following command:

    cat input.txt | python lang_identification.py --model model.pkl

The output will be in the format:

    sentence1	language1
    sentence2	language2
    ...
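If the output is captured for further processing, it can be split back into sentence and label pairs. The sketch below assumes one tab-separated pair per line, as described above. The function name `parse_predictions` and the labels `lang_a` and `lang_b` are placeholders, since the README does not show the exact identifier strings the script emits.

```python
from collections import Counter


def parse_predictions(output_text):
    """Split captured output into (sentence, label) pairs.

    Splits on the LAST tab of each line, so a sentence that itself
    contains tabs is kept intact.
    """
    pairs = []
    for line in output_text.split("\n"):
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        if "\t" not in line:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        sentence, _, label = line.rpartition("\t")
        pairs.append((sentence, label))
    return pairs


# Placeholder output; real labels depend on the model being used.
sample_output = "first sentence\tlang_a\nsecond sentence\tlang_b\nthird\tlang_a\n"
predictions = parse_predictions(sample_output)
label_counts = Counter(label for _, label in predictions)
print(predictions)
print(label_counts)
```

Counting labels this way gives a quick overview of the language mix in a file.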

Training

You can use the training script and monolingual corpora to train your own classifier. The script will divide the provided corpora into 70% for training and 30% for testing.

    python lang_identification_train.py \
        --spa spanish_monolingual_corpus.txt \
        --cat catalan_monolingual_corpus.txt \
        --arg aragonese_monolingual_corpus.txt \
        --arn aranese_monolingual_corpus.txt \
        --oci occitan_monolingual_corpus.txt \
        --ast asturian_monolingual_corpus.txt \
        --ita italian_monolingual_corpus.txt \
        --glg galician_monolingual_corpus.txt \
        --fra french_monolingual_corpus.txt \
        --por portuguese_monolingual_corpus.txt \
        --output-model your_model.pkl

Once training is complete, the script produces a classification report similar to the one shown in the Description section above, computed over the 30% of the corpora reserved for testing.
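The 70/30 division can be pictured with a short sketch. How the training script actually performs the split (shuffling, seeding, per-language or global) is not documented in the README, so the function below, its name `split_corpus` and its parameters are assumptions made for illustration.

```python
import random


def split_corpus(sentences, train_fraction=0.7, seed=0):
    """Shuffle a list of sentences and split it into train and test parts.

    A fixed seed makes the split reproducible between runs.
    """
    shuffled = list(sentences)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder sentences standing in for one monolingual corpus.
corpus = [f"sentence {i}" for i in range(10)]
train_part, test_part = split_corpus(corpus)
print(len(train_part), len(test_part))
```

Splitting each language separately, as sketched here for a single corpus, keeps the language proportions the same in both parts. Whether the script does this is not stated.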

Citing this work

If you use this tool in your work, please cite it as follows:

    @misc{idiomatacognitor,
      author = {Galiano-Jiménez, Aarón and Sánchez-Martínez, Felipe and Pérez-Ortiz, Juan Antonio},
      title = {Idiomata Cognitor},
      url = {https://github.com/transducens/idiomata_cognitor},
      year = {2024}
    }

A CITATION.cff file is also included in this repository.

Acknowledgements

This tool was produced as part of the research project "Lightweight neural translation technologies for low-resource languages" (LiLowLa) (PID2021-127999NB-I00), funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund "A way to make Europe".

Owner

  • Name: Transducens
  • Login: transducens
  • Kind: organization
  • Email: info.transducens@dlsi.ua.es
  • Location: Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant 03690 Sant Vicent del Raspeig, Spain