idiomata_cognitor
Language identifier for Romance languages
Repository
Basic Info
- Host: GitHub
- Owner: transducens
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 67.5 MB
Statistics
- Stars: 0
- Watchers: 8
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Idiomata Cognitor
Idiomata Cognitor is a highly accurate multilingual language classifier for a set of Romance languages, trained with Bayesian methods. It complements general-purpose language detectors by offering finer-grained classification within the Romance family.
Description
The classifier is able to identify the following languages and language variants: Aragonese, Occitan, Aranese (variant of Occitan spoken in the Aran Valley), Asturian, Catalan, French, Galician, Italian, Portuguese and Spanish.
The model was trained on fragments from the Wikimedia and WikiMatrix corpora, except for Aranese, for which the literary corpus from PILAR was used.
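The README states only that the model is trained with Bayesian methods; it does not describe the features or the estimator. As a rough illustration of how a Bayesian language identifier can work, the sketch below implements a toy multinomial naive Bayes classifier over character trigrams with add-one smoothing. The class name, the trigram features, and the two-sentence training set are all assumptions made for this example, not the project's implementation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageId:
    """Toy multinomial naive Bayes over character trigrams (illustrative only)."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.ngram_counts[label].update(char_ngrams(sentence))
        self.vocabulary = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def predict(self, sentence):
        total_sentences = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus the sum of add-one-smoothed log likelihoods.
            score = math.log(self.label_counts[label] / total_sentences)
            denominator = sum(counts.values()) + vocab_size
            for gram in char_ngrams(sentence):
                score += math.log((counts[gram] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical two-sentence training set, far too small for real use.
model = NaiveBayesLanguageId().fit(
    ["el gato come pescado", "le chat mange du poisson"],
    ["spa", "fra"],
)
print(model.predict("el gato"))
print(model.predict("le chat"))
```

Character n-grams are a common choice for separating closely related languages because they capture spelling patterns without requiring a tokenizer or a word list.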
The classification report emitted by the classifier on a multilingual joint corpus of FLORES+ dev sets is as follows:
```
Accuracy: 0.9763289869608827
             precision    recall  f1-score sentences

     Spanish      0.95      0.98      0.96       997
     Catalan      1.00      0.99      0.99       997
   Aragonese      0.96      0.99      0.97       997
     Aranese      0.96      0.94      0.95       997
     Occitan      0.94      0.96      0.95       997
    Asturian      0.99      0.92      0.95       997
    Galician      0.98      0.99      0.98       997
     Italian      1.00      1.00      1.00       997
      French      1.00      1.00      1.00       997
  Portuguese      1.00      0.98      0.99       997

    accuracy                          0.98      9970
   macro avg      0.98      0.98      0.98      9970
weighted avg      0.98      0.98      0.98      9970
```
As of 08/02/2024, the FLORES+ versions of Aragonese and Aranese are not published. They will be released soon as a result of the EMNLP 2024 Shared Task "Translation into Low-Resource Languages of Spain".
Note that the median length of the sentences in the FLORES+ dev set is 22 words. Results may differ for shorter sentences.
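To judge whether your own input is comparable to the FLORES+ dev set, you can check its median sentence length in words. The following is a minimal sketch: the helper name `median_word_count` and the whitespace tokenization are assumptions, since the README does not say how the 22-word figure was computed.

```python
import statistics


def median_word_count(sentences):
    """Return the median number of whitespace-separated words per sentence."""
    return statistics.median(len(sentence.split()) for sentence in sentences)


# Hypothetical input; in practice, read the sentences from your own file.
sample = [
    "Bon dia",
    "Buenos días a todos",
    "Bonjour à tous les amis",
]
print(median_word_count(sample))
```

If the median is well below 22, treat the scores in the report above as optimistic for your data.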
Install
Clone the repository and install the dependencies:
git clone https://github.com/transducens/idiomata_cognitor.git
cd idiomata_cognitor
pip install -r requirements.txt
If you would like to use our trained model, you will need to unzip it.
Usage
To use the classification script, provide the sentences to be identified via standard input and pass the model as an argument. The output consists of each input sentence followed by its language identifier, separated by a tab.
For example, if you have a list of sentences in the file input.txt, you can use the following command:
cat input.txt | python lang_identification.py --model model.pkl
The output will be in the format:
sentence1 language1
sentence2 language2
...
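Because the output is tab-separated, it is straightforward to post-process. The sketch below parses such lines and counts the sentences per language. The function name `parse_output` and the labels `cat` and `spa` in the sample are assumptions for illustration; check the identifiers that the script actually emits.

```python
from collections import Counter


def parse_output(lines):
    """Yield (sentence, language) pairs from tab-separated classifier output."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so that tabs inside the sentence are preserved.
        sentence, language = line.rsplit("\t", 1)
        yield sentence, language


# Hypothetical output lines; in practice, iterate over sys.stdin or a file.
sample_output = [
    "Bon dia a tothom\tcat\n",
    "Buenos días a todos\tspa\n",
    "Bona nit\tcat\n",
]
pairs = list(parse_output(sample_output))
language_counts = Counter(language for _, language in pairs)
print(language_counts.most_common())
```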
Training
You can use the training script and monolingual corpora to train your own classifier. The script will divide the provided corpora into 70% for training and 30% for testing.
python lang_identification_train.py \
--spa spanish_monolingual_corpus.txt \
--cat catalan_monolingual_corpus.txt \
--arg aragonese_monolingual_corpus.txt \
--arn aranese_monolingual_corpus.txt \
--oci occitan_monolingual_corpus.txt \
--ast asturian_monolingual_corpus.txt \
--ita italian_monolingual_corpus.txt \
--glg galician_monolingual_corpus.txt \
--fra french_monolingual_corpus.txt \
--por portuguese_monolingual_corpus.txt \
--output-model your_model.pkl
Once training is complete, the script produces a classification report similar to the one shown in the Description section above. The report is computed on the 30% of the corpora held out for testing.
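The 70/30 division can be pictured with the following minimal sketch. It is an assumption-laden illustration, not the training script's code: the function name `split_corpus`, the shuffling step, and the fixed seed are hypothetical, and the README does not say whether the script shuffles or splits each language separately.

```python
import random


def split_corpus(sentences, train_fraction=0.7, seed=0):
    """Shuffle the sentences and split them into (train, test) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical ten-sentence corpus for one language.
corpus = [f"sentence {i}" for i in range(10)]
train, test = split_corpus(corpus)
print(len(train), len(test))  # 7 3
```

Holding the test portion out of training is what makes the reported precision, recall, and F1 an estimate of performance on unseen sentences.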
Citing this work
If you use this tool in your work, please cite it as follows:
@misc{idiomatacognitor,
author = {Galiano-Jiménez, Aarón and Sánchez-Martínez, Felipe and Pérez-Ortiz, Juan Antonio},
title = {Idiomata Cognitor},
url = {https://github.com/transducens/idiomata_cognitor},
year = {2024}
}
A CITATION.cff file is also included in this repository.
Acknowledgements
This tool has been produced as part of the research project "Lightweight neural translation technologies for low-resource languages" (LiLowLa) (PID2021-127999NB-I00), funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund "A way to make Europe".
Owner
- Name: Transducens
- Login: transducens
- Kind: organization
- Email: info.transducens@dlsi.ua.es
- Location: Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant 03690 Sant Vicent del Raspeig, Spain
- Website: http://transducens.dlsi.ua.es/
- Repositories: 26
- Profile: https://github.com/transducens