Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.5%) to scientific vocabulary
Repository
Hosts a number of bilingual Mayan-Spanish corpora
Basic Info
- Host: GitHub
- Owner: transducens
- License: cc0-1.0
- Language: JavaScript
- Default Branch: master
- Size: 1.32 MB
Statistics
- Stars: 6
- Watchers: 8
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
MayanV, Mayan-Spanish parallel corpora
This repository contains MayanV, a collection of parallel corpora between several Mayan languages and Spanish. MayanV is introduced in the paper "Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars", accepted at the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL2024.
Included languages
MayanV includes curated parallel corpora for Spanish and the following Mayan languages spoken in Guatemala and Southern Mexico:
| ISO Code | Language    | Words (Mayan) | Words (Spanish) | Sentences |
|----------|-------------|--------------:|----------------:|----------:|
| acr      | Achi        |         6,994 |           7,657 |     1,343 |
| agu      | Awakatec    |         7,325 |           9,700 |     1,930 |
| cac      | Chuj        |         9,398 |          10,916 |     2,299 |
| itz      | Itza’       |         6,069 |           7,512 |     1,539 |
| ixl      | Ixil        |        10,888 |          12,137 |     2,325 |
| kek      | Q’eqchi’    |        18,529 |          21,835 |     4,133 |
| kjb      | Q’anjob’al  |        18,035 |          18,238 |     3,014 |
| mam      | Mam         |        15,453 |          19,117 |     3,093 |
| poc      | Poqomam     |        18,039 |          21,744 |     3,583 |
| poh      | Poqomchi’   |         6,479 |           7,149 |     1,787 |
| quc      | K’iche’     |        14,468 |          15,474 |     2,632 |
| qum      | Sipakapense |         9,780 |           9,328 |     1,356 |
| ttc      | Tektitek    |        23,571 |          24,896 |     4,022 |
| tzh      | Tzeltal     |       103,309 |         128,659 |    19,846 |
| tzj      | Tz’utujil   |        12,283 |          11,404 |     2,519 |
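The counts in the table can be summarised with a few lines of Python. The sketch below is illustrative only and is not part of the repository: the `STATS` dictionary and the `words_per_sentence` helper are names introduced here, and the numbers are copied by hand from the table. It derives the average sentence length per language and the share of sentence pairs contributed by Tzeltal, which is by far the largest corpus (a bit over a third of all pairs).

```python
# Counts copied from the table above (not read from the repository):
# ISO code -> (Mayan words, Spanish words, sentence pairs)
STATS = {
    "acr": (6994, 7657, 1343),
    "agu": (7325, 9700, 1930),
    "cac": (9398, 10916, 2299),
    "itz": (6069, 7512, 1539),
    "ixl": (10888, 12137, 2325),
    "kek": (18529, 21835, 4133),
    "kjb": (18035, 18238, 3014),
    "mam": (15453, 19117, 3093),
    "poc": (18039, 21744, 3583),
    "poh": (6479, 7149, 1787),
    "quc": (14468, 15474, 2632),
    "qum": (9780, 9328, 1356),
    "ttc": (23571, 24896, 4022),
    "tzh": (103309, 128659, 19846),
    "tzj": (12283, 11404, 2519),
}


def words_per_sentence(code):
    """Return (Mayan, Spanish) average words per sentence for one language."""
    mayan_words, spanish_words, sentences = STATS[code]
    return mayan_words / sentences, spanish_words / sentences


# Total number of sentence pairs across all fifteen corpora.
total_sentences = sum(sentences for _, _, sentences in STATS.values())

# Fraction of all sentence pairs that belong to the Tzeltal corpus.
tzh_share = STATS["tzh"][2] / total_sentences

if __name__ == "__main__":
    for code in sorted(STATS):
        mayan_avg, spanish_avg = words_per_sentence(code)
        print(f"{code}: {mayan_avg:.2f} Mayan / {spanish_avg:.2f} Spanish words per sentence")
    print(f"total sentence pairs: {total_sentences}")
    print(f"Tzeltal share: {tzh_share:.1%}")
```

The short average sentence lengths are consistent with the informal, day-to-day register described for these corpora.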
Sources for each corpus are discussed in the article. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language.
Structure
Each language corpus is organized into its respective folder within the repository. Additionally, each language folder contains its own README file providing details about the resources used to create the corpus. Language folders: Achi, Awakatec, Chuj, Itza’, Ixil, Q’eqchi’, Q’anjob’al, Mam, Poqomam, Poqomchi’, K’iche’, Sipakapense, Tektitek, Tzeltal, Tz’utujil.
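As a minimal sketch of how one language folder might be consumed, the Python below pairs two line-aligned text files into (Mayan, Spanish) sentence tuples. The file layout is an assumption: this page does not state how files inside each folder are named or split, so the function `read_parallel` and the example paths are hypothetical, and the README in each language folder should be consulted for the actual layout.

```python
from pathlib import Path


def read_parallel(mayan_path, spanish_path):
    """Pair two line-aligned UTF-8 files into (Mayan, Spanish) tuples.

    Assumes one sentence per line and that line i of one file is the
    translation of line i of the other (an assumption, not a documented
    property of the repository).
    """
    mayan = Path(mayan_path).read_text(encoding="utf-8").splitlines()
    spanish = Path(spanish_path).read_text(encoding="utf-8").splitlines()
    if len(mayan) != len(spanish):
        raise ValueError(
            f"line counts differ: {len(mayan)} (Mayan) vs {len(spanish)} (Spanish)"
        )
    return list(zip(mayan, spanish))


if __name__ == "__main__":
    # Hypothetical file names; replace with the real paths in a language folder.
    pairs = read_parallel("Achi/train.acr", "Achi/train.es")
    print(len(pairs), "sentence pairs")
```

Raising an error on mismatched line counts is a deliberate choice: a silent `zip` would truncate to the shorter file and hide misaligned data.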
Acknowledgments
MayanV has been produced as part of the R+D+i project "Lightweight neural translation technologies for low-resource languages" (LiLowLa) (PID2021-127999NB-I00), funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033), and the European Regional Development Fund "A way to make Europe".
License
These data are released under the following licensing scheme:
- We do not own any of the text from which these data have been extracted.
- We license the actual packaging of these parallel data under the Creative Commons CC0 license ("no rights reserved").
Citing this work
If you use this dataset in your work, please cite it as follows:
@inproceedings{lou-etal-2024-curated,
title = "Curated Datasets and Neural Models for Machine Translation of Informal Registers between {M}ayan and {S}panish Vernaculars",
author = "Lou, Andr{\'e}s and
P{\'e}rez-Ortiz, Juan Antonio and
S{\'a}nchez-Mart{\'\i}nez, Felipe and
S{\'a}nchez-Cartagena, V{\'\i}ctor",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.156",
pages = "2838--2850",
}
A CITATION.cff file is also included in this repository.
Owner
- Name: Transducens
- Login: transducens
- Kind: organization
- Email: info.transducens@dlsi.ua.es
- Location: Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant 03690 Sant Vicent del Raspeig, Spain
- Website: http://transducens.dlsi.ua.es/
- Repositories: 26
- Profile: https://github.com/transducens
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: MayanV
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Andrés
    family-names: Lou
    email: and_lou@ua.es
    affiliation: Universitat d'Alacant
  - given-names: Felipe
    family-names: Sánchez Martínez
    email: fsanchez@dlsi.ua.es
    affiliation: Universitat d'Alacant
  - given-names: Víctor
    name-particle: M
    family-names: Sánchez Cartagena
    email: vmsanchez@dlsi.ua.es
    affiliation: Universitat d'Alacant
  - given-names: Juan Antonio
    family-names: Pérez Ortiz
    email: japerez@dlsi.ua.es
    affiliation: Universitat d'Alacant
repository-code: 'https://github.com/transducens/mayanv'
url: 'https://transducens.github.io/nmt-maya/'
abstract: >-
  The Mayan languages comprise a language family with an
  ancient history, millions of speakers, and immense
  cultural value, that, nevertheless, remains severely
  underrepresented in terms of resources and global
  exposure. In this paper we develop, curate, and publicly
  release a set of corpora in several Mayan languages spoken
  in Guatemala and Southern Mexico, which we call MayanV.
  The datasets are parallel with Spanish, the dominant
  language of the region, and are taken from official native
  sources focused on representing informal, day-to-day, and
  non-domain-specific language. As such, and according to
  our dialectometric analysis, they differ in register from
  most other available resources. Additionally, we present
  and release neural machine translation models, trained on
  as many resources and Mayan languages as possible, and
  evaluated exclusively on our datasets. We observe lexical
  divergences between the dialects of Spanish used in the
  resources we present and the more widespread written
  standard of Spanish, and that resources other than the
  ones we present do not seem to improve translation
  performance, indicating that many such resources may not
  accurately capture common, real-life language usage.
keywords:
  - naacl2024
  - mayan_corpora
  - mayanv
  - low_resource_languages
license: CC0-1.0
commit: b6029f356b32c3d320d6adf21700eaa34edcfa7b
version: '1.0'
date-released: '2024-03-22'
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2