mayanv

Hosts a number of bilingual Mayan-Spanish corpora

https://github.com/transducens/mayanv

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.5%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: transducens
  • License: cc0-1.0
  • Language: JavaScript
  • Default Branch: master
  • Size: 1.32 MB
Statistics
  • Stars: 6
  • Watchers: 8
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

MayanV, Mayan-Spanish parallel corpora

This repository contains MayanV, a collection of parallel corpora between several Mayan languages and Spanish. MayanV is introduced in the paper "Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars", accepted at the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024).

Included languages

MayanV includes curated parallel corpora for Spanish and the following Mayan languages spoken in Guatemala and Southern Mexico:

| ISO Code | Language    | Words (Mayan) | Words (Spanish) | Sentences |
|----------|-------------|--------------:|----------------:|----------:|
| acr      | Achi        |         6,994 |           7,657 |     1,343 |
| agu      | Awakatec    |         7,325 |           9,700 |     1,930 |
| cac      | Chuj        |         9,398 |          10,916 |     2,299 |
| itz      | Itza’       |         6,069 |           7,512 |     1,539 |
| ixl      | Ixil        |        10,888 |          12,137 |     2,325 |
| kek      | Q’eqchi’    |        18,529 |          21,835 |     4,133 |
| kjb      | Q’anjob’al  |        18,035 |          18,238 |     3,014 |
| mam      | Mam         |        15,453 |          19,117 |     3,093 |
| poc      | Poqomam     |        18,039 |          21,744 |     3,583 |
| poh      | Poqomchi’   |         6,479 |           7,149 |     1,787 |
| quc      | K’iche’     |        14,468 |          15,474 |     2,632 |
| qum      | Sipakapense |         9,780 |           9,328 |     1,356 |
| ttc      | Tektitek    |        23,571 |          24,896 |     4,022 |
| tzh      | Tzeltal     |       103,309 |         128,659 |    19,846 |
| tzj      | Tz’utujil   |        12,283 |          11,404 |     2,519 |
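To get a feel for how the corpora compare in size, the figures in the table can be summarized with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table above, while the `CORPORA` dictionary and the `summarize` helper are names made up for this example and are not part of the repository.

```python
# Figures copied from the table above: (Mayan words, Spanish words, sentences).
# CORPORA and summarize() are illustrative names, not part of the MayanV repository.
CORPORA = {
    "acr": (6_994, 7_657, 1_343),
    "agu": (7_325, 9_700, 1_930),
    "cac": (9_398, 10_916, 2_299),
    "itz": (6_069, 7_512, 1_539),
    "ixl": (10_888, 12_137, 2_325),
    "kek": (18_529, 21_835, 4_133),
    "kjb": (18_035, 18_238, 3_014),
    "mam": (15_453, 19_117, 3_093),
    "poc": (18_039, 21_744, 3_583),
    "poh": (6_479, 7_149, 1_787),
    "quc": (14_468, 15_474, 2_632),
    "qum": (9_780, 9_328, 1_356),
    "ttc": (23_571, 24_896, 4_022),
    "tzh": (103_309, 128_659, 19_846),
    "tzj": (12_283, 11_404, 2_519),
}


def summarize(corpora):
    """Return total sentence pairs, the largest corpus code, and its share of the total."""
    total_sentences = sum(sentences for _, _, sentences in corpora.values())
    largest = max(corpora, key=lambda code: corpora[code][2])
    share = corpora[largest][2] / total_sentences
    return total_sentences, largest, share


total_sentences, largest, share = summarize(CORPORA)
print(f"{len(CORPORA)} corpora, {total_sentences:,} sentence pairs in total")
print(f"largest corpus: {largest} ({share:.1%} of all sentence pairs)")

# Average sentence length on each side, per language.
for code, (mayan, spanish, sentences) in sorted(CORPORA.items()):
    print(f"{code}: {mayan / sentences:.1f} Mayan / {spanish / sentences:.1f} Spanish words per sentence")
```

The summary makes the size imbalance explicit: Tzeltal (`tzh`) is by far the largest corpus, which may be worth keeping in mind when sampling training data across languages.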

Sources for each corpus are discussed in the article. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language.

Structure

Each language corpus is organized into its respective folder within the repository. Additionally, each language folder contains its own README file providing details about the resources used to create the corpus. Language folders: Achi, Awakatec, Chuj, Itza’, Ixil, Q’eqchi’, Q’anjob’al, Mam, Poqomam, Poqomchi’, K’iche’, Sipakapense, Tektitek, Tzeltal, Tz’utujil.
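As a rough illustration of how a corpus from one of these folders might be consumed, the sketch below pairs up two line-aligned text files. It assumes one sentence per line and the same number of lines on both sides; the file names used in the demo are hypothetical placeholders, so check the README in each language folder for the actual file layout.

```python
import tempfile
from pathlib import Path


def read_parallel(mayan_path, spanish_path):
    """Read two line-aligned files and return a list of (mayan, spanish) pairs.

    Assumption for this sketch: one sentence per line, UTF-8 encoded,
    with line i of one file translating line i of the other.
    """
    mayan = Path(mayan_path).read_text(encoding="utf-8").splitlines()
    spanish = Path(spanish_path).read_text(encoding="utf-8").splitlines()
    if len(mayan) != len(spanish):
        raise ValueError(
            f"files are not aligned: {len(mayan)} vs {len(spanish)} lines"
        )
    return list(zip(mayan, spanish))


# Demo with throwaway files; the names and contents are placeholders,
# not real files or sentences from the repository.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "demo.mayan.txt").write_text(
        "first source line\nsecond source line\n", encoding="utf-8"
    )
    (folder / "demo.es.txt").write_text(
        "primera línea\nsegunda línea\n", encoding="utf-8"
    )
    pairs = read_parallel(folder / "demo.mayan.txt", folder / "demo.es.txt")

print(len(pairs), "sentence pairs read")
```

Raising an error on a line-count mismatch is a deliberate choice here: silently truncating to the shorter file would hide alignment problems.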

Acknowledgments

MayanV has been produced as part of the R+D+i project "Lightweight neural translation technologies for low-resource languages" (LiLowLa) (PID2021-127999NB-I00), funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund "A way to make Europe".

License

These data are released under the following licensing scheme:

* We do not own any of the text from which these data have been extracted.
* We license the actual packaging of these parallel data under the Creative Commons CC0 license ("no rights reserved").

Citing this work

If you use this dataset as part of your developments, please cite it as follows:

@inproceedings{lou-etal-2024-curated,
    title = "Curated Datasets and Neural Models for Machine Translation of Informal Registers between {M}ayan and {S}panish Vernaculars",
    author = "Lou, Andr{\'e}s and
      P{\'e}rez-Ortiz, Juan Antonio and
      S{\'a}nchez-Mart{\'\i}nez, Felipe and
      S{\'a}nchez-Cartagena, V{\'\i}ctor",
    editor = "Duh, Kevin and
      Gomez, Helena and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.156",
    pages = "2838--2850",
}

A CITATION.cff file is also included in this repository.

Owner

  • Name: Transducens
  • Login: transducens
  • Kind: organization
  • Email: info.transducens@dlsi.ua.es
  • Location: Departament de Llenguatges i Sistemes Informàtics Universitat d’Alacant 03690 Sant Vicent del Raspeig, Spain

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: MayanV
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Andrés
    family-names: Lou
    email: and_lou@ua.es
    affiliation: Universitat d'Alacant
  - given-names: Felipe
    family-names: Sánchez Martínez
    email: fsanchez@dlsi.ua.es
    affiliation: Universitat d'Alacant
  - given-names: Víctor
    name-particle: M
    family-names: Sánchez Cartagena
    email: vmsanchez@dlsi.ua.es
    affiliation: Universitat d'Alacant
  - given-names: Juan Antonio
    family-names: Pérez Ortiz
    email: japerez@dlsi.ua.es
    affiliation: Universitat d'Alacant
repository-code: 'https://github.com/transducens/mayanv'
url: 'https://transducens.github.io/nmt-maya/'
abstract: >-
  The Mayan languages comprise a language family with an
  ancient history, millions of speakers, and immense
  cultural value, that, nevertheless, remains severely
  underrepresented in terms of resources and global
  exposure. In this paper we develop, curate, and publicly
  release a set of corpora in several Mayan languages spoken
  in Guatemala and Southern Mexico, which we call MayanV.
  The datasets are parallel with Spanish, the dominant
  language of the region, and are taken from official native
  sources focused on representing informal, day-to-day, and
  non-domain-specific language. As such, and according to
  our dialectometric analysis, they differ in register from
  most other available resources. Additionally, we present
  and release neural machine translation models, trained on
  as many resources and Mayan languages as possible, and
  evaluated exclusively on our datasets. We observe lexical
  divergences between the dialects of Spanish used in the
  resources we present and the more widespread written
  standard of Spanish, and that resources other than the
  ones we present do not seem to improve translation
  performance, indicating that many such resources may not
  accurately capture common, real-life language usage.
keywords:
  - naacl2024
  - mayan_corpora
  - mayanv
  - low_resource_languages
license: CC0-1.0
commit: b6029f356b32c3d320d6adf21700eaa34edcfa7b
version: '1.0'
date-released: '2024-03-22'

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2