cecilia

The Cuban Language Model

https://github.com/gia-uh/cecilia

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary

Keywords

language-model llm slm

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: gia-uh
License: mit
Language: TeX
Default Branch: main
Homepage: https://cecilia.uhgia.org/
Size: 4 MB

Statistics

Stars: 26
Watchers: 3
Forks: 0
Open Issues: 1
Releases: 0

Topics

language-model llm slm

Created over 1 year ago · Last pushed 12 months ago

Metadata Files

Readme License Citation

Cecilia is a family of language models continually pretrained on Cuban written text to capture the linguistic, cultural, and social nuances of Cuban Spanish. These models are designed to support natural language processing tasks with a focus on Cuban language varieties and cultural context.

About Cecilia 2B v0.1

This repository contains Cecilia 2B v0.1, a 2 billion parameter model obtained by continually pretraining Salamandra 2B on Cuban written text.

The model was developed by the Artificial Intelligence Research Group (GIA-UH) at the University of Havana in collaboration with the Language Processing and Information Systems Group (GPLSI) at the University of Alicante, with the support of Syalia SRL and Epistemial.

Training Data

Cecilia Tiny was continually pretrained for 2 full epochs on a private corpus of approximately 1000 million (1 billion) tokens of Cuban written text, including:

10 years of the most relevant Cuban newspapers.
The Cuban Encyclopedia (ecured.cu).
The complete collection of Cuban laws.
Over 400 important Cuban literary works.
Several local encyclopedias documenting Cubanisms and cultural elements.
Hundreds of song lyrics from popular Cuban singers.

This diverse dataset ensures that Cecilia captures a rich spectrum of Cuban language, culture, and history.
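As a rough illustration of the training budget described above, the total number of tokens processed follows from the stated corpus size and epoch count. The sketch below uses only the approximate figures quoted in this section; the variable names are illustrative, and the result is an estimate rather than a reported statistic.

```python
# Approximate figures taken from the description above.
CORPUS_TOKENS = 1_000_000_000  # "approximately 1000 million tokens"
EPOCHS = 2                     # "2 full epochs"

# Tokens seen during continual pretraining = corpus size x number of passes.
tokens_seen = CORPUS_TOKENS * EPOCHS

print(f"Approximate tokens processed: {tokens_seen / 1e9:.1f} billion")
```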

Model Architecture and Training

Based on the Salamandra 2B architecture.
Adapted through continual pretraining; this release is not yet instruction-tuned (see Model Details).
Optimized for Cuban Spanish linguistic features and cultural context.

Use Cases

Cecilia can be used for various NLP tasks involving Cuban Spanish, such as:

Text generation and completion.
Sentiment analysis on Cuban social media or literature.
Named entity recognition with Cuban-specific entities.
Machine translation and language understanding tailored to Cuban Spanish.
Research on Cuban linguistic phenomena and cultural studies.

How to Use

You can load and use Cecilia Tiny (2B) v0.1 with the Hugging Face Transformers library. Here is a simple example in Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gia-uh/cecilia-2b-v0.1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example usage
input_text = "¿Cómo están las guaguas en La Habana?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Compatibility

vLLM: Cecilia Tiny can be used with vLLM for efficient inference and serving.
LM Studio: The model is compatible with LM Studio, enabling easy local deployment and experimentation.
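For the vLLM path mentioned above, a minimal sketch of offline inference might look like the following. It assumes vLLM is installed and that the model weights are accessible from your environment; the helper name `generate_with_vllm` and the sampling values are illustrative choices, not settings recommended by the authors.

```python
MODEL_ID = "gia-uh/cecilia-2b-v0.1"  # model id used in the How to Use example


def generate_with_vllm(prompt: str, max_tokens: int = 50) -> str:
    """Generate a completion for `prompt` using vLLM's offline inference API."""
    # Imported lazily so that this module can be loaded without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID)
    # Illustrative sampling settings (assumed, not taken from the source).
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    # One prompt in, so one result; take its first candidate completion.
    return outputs[0].outputs[0].text
```

Calling `generate_with_vllm("¿Cómo están las guaguas en La Habana?")` would load the model and return the generated continuation. Because this release is a base model, prompts work best when phrased as text to be continued rather than as instructions.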

Model Details

The model is currently not quantized. Quantized versions will be released shortly to improve efficiency and reduce resource requirements.
Cecilia Tiny is fine-tuned via continual pretraining on Cuban text but is not yet instruction-tuned. It is optimized for language modeling rather than instruction-following tasks. Instruction-tuned versions will be released soon.
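To give a sense of why quantization reduces resource requirements, the sketch below estimates the memory needed for the weights alone at several precisions. It assumes a nominal count of 2 billion parameters (taken from the model name; the exact count may differ) and ignores activations, the KV cache, and framework overhead. It makes no claim about the precision of the released checkpoint.

```python
# Nominal parameter count from the model name ("2B"); the exact count may differ.
NUM_PARAMS = 2_000_000_000

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Return the approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: ~{weight_memory_gb(NUM_PARAMS, size):.1f} GB")
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is the main saving a quantized release would offer.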

License and Usage

Cecilia Tiny (2B) v0.1 is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This allows both research and commercial use, provided that appropriate credit is given and any derivative works are shared under the same license.

Important: Access to download the model requires a manual review of requests to ensure fair and responsible use aligned with the spirit of the license and the cultural sensitivity of the data. Please submit your request with a brief description of your intended use.
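Once a request has been approved, loading the model will typically require authenticating the download. The sketch below assumes that access is granted through a Hugging Face account and that an access token is stored in the `HF_TOKEN` environment variable; both are assumptions, as is the helper name `load_gated_model`. The `token` argument requires a reasonably recent Transformers release.

```python
import os

MODEL_ID = "gia-uh/cecilia-2b-v0.1"


def load_gated_model(model_id: str = MODEL_ID):
    """Load tokenizer and model, authenticating with a token from the environment."""
    token = os.environ.get("HF_TOKEN")  # assumed environment variable name
    if token is None:
        raise RuntimeError("Set HF_TOKEN to a token for an account with approved access.")

    # Imported lazily so that this module can be loaded without Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
    model = AutoModelForCausalLM.from_pretrained(model_id, token=token)
    return tokenizer, model
```

Reading the token from the environment keeps credentials out of source files and notebooks that may later be shared.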

Ethical Considerations

Cecilia is a powerful language model fine-tuned on Cuban written text, but it is important to recognize its limitations and use it responsibly:

Potential for Errors and Hallucinations: Like all large language models, Cecilia can generate incorrect, misleading, or biased information. It may "hallucinate" facts or produce outputs that do not accurately reflect reality or the nuances of Cuban culture.
Not a Substitute for Professional Advice: Cecilia should not be used for medical, legal, financial, or other professional advice. Outputs should be carefully reviewed by qualified experts before any decision-making.
Bias and Fairness: Despite efforts to curate training data, the model may still reflect biases present in source texts. Users should be aware of potential cultural, social, or linguistic biases and interpret results accordingly.
Privacy and Data Use: The model was trained on publicly available and licensed Cuban texts. Users should respect privacy and copyright laws when applying the model.
Responsible Use: We encourage users to apply Cecilia in ways that respect Cuban culture and society, avoid harm, and promote fairness and inclusivity.
Transparency: Users should clearly communicate when content is generated by Cecilia to avoid confusion or misattribution.

By using Cecilia, you agree to apply it ethically and responsibly, understanding its limitations and the cultural sensitivity embedded in its design.

Citation

If you use Cecilia 2B v0.1 - The Cuban Language Model in your research, please cite it as:

Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Alejandro Piad-Morffis, and Yudivian Almeida-Cruz. (2025). cecilia-2b-v0.1 (Revision 1921f36). Hugging Face. https://huggingface.co/gia-uh/cecilia-2b-v0.1. DOI: 10.57967/hf/5667

If using LaTeX, please use the following BibTeX entry:

@misc{cecilia2b,
  author    = { Ernesto L. Estevanell-Valladares and Suilan Estevez-Velarde and Alejandro Piad-Morffis and Yudivian Almeida-Cruz },
  title     = { Cecilia 2B v0.1 - The Cuban Language Model },
  year      = 2025,
  url       = { https://huggingface.co/gia-uh/cecilia-2b-v0.1 },
  doi       = { 10.57967/hf/5667 },
  publisher = { Hugging Face }
}

Team

The model could not have been created without the commitment and work of members of the GIA-UH and GPLSI groups.

GIA-UH - Ernesto L. Estevanell, Daniel A. Valdés, Roberto Marti, Deborah Famadas, Roberto García, Gabriel Hernández, Elena Rodríguez, Niley González, Alejandro Beltrán, Juan Pablo Consuegra, Suilan Estévez, Alejandro Piad, Yudivián Almeida.

GPLSI - Robiert Sepúlveda, Yoan Gutiérrez, Rafael Muñoz, Andrés Montoyo, Manuel Palomar.

Acknowledgments

We thank all contributors and data providers who made this work possible.

This work was partially funded by the ILENIA-VIVES project 2022/TL22/00215334, and by private funding from Syalia SRL and Epistemial.

Owner

Name: gia-uh
Login: gia-uh
Kind: organization

Repositories: 6
Profile: https://github.com/gia-uh

Citation (CITATION.cff)

cff-version: 1.2.0
title: "Cecilia 2B v0.1 - The Cuban Language Model"
version: "0.1"
message: "If you use Cecilia 2B v0.1 in your research, please cite this work as below."
authors:
  - family-names: Estevanell-Valladares
    given-names: Ernesto L.
  - family-names: Estevez-Velarde
    given-names: Suilan
  - family-names: Piad-Morffis
    given-names: Alejandro
  - family-names: Almeida-Cruz
    given-names: Yudivian
date-released: 2025-05-29
doi: 10.57967/hf/5667
url: https://huggingface.co/gia-uh/cecilia-2b-v0.1
repository-code: https://huggingface.co/gia-uh/cecilia-2b-v0.1
license: CC-BY-SA-4.0
type: software
keywords:
  - language model
  - Cuban Spanish
  - NLP
  - deep learning
  - transformers

GitHub Events

Total

Watch event: 19
Member event: 1
Push event: 24
Public event: 1
Pull request event: 1

Last Year

Watch event: 19
Member event: 1
Push event: 24
Public event: 1
Pull request event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

deborahfam (2)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

pyproject.toml pypi

httpx >=0.28.1
libzim >=3.7.0
markitdown[all] >=0.1.1
pandas >=2.2.3
tabulate >=0.9.0
tiktoken >=0.9.0
tqdm >=4.67.1

uv.lock pypi

anyio 4.9.0
audioop-lts 0.2.1
azure-ai-documentintelligence 1.0.2
azure-core 1.33.0
azure-identity 1.21.0
beautifulsoup4 4.13.4
cecilia 0.1.0
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
click 8.1.8
cobble 0.1.4
colorama 0.4.6
coloredlogs 15.0.1
cryptography 44.0.2
defusedxml 0.7.1
et-xmlfile 2.0.0
exceptiongroup 1.2.2
flatbuffers 25.2.10
h11 0.16.0
httpcore 1.0.9
httpx 0.28.1
humanfriendly 10.0
idna 3.10
isodate 0.7.2
libzim 3.7.0
lxml 5.4.0
magika 0.6.1
mammoth 1.9.0
markdownify 1.1.0
markitdown 0.1.1
mpmath 1.3.0
msal 1.32.0
msal-extensions 1.3.1
numpy 2.2.5
olefile 0.47
onnxruntime 1.21.1
openpyxl 3.1.5
packaging 25.0
pandas 2.2.3
pdfminer-six 20250416
pillow 11.2.1
protobuf 6.30.2
pycparser 2.22
pydub 0.25.1
pyjwt 2.10.1
pyreadline3 3.5.4
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-pptx 1.0.2
pytz 2025.2
regex 2024.11.6
requests 2.32.3
six 1.17.0
sniffio 1.3.1
soupsieve 2.7
speechrecognition 3.14.2
standard-aifc 3.13.0
standard-chunk 3.13.0
sympy 1.13.3
tabulate 0.9.0
tiktoken 0.9.0
tqdm 4.67.1
typing-extensions 4.13.2
tzdata 2025.2
urllib3 2.4.0
xlrd 2.0.1
xlsxwriter 3.2.3
youtube-transcript-api 1.0.3