Repository
The Cuban Language Model
Basic Info
- Host: GitHub
- Owner: gia-uh
- License: mit
- Language: TeX
- Default Branch: main
- Homepage: https://cecilia.uhgia.org/
- Size: 4 MB
Statistics
- Stars: 26
- Watchers: 3
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Cecilia is a family of language models continually pretrained on Cuban written text to capture the linguistic, cultural, and social nuances of Cuban Spanish. The models are designed to support natural language processing tasks with a focus on Cuban language varieties and cultural context.
About Cecilia 2B v0.1
This repository contains Cecilia 2B v0.1, a 2-billion-parameter model obtained by continual pretraining of Salamandra 2B on Cuban written text.
The model is developed by the Artificial Intelligence Research Group (GIA-UH) at the University of Havana, in collaboration with the Language Processing and Information Systems Group (GPLSI) at the University of Alicante, and with the support of Syalia SRL and Epistemial.
Training Data
Cecilia Tiny was continually pretrained for 2 full epochs on a private corpus of approximately 1 billion (1,000 million) tokens of Cuban written text, including:
- 10 years of the most relevant Cuban newspapers.
- The Cuban Encyclopedia (ecured.cu).
- The complete collection of Cuban laws.
- Over 400 important Cuban literary works.
- Several local encyclopedias documenting Cubanisms and cultural elements.
- Hundreds of song lyrics from popular Cuban singers.
This diverse dataset ensures that Cecilia captures a rich spectrum of Cuban language, culture, and history.
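As a rough worked calculation, the figures stated above imply the following training budget. This is only a back-of-the-envelope sketch: the corpus size is approximate, and the parameter count uses the nominal "2B" figure rather than an exact count.

```python
# Back-of-the-envelope training budget from the figures stated above.
corpus_tokens = 1_000_000_000  # "approximately 1000 million tokens" (approximate)
epochs = 2                     # "2 full epochs"
model_params = 2_000_000_000   # nominal "2B" size; the exact count is an assumption

tokens_seen = corpus_tokens * epochs
tokens_per_param = tokens_seen / model_params

print(f"Tokens processed: {tokens_seen:,}")
print(f"Tokens per parameter: {tokens_per_param:.2f}")
```

Under these assumptions, continual pretraining processed about 2 billion tokens, or roughly one token per model parameter.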
Model Architecture and Training
- Based on the Salamandra 2B architecture.
- Trained through continual pretraining on Cuban text; this release is not instruction-tuned.
- Optimized for Cuban Spanish linguistic features and cultural context.
Use Cases
Cecilia can be used for various NLP tasks involving Cuban Spanish, such as:
- Text generation and completion.
- Sentiment analysis on Cuban social media or literature.
- Named entity recognition with Cuban-specific entities.
- Machine translation and language understanding tailored to Cuban Spanish.
- Research on Cuban linguistic phenomena and cultural studies.
How to Use
You can load and use Cecilia Tiny (2B) v0.1 with the Hugging Face Transformers library. Here is a simple example in Python:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gia-uh/cecilia-2b-v0.1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example usage
input_text = "¿Cómo están las guaguas en La Habana?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Compatibility
- vLLM: Cecilia Tiny can be used with vLLM for efficient inference and serving.
- LM Studio: The model is compatible with LM Studio, enabling easy local deployment and experimentation.
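For the vLLM route, a minimal sketch of offline inference might look like the following. It assumes vLLM is installed, that your Hugging Face account has been granted access to the model, and that a suitable GPU is available. The sampling values and the helper name `generate_with_vllm` are illustrative assumptions, not settings recommended by the authors.

```python
from typing import Any, Dict, List

MODEL_ID = "gia-uh/cecilia-2b-v0.1"

# Illustrative sampling settings; these values are assumptions, not author recommendations.
SAMPLING_KWARGS: Dict[str, Any] = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 50}


def generate_with_vllm(prompts: List[str]) -> List[str]:
    """Return one completion per prompt using vLLM's offline inference API."""
    # Imported lazily so this file can be loaded on machines without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=MODEL_ID)
    params = SamplingParams(**SAMPLING_KWARGS)
    outputs = llm.generate(prompts, params)
    # Each result holds a list of candidate completions; keep the first one.
    return [out.outputs[0].text for out in outputs]
```

Calling `generate_with_vllm(["¿Cómo están las guaguas en La Habana?"])` would then return a single-element list with the model's continuation of the prompt.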
Model Details
- The model is currently not quantized. Quantized versions will be released shortly to improve efficiency and reduce resource requirements.
- Cecilia Tiny is fine-tuned via continual pretraining on Cuban text but is not yet instruction-tuned. It is optimized for language modeling rather than instruction-following tasks. Instruction-tuned versions will be released soon.
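Because this release is a base model rather than an instruction-tuned one, tasks such as sentiment analysis are usually framed as text to be continued, not as commands. The sketch below shows one possible few-shot prompt layout. The helper `build_few_shot_prompt`, the field labels, and the example sentences are all hypothetical illustrations and do not come from the Cecilia authors.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Lay out labeled examples plus a new query as plain text for a base model to continue."""
    blocks = [f"Texto: {text}\nSentimiento: {label}" for text, label in examples]
    # The final block leaves the label empty so the model completes it.
    blocks.append(f"Texto: {query}\nSentimiento:")
    return "\n\n".join(blocks)


# Hypothetical labeled examples, written only for illustration.
examples = [
    ("La guagua llegó a tiempo hoy.", "positivo"),
    ("Estuve dos horas esperando en la parada.", "negativo"),
]
prompt = build_few_shot_prompt(examples, "El concierto estuvo buenísimo.")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a single question; the text the model generates after the final `Sentimiento:` is then read as its predicted label.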
License and Usage
Cecilia Tiny (2B) v0.1 is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This allows both research and commercial use, provided that appropriate credit is given and any derivative works are shared under the same license.
Important: Access to download the model requires a manual review of requests to ensure fair and responsible use aligned with the spirit of the license and the cultural sensitivity of the data. Please submit your request with a brief description of your intended use.
Ethical Considerations
Cecilia is a powerful language model fine-tuned on Cuban written text, but it is important to recognize its limitations and use it responsibly:
- Potential for Errors and Hallucinations: Like all large language models, Cecilia can generate incorrect, misleading, or biased information. It may "hallucinate" facts or produce outputs that do not accurately reflect reality or the nuances of Cuban culture.
- Not a Substitute for Professional Advice: Cecilia should not be used for medical, legal, financial, or other professional advice. Outputs should be carefully reviewed by qualified experts before any decision-making.
- Bias and Fairness: Despite efforts to curate training data, the model may still reflect biases present in source texts. Users should be aware of potential cultural, social, or linguistic biases and interpret results accordingly.
- Privacy and Data Use: The model was trained on publicly available and licensed Cuban texts. Users should respect privacy and copyright laws when applying the model.
- Responsible Use: We encourage users to apply Cecilia in ways that respect Cuban culture and society, avoid harm, and promote fairness and inclusivity.
- Transparency: Users should clearly communicate when content is generated by Cecilia to avoid confusion or misattribution.
By using Cecilia, you agree to apply it ethically and responsibly, understanding its limitations and the cultural sensitivity embedded in its design.
Citation
If you use Cecilia 2B v0.1 - The Cuban Language Model in your research, please cite it as:
Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Alejandro Piad-Morffis, and Yudivian Almeida-Cruz. (2025). cecilia-2b-v0.1 (Revision 1921f36). Hugging Face. https://huggingface.co/gia-uh/cecilia-2b-v0.1. DOI: 10.57967/hf/5667
If using LaTeX, please use the following bibTeX entry:
```bibtex
@misc{cecilia2b,
  author    = { Ernesto L. Estevanell-Valladares and Suilan Estevez-Velarde and Alejandro Piad-Morffis and Yudivian Almeida-Cruz },
  title     = { Cecilia 2B v0.1 - The Cuban Language Model },
  year      = 2025,
  url       = { https://huggingface.co/gia-uh/cecilia-2b-v0.1 },
  doi       = { 10.57967/hf/5667 },
  publisher = { Hugging Face }
}
```
Team
The model could not have been created without the commitment and work of the members of the GIA-UH and GPLSI groups.
GIA-UH - Ernesto L. Estevanell, Daniel A. Valdés, Roberto Marti, Deborah Famadas, Roberto García, Gabriel Hernández, Elena Rodríguez, Niley González, Alejandro Beltrán, Juan Pablo Consuegra, Suilan Estévez, Alejandro Piad, Yudivián Almeida.
GPLSI - Robiert Sepúlveda, Yoan Gutiérrez, Rafael Muñoz, Andrés Montoyo, Manuel Palomar.
Acknowledgments
We thank all contributors and data providers who made this work possible.
This work was partially funded by the ILENIA-VIVES project 2022/TL22/00215334, and by private funding from Syalia SRL and Epistemial.
Owner
- Name: gia-uh
- Login: gia-uh
- Kind: organization
- Repositories: 6
- Profile: https://github.com/gia-uh
Citation (CITATION.cff)
cff-version: 1.2.0
title: "Cecilia 2B v0.1 - The Cuban Language Model"
version: "0.1"
message: "If you use Cecilia 2B v0.1 in your research, please cite this work as below."
authors:
  - family-names: Estevanell-Valladares
    given-names: Ernesto L.
  - family-names: Estevez-Velarde
    given-names: Suilan
  - family-names: Piad-Morffis
    given-names: Alejandro
  - family-names: Almeida-Cruz
    given-names: Yudivian
date-released: 2025-05-29
doi: 10.57967/hf/5667
url: https://huggingface.co/gia-uh/cecilia-2b-v0.1
repository-code: https://huggingface.co/gia-uh/cecilia-2b-v0.1
license: CC-BY-SA-4.0
type: software
keywords:
  - language model
  - Cuban Spanish
  - NLP
  - deep learning
  - transformers
GitHub Events
Total
- Watch event: 19
- Member event: 1
- Push event: 24
- Public event: 1
- Pull request event: 1
Last Year
- Watch event: 19
- Member event: 1
- Push event: 24
- Public event: 1
- Pull request event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- deborahfam (2)
Dependencies
- httpx >=0.28.1
- libzim >=3.7.0
- markitdown[all] >=0.1.1
- pandas >=2.2.3
- tabulate >=0.9.0
- tiktoken >=0.9.0
- tqdm >=4.67.1
- anyio 4.9.0
- audioop-lts 0.2.1
- azure-ai-documentintelligence 1.0.2
- azure-core 1.33.0
- azure-identity 1.21.0
- beautifulsoup4 4.13.4
- cecilia 0.1.0
- certifi 2025.1.31
- cffi 1.17.1
- charset-normalizer 3.4.1
- click 8.1.8
- cobble 0.1.4
- colorama 0.4.6
- coloredlogs 15.0.1
- cryptography 44.0.2
- defusedxml 0.7.1
- et-xmlfile 2.0.0
- exceptiongroup 1.2.2
- flatbuffers 25.2.10
- h11 0.16.0
- httpcore 1.0.9
- httpx 0.28.1
- humanfriendly 10.0
- idna 3.10
- isodate 0.7.2
- libzim 3.7.0
- lxml 5.4.0
- magika 0.6.1
- mammoth 1.9.0
- markdownify 1.1.0
- markitdown 0.1.1
- mpmath 1.3.0
- msal 1.32.0
- msal-extensions 1.3.1
- numpy 2.2.5
- olefile 0.47
- onnxruntime 1.21.1
- openpyxl 3.1.5
- packaging 25.0
- pandas 2.2.3
- pdfminer-six 20250416
- pillow 11.2.1
- protobuf 6.30.2
- pycparser 2.22
- pydub 0.25.1
- pyjwt 2.10.1
- pyreadline3 3.5.4
- python-dateutil 2.9.0.post0
- python-dotenv 1.1.0
- python-pptx 1.0.2
- pytz 2025.2
- regex 2024.11.6
- requests 2.32.3
- six 1.17.0
- sniffio 1.3.1
- soupsieve 2.7
- speechrecognition 3.14.2
- standard-aifc 3.13.0
- standard-chunk 3.13.0
- sympy 1.13.3
- tabulate 0.9.0
- tiktoken 0.9.0
- tqdm 4.67.1
- typing-extensions 4.13.2
- tzdata 2025.2
- urllib3 2.4.0
- xlrd 2.0.1
- xlsxwriter 3.2.3
- youtube-transcript-api 1.0.3