tango-evals
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (2.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: sandbox-ai
- License: MIT
- Language: Python
- Default Branch: main
- Size: 24.3 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Tango Evals
Repository for reproducing the laleaderboard evaluations with the Tango-70b model.
Results
Comparison with other models on laleaderboard_es
| Model | Average | AQuAS | Belebele Spa | ClinDiagnosES | ClinTreatES | COPA_es | Crows Pairs Spanish | EsCoLA | Fake News ES | HumorQA | MGSM_es | NoticIA | OffendES | OpenBookQA_es | PAWS-X_es | RagQuAS | SpaLawEx | TELEIA | WNLI ES | XL-Sum_es | XNLI_es | XQuAD_es | xStoryCloze_es | Precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tango-70b | 59.90 | 75.78 | 92.00 | 65.72 | 63.43 | 89.60 | 55.96 | 71.79 | 76.57 | 25.49 | 32.40 | 0.86 | 72.64 | 34.80 | 70.95 | 79.87 | 51.26 | 61.90 | 77.46 | 19.71 | 52.37 | 75.16 | 74.72 | bf4 |
| google/gemma-2-9b-it | 33.62 | 85.93 | 86.22 | 83.19 | 81.42 | 78.80 | 17.96 | 34.52 | 62.94 | 45.10 | 0 | 34.11 | 64.52 | 9.33 | 27.60 | 88.01 | 30.53 | 35.72 | 52.11 | 0 | 24.28 | 62.29 | 35.01 | bfloat16 |
| google/gemma-2-9b | 32.97 | 83.02 | 83.26 | 77.77 | 80.93 | 68.80 | 13.59 | 28.79 | 16.00 | 45.10 | 4.80 | 0.23 | 66.33 | 12.00 | 24.70 | 86.79 | 5.88 | 35.72 | 4.23 | 0 | 29.76 | 75.33 | 47.98 | |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 30.23 | 85.31 | 83.56 | 81.75 | 73.40 | 72.00 | 6.03 | 24.24 | 60.14 | 37.25 | 0 | 28.71 | 57.00 | 12.00 | 33.20 | 88.62 | 19.33 | 21.43 | 32.39 | 0 | 25.30 | 69.94 | 35.54 | bfloat16 |
| Qwen/Qwen2.5-7B | 27.61 | 85.37 | 84.89 | 79.25 | 81.90 | 62.00 | 8.81 | 20.72 | 42.66 | 45.10 | 5.20 | 3.93 | 67.03 | 10.67 | 29.60 | 90.43 | 19.33 | 14.29 | 40.85 | 0 | 25.30 | 80.05 | 38.19 | bfloat16 |
| meta-llama/Meta-Llama-3.1-8B | 27.04 | 83.02 | 74.52 | 80.71 | 81.21 | 62.00 | 0 | 11.53 | 19.58 | 45.10 | 1.60 | 2.60 | 66.23 | 13.07 | 30.10 | 90.69 | 5.88 | 0 | 1.41 | 0 | 28.86 | 74.38 | 41.63 | bfloat16 |
| utter-project/EuroLLM-9B | 25.87 | 83.10 | 67.70 | 72.24 | 74.52 | 70.40 | 3.25 | 18.29 | 7.34 | 42.48 | 3.60 | 0.19 | 70.26 | 17.07 | 31.00 | 83.11 | 5.88 | 14.29 | 7.04 | 0 | 27.71 | 76.92 | 44.01 | bfloat16 |
| BSC-LT/salamandra-7b-instruct | 25.13 | 84.13 | 57.33 | 80.38 | 82.03 | 62.00 | 10.67 | 7.68 | 8.74 | 0 | 0 | 19.38 | 67.83 | 14.93 | 19.50 | 88.78 | 18.21 | 21.43 | 9.86 | 0 | 24.28 | 58.31 | 30.38 | bfloat16 |
| utter-project/EuroLLM-9B-Instruct | 24.46 | 84.81 | 69.78 | 80.90 | 77.76 | 72.40 | 11.20 | 24.57 | 38.11 | 26.80 | 0 | 26.80 | 61.91 | 13.60 | 26.10 | 90.79 | 13.73 | 21.43 | 29.58 | 0 | 24.82 | 58.48 | 33.69 | bfloat16 |
| CohereForAI/aya-expanse-8b | 24.30 | 83.45 | 77.78 | 78.88 | 72.24 | 68.00 | 9.21 | 15.53 | 19.58 | 0 | 0 | 0.46 | 62.23 | 8.53 | 33.90 | 89.02 | 13.73 | 50.00 | 38.03 | 0 | 15.79 | 77.98 | 34.08 | float16 |
| BSC-LT/salamandra-7b | 24.04 | 81.93 | 22.07 | 74.68 | 78.11 | 62.80 | 5.37 | 21.46 | 19.58 | 45.10 | 2.40 | 0.17 | 57.27 | 10.40 | 18.60 | 87.78 | 5.88 | 0 | 15.49 | 0 | 26.15 | 69.21 | 46.92 | |
Notes:
- Average (all metrics): unweighted mean of all valid metrics across all tasks (46 values in total); this is the table's "Average" column
- Average (accuracy only): unweighted mean of all acc* metrics (22 accuracy values)
- Results for the other models come from laleaderboard_es
- Tango-70b stands out especially on: Average (59.90), Belebele Spa (92.00), COPA_es (89.60), EsCoLA (71.79), RagQuAS (79.87), SpaLawEx (51.26), and XL-Sum_es (19.71)
- Tango-70b outperforms the second-best model (google/gemma-2-9b-it at 33.62) by 26.28 percentage points
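The unweighted averaging described in these notes can be sketched in Python; the sample values below are illustrative, not the leaderboard's actual metric list:

```python
# A minimal sketch of the unweighted averaging described in the notes above.
def unweighted_average(values):
    """Unweighted mean over the valid (non-None) metric values."""
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals)

# Illustrative values only (e.g. two accuracy scores and a ROUGE score):
scores = [92.0, 89.6, 19.71]
print(unweighted_average(scores))
```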
Reproducing the results
- Create and activate a Python ≥ 3.9 virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```
- Install the dependencies and the harness in editable mode:
```bash
pip install -r requirements.txt
pip install -e .
```
- Log in to Hugging Face:
```bash
huggingface-cli login
```
- Run the evaluation script:
```bash
chmod +x run_laleaderboard_es.sh
./run_laleaderboard_es.sh
```
- Run the result-aggregation script:
```bash
python aggregate_laleaderboard_es_acc.py
```
The run_laleaderboard_es.sh script walks through every sub-task defined in
lm_eval/tasks/laleaderboard/laleaderboard_es.yaml, running one at a time. As soon as a task finishes, its results_<timestamp>.json file is written, so if the process is interrupted you keep everything completed so far.
The aggregate_laleaderboard_es_acc.py script reads all results_*.json files in tango-evals/ and computes:
- the mean of accuracy metrics only
- the mean of all metrics (the first metric of each task)
Where the results end up
Results are saved to the directory set in OUTPUT_DIR at the top of the script (./tango-evals by default). Example layout:
<OUTPUT_DIR>/
└── <model-name-sanitised>/
    ├── results_2024-05-29T14-52-17.json # metrics for one task
    └── … # more tasks / timestamps
Per-task console logs are stored in ./logs/ next to the script.
Resuming or re-running
- The script detects existing results_*<task>.json files and skips those tasks.
- You can adjust MODEL_ARGS, batch size, device, etc. by editing the script's header.
Hardware
- Hardware used: 4 × NVIDIA RTX 3090, 256 GB RAM.
- Adjust the batch size or parallelism to your GPU.
Owner
- Name: sandbox.ai
- Login: sandbox-ai
- Kind: organization
- Email: sandboxai@protonmail.com
- Repositories: 1
- Profile: https://github.com/sandbox-ai
Artificial Intelligence - Development - Innovation
Citation (CITATION.bib)
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = 12,
year = 2023,
publisher = {Zenodo},
version = {v0.4.0},
doi = {10.5281/zenodo.10256836},
url = {https://zenodo.org/records/10256836}
}
GitHub Events
Total
- Watch event: 1
- Push event: 1
- Create event: 1
Last Year
- Watch event: 1
- Push event: 1
- Create event: 1
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- tj-actions/changed-files v44.5.2 composite
- actions/checkout v4 composite
- actions/download-artifact v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- pypa/gh-action-pypi-publish release/v1 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite
- pre-commit/action v3.0.1 composite
- emoji ==2.14.0
- fugashi *
- neologdn ==0.5.3
- rouge_score >=0.1.2
- accelerate >=0.26.0
- datasets >=2.16.0
- dill *
- evaluate >=0.4.0
- evaluate *
- jsonlines *
- more_itertools *
- numexpr *
- peft >=0.2.0
- pybind11 >=2.6.2
- pytablewriter *
- rouge-score >=0.0.4
- sacrebleu >=1.5.0
- scikit-learn >=0.24.1
- sentence-transformers >=2.7
- sqlitedict *
- torch >=1.8
- tqdm-multiprocess *
- transformers >=4.1
- word2number *
- zstandard *