Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (2.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: sandbox-ai
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 24.3 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme Contributing License Citation Codeowners

README.md

Tango Evals

Repository for reproducing the laleaderboard evaluations with the Tango-70b model.

Results

Comparison with other models on laleaderboard_es

| Model | Average | AQuAS | Belebele Spa | ClinDiagnosES | ClinTreatES | COPA_es | Crows Pairs Spanish | EsCoLA | Fake News ES | HumorQA | MGSM_es | NoticIA | OffendES | OpenBookQA_es | PAWS-X_es | RagQuAS | SpaLawEx | TELEIA | WNLI ES | XL-Sum_es | XNLI_es | XQuAD_es | xStoryCloze_es | Precision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tango-70b | 59.90 | 75.78 | 92.00 | 65.72 | 63.43 | 89.60 | 55.96 | 71.79 | 76.57 | 25.49 | 32.40 | 0.86 | 72.64 | 34.80 | 70.95 | 79.87 | 51.26 | 61.90 | 77.46 | 19.71 | 52.37 | 75.16 | 74.72 | bf4 |
| google/gemma-2-9b-it | 33.62 | 85.93 | 86.22 | 83.19 | 81.42 | 78.80 | 17.96 | 34.52 | 62.94 | 45.10 | 0 | 34.11 | 64.52 | 9.33 | 27.60 | 88.01 | 30.53 | 35.72 | 52.11 | 0 | 24.28 | 62.29 | 35.01 | bfloat16 |
| google/gemma-2-9b | 32.97 | 83.02 | 83.26 | 77.77 | 80.93 | 68.80 | 13.59 | 28.79 | 16.00 | 45.10 | 4.80 | 0.23 | 66.33 | 12.00 | 24.70 | 86.79 | 5.88 | 35.72 | 4.23 | 0 | 29.76 | 75.33 | 47.98 |  |
| meta-llama/Meta-Llama-3.1-8B-Instruct | 30.23 | 85.31 | 83.56 | 81.75 | 73.40 | 72.00 | 6.03 | 24.24 | 60.14 | 37.25 | 0 | 28.71 | 57.00 | 12.00 | 33.20 | 88.62 | 19.33 | 21.43 | 32.39 | 0 | 25.30 | 69.94 | 35.54 | bfloat16 |
| Qwen/Qwen2.5-7B | 27.61 | 85.37 | 84.89 | 79.25 | 81.90 | 62.00 | 8.81 | 20.72 | 42.66 | 45.10 | 5.20 | 3.93 | 67.03 | 10.67 | 29.60 | 90.43 | 19.33 | 14.29 | 40.85 | 0 | 25.30 | 80.05 | 38.19 | bfloat16 |
| meta-llama/Meta-Llama-3.1-8B | 27.04 | 83.02 | 74.52 | 80.71 | 81.21 | 62.00 | 0 | 11.53 | 19.58 | 45.10 | 1.60 | 2.60 | 66.23 | 13.07 | 30.10 | 90.69 | 5.88 | 0 | 1.41 | 0 | 28.86 | 74.38 | 41.63 | bfloat16 |
| utter-project/EuroLLM-9B | 25.87 | 83.10 | 67.70 | 72.24 | 74.52 | 70.40 | 3.25 | 18.29 | 7.34 | 42.48 | 3.60 | 0.19 | 70.26 | 17.07 | 31.00 | 83.11 | 5.88 | 14.29 | 7.04 | 0 | 27.71 | 76.92 | 44.01 | bfloat16 |
| BSC-LT/salamandra-7b-instruct | 25.13 | 84.13 | 57.33 | 80.38 | 82.03 | 62.00 | 10.67 | 7.68 | 8.74 | 0 | 0 | 19.38 | 67.83 | 14.93 | 19.50 | 88.78 | 18.21 | 21.43 | 9.86 | 0 | 24.28 | 58.31 | 30.38 | bfloat16 |
| utter-project/EuroLLM-9B-Instruct | 24.46 | 84.81 | 69.78 | 80.90 | 77.76 | 72.40 | 11.20 | 24.57 | 38.11 | 26.80 | 0 | 26.80 | 61.91 | 13.60 | 26.10 | 90.79 | 13.73 | 21.43 | 29.58 | 0 | 24.82 | 58.48 | 33.69 | bfloat16 |
| CohereForAI/aya-expanse-8b | 24.30 | 83.45 | 77.78 | 78.88 | 72.24 | 68.00 | 9.21 | 15.53 | 19.58 | 0 | 0 | 0.46 | 62.23 | 8.53 | 33.90 | 89.02 | 13.73 | 50.00 | 38.03 | 0 | 15.79 | 77.98 | 34.08 | float16 |
| BSC-LT/salamandra-7b | 24.04 | 81.93 | 22.07 | 74.68 | 78.11 | 62.80 | 5.37 | 21.46 | 19.58 | 45.10 | 2.40 | 0.17 | 57.27 | 10.40 | 18.60 | 87.78 | 5.88 | 0 | 15.49 | 0 | 26.15 | 69.21 | 46.92 |  |

Notes:

- Average: unweighted mean of all valid metrics across all tasks (46 values in total)
- Average (accuracy only): unweighted mean of all acc* metrics (22 accuracy values)
- Average (all metrics): unweighted mean of all valid metrics across all tasks (46 values in total)
- The results for the other models come from laleaderboard_es
- Tango-70b stands out in particular on: Average (59.90), Belebele Spa (92.00), COPA_es (89.60), EsCoLA (71.79), RagQuAS (79.87), SpaLawEx (51.26), and XL-Sum_es (19.71)
- Tango-70b outperforms the second-best model (google/gemma-2-9b-it at 33.62) by 26.28 percentage points

Reproducing the results

  1. Create and activate a Python ≥ 3.9 virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
```

  2. Install the dependencies and the harness in editable mode:

```bash
pip install -r requirements.txt
pip install -e .
```

  3. Log in to Hugging Face:

```bash
huggingface-cli login
```

  4. Run the evaluation script:

```bash
chmod +x run_laleaderboard_es.sh
./run_laleaderboard_es.sh
```

  5. Run the results-aggregation script:

```bash
python aggregate_laleaderboard_es_acc.py
```

The run_laleaderboard_es.sh script iterates over each sub-task defined in lm_eval/tasks/laleaderboard/laleaderboard_es.yaml, running one at a time. As soon as a task finishes, its results_<timestamp>.json file is written, so if the process is interrupted you keep everything already completed.
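
For illustration, here is a minimal Python sketch of that same loop logic (the actual script is bash; the task names, paths, and the results-file naming pattern below are placeholders, not the script's real values):

```python
# Illustrative Python equivalent of the loop in run_laleaderboard_es.sh.
import glob
import subprocess

OUTPUT_DIR = "./tango-evals"
TASKS = ["aquas", "belebele_spa", "copa_es"]  # illustrative subset

for task in TASKS:
    # Resume support: skip sub-tasks that already have a results file
    # (mirrors the behaviour described under "Resuming or re-running").
    if glob.glob(f"{OUTPUT_DIR}/**/results_*{task}*.json", recursive=True):
        print(f"skipping {task}: results already present")
        continue
    # One harness invocation per sub-task, so each finished task writes
    # its own results_<timestamp>.json before the next one starts.
    subprocess.run(
        ["lm_eval",
         "--model", "hf",
         "--model_args", "pretrained=<model-id>",  # placeholder model id
         "--tasks", task,
         "--output_path", OUTPUT_DIR],
        check=True,
    )
```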

The aggregate_laleaderboard_es_acc.py script reads every results_*.json file in tango-evals/ and computes:
- the mean of the accuracy metrics only
- the mean of all metrics (the first metric of each task)
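
For concreteness, a minimal sketch of that aggregation logic (not the repository's actual code; it assumes the results JSONs follow the lm-evaluation-harness layout, {"results": {task: {metric: value, ...}}}):

```python
# Sketch: mean of acc* metrics and mean of the first metric per task,
# over all results_*.json files under tango-evals/.
import glob
import json
from statistics import mean

acc_values, first_metrics = [], []
for path in glob.glob("tango-evals/**/results_*.json", recursive=True):
    with open(path) as f:
        results = json.load(f)["results"]
    for task, metrics in results.items():
        # Keep only numeric metric values (drops strings such as aliases).
        numeric = {k: v for k, v in metrics.items()
                   if isinstance(v, (int, float))}
        acc_values += [v for k, v in numeric.items() if k.startswith("acc")]
        if numeric:
            first_metrics.append(next(iter(numeric.values())))

print(f"accuracy-only mean: {mean(acc_values):.2f}")
print(f"all-metrics mean:   {mean(first_metrics):.2f}")
```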

Where the results end up

Results are saved to the directory set in OUTPUT_DIR at the top of the script (./tango-evals by default). Example layout:

```
<OUTPUT_DIR>/
└── <model-name-sanitised>/
    ├── results_2024-05-29T14-52-17.json   # metrics for one task
    └── …                                  # more tasks / timestamps
```

Per-task console logs are stored in ./logs/ next to the script.

Resuming or re-running

• The script detects existing results_*<task>.json files and skips those tasks.
• You can adjust MODEL_ARGS, the batch size, the device, etc. by editing the header of the script.

Hardware

– Hardware used: 4 × NVIDIA RTX 3090, 256 GB RAM.
– Adjust the batch size or the parallelism to match your GPU (see the sketch below).
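
As a hedged example of the kind of adjustment involved, this uses the harness's Python API (lm_eval.simple_evaluate) rather than the repo's bash script, which sets the equivalent options (MODEL_ARGS, batch size, device) in its header; check the exact arguments against your installed harness version:

```python
# Hypothetical example of tuning batch size and GPU parallelism via the
# harness's Python API; not the repository's actual invocation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<model-id>,parallelize=True",  # shard across GPUs
    tasks=["copa_es"],    # one illustrative sub-task
    batch_size="auto",    # probe the largest batch that fits in memory
)
```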

Owner

  • Name: sandbox.ai
  • Login: sandbox-ai
  • Kind: organization
  • Email: sandboxai@protonmail.com

Artificial Intelligence - Development - Innovation

Citation (CITATION.bib)

@misc{eval-harness,
  author       = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = 12,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.4.0},
  doi          = {10.5281/zenodo.10256836},
  url          = {https://zenodo.org/records/10256836}
}

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
  • Create event: 1
Last Year
  • Watch event: 1
  • Push event: 1
  • Create event: 1

Dependencies

.github/workflows/new_tasks.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • tj-actions/changed-files v44.5.2 composite
.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/unit_tests.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • pre-commit/action v3.0.1 composite
lm_eval/tasks/japanese_leaderboard/requirements.txt pypi
  • emoji ==2.14.0
  • fugashi *
  • neologdn ==0.5.3
  • rouge_score >=0.1.2
pyproject.toml pypi
  • accelerate >=0.26.0
  • datasets >=2.16.0
  • dill *
  • evaluate >=0.4.0
  • evaluate *
  • jsonlines *
  • more_itertools *
  • numexpr *
  • peft >=0.2.0
  • pybind11 >=2.6.2
  • pytablewriter *
  • rouge-score >=0.0.4
  • sacrebleu >=1.5.0
  • scikit-learn >=0.24.1
  • sentence-transformers >=2.7
  • sqlitedict *
  • torch >=1.8
  • tqdm-multiprocess *
  • transformers >=4.1
  • word2number *
  • zstandard *
requirements.txt pypi
setup.py pypi