evalchemy

Automatic evals for LLMs

https://github.com/mlfoundations/evalchemy

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization mlfoundations has institutional domain (people.csail.mit.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Automatic evals for LLMs

Basic Info
  • Host: GitHub
  • Owner: mlfoundations
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 54 MB
Statistics
  • Stars: 522
  • Watchers: 17
  • Forks: 62
  • Open Issues: 28
  • Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme Contributing Citation

README.md

🧪 Evalchemy

A unified and easy-to-use toolkit for evaluating post-trained language models


Evalchemy is developed by the DataComp community and Bespoke Labs and builds on the LM-Eval-Harness.

🎉 What's New

[2025.02.24] New Reasoning Benchmarks

[2025.01.30] API Model Support

```bash
python -m eval.eval \
    --model curator \
    --tasks AIME24,MATH500,GPQADiamond \
    --model_name "gemini/gemini-2.0-flash-thinking-exp-01-21" \
    --apply_chat_template False \
    --model_args 'tokenized_requests=False' \
    --output_path logs
```

[2025.01.29] New Reasoning Benchmarks

  • AIME24, AMC23, MATH500, LiveCodeBench, GPQADiamond, HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have been added to our growing list of available benchmarks, as part of the Open Thoughts project. See our blog post on using Evalchemy to measure reasoning models.

[2025.01.28] New Model Support

  • vLLM models: High-performance inference and serving engine with PagedAttention technology

```bash
python -m eval.eval \
    --model vllm \
    --tasks alpaca_eval \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
    --batch_size 16 \
    --output_path logs
```

  • OpenAI models: Full support for OpenAI's model lineup

```bash
python -m eval.eval \
    --model openai-chat-completions \
    --tasks alpaca_eval \
    --model_args "model=gpt-4o-mini-2024-07-18,num_concurrent=32" \
    --batch_size 16 \
    --output_path logs
```

Key Features

  • Unified Installation: One-step setup for all benchmarks, eliminating dependency conflicts
  • Parallel Evaluation:
    • Data-Parallel: Distribute evaluations across multiple GPUs for faster results
    • Model-Parallel: Handle large models that don't fit on a single GPU
  • Simplified Usage: Run any benchmark with a consistent command-line interface
  • Results Management:
    • Local results tracking with standardized output format
    • Optional database integration for systematic tracking
    • Leaderboard submission capability (requires database setup)

⚡ Quick Start

Installation

We suggest using conda (installation instructions).

```bash
# Create and activate conda environment
conda create --name evalchemy python=3.10
conda activate evalchemy

# Clone the repo
git clone git@github.com:mlfoundations/evalchemy.git
cd evalchemy

# Install dependencies
pip install -e .
pip install -e eval/chat_benchmarks/alpaca_eval

# Note: On some HPC systems you may need to modify pyproject.toml
# to use absolute paths for the fschat dependency.
# Change: "fschat @ file:eval/chat_benchmarks/MTBench"
# To:     "fschat @ file:///absolute/path/to/evalchemy/eval/chat_benchmarks/MTBench"
# Or remove the entry entirely and separately run:
#   pip install -e eval/chat_benchmarks/MTBench

# Log into HuggingFace for datasets and models
huggingface-cli login
```

📚 Available Tasks

Built-in Benchmarks

Reproduced results for these benchmarks, compared against published numbers, are recorded in reproduced_benchmarks.md.

Basic Usage

If a benchmark requires an LLM judge, make sure OPENAI_API_KEY is set in your environment before running evaluations.

```bash
python -m eval.eval \
    --model hf \
    --tasks HumanEval,mmlu \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs
```

The results are written to output_path. If you have jq installed, you can inspect them easily after evaluation, e.g. jq '.results' logs/Qwen__Qwen2.5-7B-Instruct/results_2024-11-17T17-12-28.668908.json
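
If you prefer Python to jq, here is a minimal sketch for loading a results file; it only assumes the top-level "results" key used in the jq example above:

```python
import json

def load_results(path):
    """Load an Evalchemy results JSON and return the per-task results
    stored under the top-level "results" key (as in `jq '.results'`)."""
    with open(path) as f:
        data = json.load(f)
    return data["results"]
```

You can then iterate over the returned dict, e.g. printing each task name and its metrics.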

Args:

  • --model: Which model type or provider is evaluated (example: hf)
  • --tasks: Comma-separated list of tasks to be evaluated.
  • --model_args: Model path and parameters. Comma-separated list of parameters passed to the model constructor. Accepts a string of the format "arg1=val1,arg2=val2,...". You can find the list of supported arguments here.
  • --batch_size: Batch size for inference
  • --output_path: Directory to save evaluation results
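
Since --model_args accepts a string of the form "arg1=val1,arg2=val2,...", the parsing idea can be sketched as follows (an illustrative helper, not Evalchemy's actual parser):

```python
def parse_model_args(s):
    """Split an "arg1=val1,arg2=val2" string into a dict of strings."""
    args = {}
    for pair in s.split(","):
        if not pair:
            continue  # tolerate empty segments such as trailing commas
        key, _, value = pair.partition("=")
        args[key.strip()] = value.strip()
    return args
```

For example, parse_model_args("pretrained=mistralai/Mistral-7B-Instruct-v0.3,parallelize=True") yields {"pretrained": "mistralai/Mistral-7B-Instruct-v0.3", "parallelize": "True"}.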

Example running multiple benchmarks:

```bash
python -m eval.eval \
    --model hf \
    --tasks MTBench,WildBench,alpaca_eval \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs
```

Config shortcuts:

To reuse commonly used settings without supplying full arguments every time, we support reading eval configs from YAML files. These configs replace the --batch_size, --tasks, and --annotator_model arguments. Example config files can be found in ./configs. To use them, pass the --config flag as shown below:

```bash
python -m eval.eval \
    --model hf \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --output_path logs \
    --config configs/light_gpt4omini0718.yaml
```
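
A hypothetical config sketch is shown below; the key names are assumptions mirroring the flags they replace, so consult the files in ./configs for the actual schema:

```yaml
# Hypothetical example; see ./configs for real config files.
tasks: MTBench,alpaca_eval
batch_size: 2
annotator_model: gpt-4o-mini-2024-07-18
```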

We provide several more command examples in eval/examples to help you get started with Evalchemy.

🔧 Advanced Usage

Support for different models

Through LM-Eval-Harness, we support all HuggingFace models and are currently adding support for all LM-Eval-Harness model types, such as OpenAI and vLLM. For more information on these models, please check out the models page.

To choose a model, simply set pretrained=<model_name> in --model_args, where the model name is either a HuggingFace model name or a path to a local model.

HPC Distributed Evaluation

For even faster evaluation, use full data parallelism and launch a vLLM process for each GPU.

We have also made this easy to do at scale across multiple nodes on HPC (High-Performance Computing) clusters:

```bash
python eval/distributed/launch.py --model_name <model_id> --tasks <task_list> --num_shards <n> --watchdog
```

Key features:
  • Run evaluations in parallel across multiple compute nodes
  • Dramatically reduce wall-clock time for large benchmarks
  • Offline mode support for environments without internet access on GPU nodes
  • Automatic cluster detection and configuration
  • Efficient result collection and scoring

Refer to the distributed README for more details.

NOTE: This is configured for specific HPC clusters, but can easily be adapted. It can also be adapted to a non-HPC setup by using CUDA_VISIBLE_DEVICES instead of SLURM job arrays.
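
To illustrate the non-HPC route, here is a rough sketch (a hypothetical helper, not part of the repo) that pins one worker process per GPU via CUDA_VISIBLE_DEVICES in place of a SLURM job array:

```python
import os
import subprocess

def launch_per_gpu(cmd_template, num_gpus):
    """Start one shard per GPU, each pinned to a single device with
    CUDA_VISIBLE_DEVICES, then wait for all of them to finish.
    cmd_template receives the shard index via {shard}."""
    procs = []
    for gpu in range(num_gpus):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(
            cmd_template.format(shard=gpu), shell=True, env=env))
    return [p.wait() for p in procs]  # one exit code per shard
```

The exact per-shard flags to pass to eval/distributed/launch.py are described in the distributed README.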

Multi-GPU Evaluation

NOTE: this is slower than doing fully data parallel evaluation (see previous section)

```bash
accelerate launch --num-processes <num-gpus> --num-machines <num-nodes> \
    --multi-gpu -m eval.eval \
    --model hf \
    --tasks MTBench,alpaca_eval \
    --model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3' \
    --batch_size 2 \
    --output_path logs
```

Large Model Evaluation

For models that don't fit on a single GPU, use model parallelism:

```bash
python -m eval.eval \
    --model hf \
    --tasks MTBench,alpaca_eval \
    --model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3,parallelize=True' \
    --batch_size 2 \
    --output_path logs
```

💡 Note: While "auto" batch size is supported, we recommend manually tuning the batch size for optimal performance. The optimal batch size depends on the model size, GPU memory, and the specific benchmark. We used a maximum of 32 and a minimum of 4 (for RepoBench) to evaluate Llama-3-8B-Instruct on 8xH100 GPUs.

Output Log Structure

Our generated logs include critical information about each evaluation to help inform your experiments. The most important items are highlighted below.

  • Model Configuration
    • model: Model framework used
    • model_args: Model arguments for the model framework
    • batch_size: Size of processing batches
    • device: Computing device specification
    • annotator_model: Model used for annotation ("gpt-4o-mini-2024-07-18")
  • Seed Configuration
    • random_seed: General random seed
    • numpy_seed: NumPy-specific seed
    • torch_seed: PyTorch-specific seed
    • fewshot_seed: Seed for few-shot examples
  • Model Details

    • model_num_parameters: Number of model parameters
    • model_dtype: Model data type
    • model_revision: Model version
    • model_sha: Model commit hash
  • Version Control

    • git_hash: Repository commit hash
    • date: Unix timestamp of evaluation
    • transformers_version: Hugging Face Transformers version
  • Tokenizer Configuration

    • tokenizer_pad_token: Padding token details
    • tokenizer_eos_token: End of sequence token
    • tokenizer_bos_token: Beginning of sequence token
    • eot_token_id: End of text token ID
    • max_length: Maximum sequence length
  • Model Settings

    • model_source: Model source platform
    • model_name: Full model identifier
    • model_name_sanitized: Sanitized model name for file system usage
    • chat_template: Conversation template
    • chat_template_sha: Template hash
  • Timing Information

    • start_time: Evaluation start timestamp
    • end_time: Evaluation end timestamp
    • total_evaluation_time_seconds: Total duration
  • Hardware Environment

    • PyTorch version and build configuration
    • Operating system details
    • GPU configuration
    • CPU specifications
    • CUDA and driver versions
    • Relevant library versions
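
As an example of the sanitization step, a model id containing a path separator must be rewritten before it can name a log directory (compare the logs/Qwen__Qwen2.5-7B-Instruct path in the jq example earlier). A sketch of the idea; the harness's exact rule may differ:

```python
import re

def sanitize_model_name(name):
    """Replace filesystem-unsafe characters with "__" so a model id
    such as "Qwen/Qwen2.5-7B-Instruct" can be used as a directory name."""
    return re.sub(r"[/\\:\s]", "__", name)
```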

Customizing Evaluation

🤖 Change Annotator Model

As part of Evalchemy, we want to make it easy to swap in different language-model judges for standard benchmarks. Currently, we support two judge settings. The first is the default setting, which uses a benchmark's default judge. To activate it, either do nothing or pass:

```bash
--annotator_model auto
```

In addition to the default assignments, we support using gpt-4o-mini-2024-07-18 as a judge:

```bash
--annotator_model gpt-4o-mini-2024-07-18
```

We are planning on adding support for different judges in the future!

⏱️ Runtime and Cost Analysis

Evalchemy makes running common benchmarks simple, fast, and versatile! Below we list the runtimes and costs we achieve with Evalchemy for each benchmark, using Meta-Llama-3-8B-Instruct on 8xH100 GPUs.

| Benchmark | Runtime (8xH100) | Batch Size | Total Tokens | Default Judge Cost ($) | GPT-4o-mini Judge Cost ($) | Notes |
|-----------|------------------|------------|--------------|------------------------|----------------------------|-------|
| MTBench | 14:00 | 32 | ~196K | 6.40 | 0.05 | |
| WildBench | 38:00 | 32 | ~2.2M | 30.00 | 0.43 | |
| RepoBench | 46:00 | 4 | ~23K | - | - | Lower batch size due to memory |
| MixEval | 13:00 | 32 | ~4-6M | 3.36 | 0.76 | Varies by judge model |
| AlpacaEval | 16:00 | 32 | ~936K | 9.40 | 0.14 | |
| HumanEval | 4:00 | 32 | ~300 | - | - | No API costs |
| IFEval | 1:30 | 32 | ~550 | - | - | No API costs |
| ZeroEval | 1:44:00 | 32 | ~8K | - | - | Longest runtime |
| MBPP | 6:00 | 32 | 500 | - | - | No API costs |
| MMLU | 7:00 | 32 | 500 | - | - | No API costs |
| ARC | 4:00 | 32 | - | - | - | No API costs |
| DROP | 20:00 | 32 | - | - | - | No API costs |

Notes:
  • Runtimes measured using 8x H100 GPUs with the Meta-Llama-3-8B-Instruct model
  • Batch sizes optimized for memory and speed
  • API costs vary based on judge model choice

Cost-Saving Tips:
  • Use the gpt-4o-mini-2024-07-18 judge when possible for significant cost savings
  • Adjust batch size based on available memory
  • Consider using data-parallel evaluation for faster results

🔐 Special Access Requirements

ZeroEval Access

To run ZeroEval benchmarks, you need to:

  1. Request access to the ZebraLogicBench-private dataset on Hugging Face
  2. Accept the terms and conditions
  3. Log in to your Hugging Face account when running evaluations

🛠️ Implementing Custom Evaluations

To add a new evaluation system:

  1. Create a new directory under eval/chat_benchmarks/
  2. Implement eval_instruct.py with two required functions:
    • eval_instruct(model): Takes an LM Eval Model, returns results dict
    • evaluate(results): Takes results dictionary, returns evaluation metrics
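
The two required functions can be sketched as a minimal skeleton; everything in the bodies below is illustrative, and only the file name, function names, and signatures come from the steps above:

```python
# eval/chat_benchmarks/<your_benchmark>/eval_instruct.py (illustrative skeleton)

def eval_instruct(model):
    """Take an LM Eval Model and return a results dict.
    A real implementation would generate model outputs here; this stub
    returns a hard-coded example to show the expected shape."""
    return {"examples": [{"prompt": "2+2?", "output": "4", "target": "4"}]}

def evaluate(results):
    """Take the results dictionary and return evaluation metrics."""
    examples = results["examples"]
    correct = sum(ex["output"] == ex["target"] for ex in examples)
    return {"accuracy": correct / len(examples)}
```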

Adding External Evaluation Repositories

Use git subtree to manage external evaluation code:

```bash
# Add external repository
git subtree add --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Pull updates
git subtree pull --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Push contributions back
git subtree push --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git contribution-branch
```

🔍 Debug Mode

To run evaluations in debug mode, add the --debug flag:

```bash
python -m eval.eval \
    --model hf \
    --tasks MTBench \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs \
    --debug
```

This is particularly useful when testing new evaluation implementations, debugging model configurations, verifying dataset access, and testing database connectivity.

🚀 Performance Tips

  1. Utilize batch processing for faster evaluation:

```python
all_instances.append(
    Instance(
        "generate_until",
        example,
        (
            inputs,
            {
                "max_new_tokens": 1024,
                "do_sample": False,
            },
        ),
        idx,
    )
)

outputs = self.compute(model, all_instances)
```

  2. Use the LM-eval logger for consistent logging across evaluations

🔧 Troubleshooting

Evalchemy has been tested on CUDA 12.4. If you run into issues like undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12, try updating your CUDA version:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
```

🏆 Leaderboard Integration

To track experiments and evaluations, we support logging results to a PostgreSQL database. Details on the entry schemas and database setup can be found in database/.

Contributing

Thank you to all the contributors for making this project possible! Please follow these instructions on how to contribute.

Citation

If you find Evalchemy useful, please consider citing us!

```bibtex
@software{evalchemy,
  author = {Raoof, Negin and Guha, Etash Kumar and Marten, Ryan and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Bansal, Hritik and Smyrnis, Georgios and Nezhurina, Marianna and Vu, Trung and Sprague, Zayne Rea and Merrill, Mike A and Chen, Liangyu and Choi, Caroline and Khan, Zaid and Grover, Sachin and Feuer, Benjamin and Suvarna, Ashima and Su, Shiye and Zhao, Wanjia and Sharma, Kartik and Ji, Charlie Cheng-Jie and Arora, Kushal and Li, Jeffrey and Gokaslan, Aaron and Pratt, Sarah M and Muennighoff, Niklas and Saad-Falcon, Jon and Yang, John and Aali, Asad and Pimpalgaonkar, Shreyas and Albalak, Alon and Dave, Achal and Pouransari, Hadi and Durrett, Greg and Oh, Sewoong and Hashimoto, Tatsunori and Shankar, Vaishaal and Choi, Yejin and Bansal, Mohit and Hegde, Chinmay and Heckel, Reinhard and Jitsev, Jenia and Sathiamoorthy, Maheswaran and Dimakis, Alex and Schmidt, Ludwig},
  month = jun,
  title = {{Evalchemy: Automatic evals for LLMs}},
  year = {2025}
}
```

Owner

  • Name: mlfoundations
  • Login: mlfoundations
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software in your work, please cite it as below."
title: "Evalchemy"
authors:
  - family-names: "Guha"
    given-names: "Etash"
  - family-names: "Raoff"
    given-names: "Negin"
  - family-names: "Mercat"
    given-names: "Jean"
  - family-names: "Marten"
    given-names: "Ryan"
  - family-names: "Frankel"
    given-names: "Eric"
  - family-names: "Keh"
    given-names: "Sedrick"
  - family-names: "Grover"
    given-names: "Sachin"
  - family-names: "Smyrnis"
    given-names: "George"
  - family-names: "Vu"
    given-names: "Trung"
  - family-names: "Saad-Falcon"
    given-names: "Jon"
  - family-names: "Choi"
    given-names: "Caroline"
  - family-names: "Arora"
    given-names: "Kushal"
  - family-names: "Merrill"
    given-names: "Mike"
  - family-names: "Deng"
    given-names: "Yichuan"
  - family-names: "Suvarna"
    given-names: "Ashima"
  - family-names: "Bansal"
    given-names: "Hritik"
  - family-names: "Nezhurina"
    given-names: "Marianna"
  - family-names: "Heckel"
    given-names: "Reinhard"
  - family-names: "Oh" 
    given-names: "Seewong"
  - family-names: "Hashimoto"
    given-names: "Tatsunori"
  - family-names: "Jitsev"
    given-names: "Jenia"
  - family-names: "Choi"
    given-names: "Yejin"
  - family-names: "Shankar"
    given-names: "Vaishaal"
  - family-names: "Dimakis"
    given-names: "Alex"
  - family-names: "Sathiamoorthy"
    given-names: "Mahesh"
  - family-names: "Schmidt"
    given-names: "Ludwig"
  
date-released: "2024-11-28"
repository: "https://github.com/mlfoundations/evalchemy"
publisher: "GitHub"
type: "software"

GitHub Events

Total
  • Create event: 61
  • Issues event: 32
  • Watch event: 425
  • Delete event: 37
  • Issue comment event: 100
  • Member event: 5
  • Push event: 619
  • Public event: 1
  • Pull request review comment event: 23
  • Pull request review event: 55
  • Pull request event: 148
  • Fork event: 59
Last Year
  • Create event: 61
  • Issues event: 32
  • Watch event: 425
  • Delete event: 37
  • Issue comment event: 100
  • Member event: 5
  • Push event: 619
  • Public event: 1
  • Pull request review comment event: 23
  • Pull request review event: 55
  • Pull request event: 148
  • Fork event: 59

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 24
  • Total pull requests: 163
  • Average time to close issues: 14 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 19
  • Total pull request authors: 23
  • Average comments per issue: 0.38
  • Average comments per pull request: 0.67
  • Merged pull requests: 131
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 24
  • Pull requests: 163
  • Average time to close issues: 14 days
  • Average time to close pull requests: 4 days
  • Issue authors: 19
  • Pull request authors: 23
  • Average comments per issue: 0.38
  • Average comments per pull request: 0.67
  • Merged pull requests: 131
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • RyanMarten (4)
  • penfever (2)
  • sravan500 (2)
  • luckyfan-cs (1)
  • EtashGuha (1)
  • TundeAtSN (1)
  • richardbaihe (1)
  • aashay-sarvam (1)
  • Siki-cloud (1)
  • chanansh (1)
  • marianna13 (1)
  • noobimp (1)
  • BeastyZ (1)
  • juyongjiang (1)
  • slimfrkha (1)
Pull Request Authors
  • neginraoof (32)
  • sedrick-keh-tri (24)
  • RyanMarten (22)
  • jmercat (21)
  • EtashGuha (11)
  • penfever (6)
  • marianna13 (6)
  • ssu53 (6)
  • GeorgiosSmyrnis (4)
  • asad-aali (4)
  • Zayne-sprague (3)
  • Hritikbansal (3)
  • reinhardh (3)
  • jonsaadfalcon (2)
  • younesbelkada (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

eval/chat_benchmarks/MTBench/docker/Dockerfile docker
  • nvidia/cuda 12.2.0-runtime-ubuntu20.04 build
eval/chat_benchmarks/MTBench/docker/docker-compose.yml docker
  • fastchat latest
eval/chat_benchmarks/IFEval/requirements.txt pypi
  • absl *
  • immutabledict *
  • langdetect *
  • nltk *
eval/chat_benchmarks/MTBench/pyproject.toml pypi
  • aiohttp *
  • fastapi *
  • httpx *
  • markdown2 [all]
  • nh3 *
  • numpy *
  • prompt_toolkit >=3.0.0
  • psutil *
  • pydantic <3,>=2.0.0
  • pydantic-settings *
  • requests *
  • rich >=10.0.0
  • shortuuid *
  • tiktoken *
  • uvicorn *
eval/chat_benchmarks/MixEval/setup.py pypi
  • Andere *
  • SentencePiece >=0.2.0
  • accelerate >=0.30.1
  • hf_transfer >=0.1.6
  • httpx >=0.27.0
  • nltk >=3.8.1
  • numpy >=1.26.3
  • openai >=1.30.5
  • pandas >=2.2.2
  • prettytable *
  • python-dotenv >=1.0.1
  • scikit-learn >=1.5.0
  • tiktoken >=0.6.0
  • tqdm >=4.66.4
  • transformers >=4.43.1
eval/chat_benchmarks/RepoBench/requirements.txt pypi
  • codebleu *
  • difflib *
  • fire *
  • fuzzywuzzy *
  • openai *
  • python-Levenshtein *
  • torch *
  • tqdm *
  • transformers *
  • tree-sitter-java *
  • tree-sitter-python *
eval/chat_benchmarks/WildBench/requirements.txt pypi
  • anthropic *
  • cohere *
  • datasets *
  • fire *
  • google-generativeai *
  • jsonlines *
  • mistralai ==0.4.2
  • openai *
  • reka-api *
  • tenacity *
  • together *
eval/chat_benchmarks/alpaca_eval/requirements.txt pypi
  • datasets >=2.20.0
  • fire *
  • openai >=1.5.0
  • pandas *
  • patsy *
  • scikit-learn *
  • scipy *
  • tiktoken >=0.3.2
  • tqdm *
eval/chat_benchmarks/alpaca_eval/setup.py pypi
  • datasets *
  • fire *
  • huggingface_hub *
  • openai >=1.5.0
  • pandas *
  • patsy *
  • python-dotenv *
  • scikit-learn *
  • scipy *
  • tiktoken >=0.3.2
eval/chat_benchmarks/zeroeval/requirements.txt pypi
  • anthropic *
  • cohere *
  • datasets *
  • fire *
  • google-generativeai *
  • jsonlines *
  • mistralai <1.0
  • openai *
  • reka-api *
  • tenacity *
  • together *
pyproject.toml pypi
  • accelerate *
  • aiofiles *
  • aiohttp [speedups]>=3.8
  • anthropic *
  • asttokens *
  • backoff >=2.2
  • bitsandbytes *
  • black *
  • boto3 *
  • botocore *
  • bs4 *
  • codebleu *
  • cohere *
  • colorama *
  • dashscope *
  • datasets *
  • distro *
  • docker-pycreds *
  • evaluate *
  • faiss-cpu ==1.7.4
  • fastapi >=0.101.0
  • fastavro *
  • fasttext-wheel *
  • fire *
  • fschat *
  • fsspec ==2024.6.1
  • fuzzywuzzy *
  • gcsfs *
  • google-auth ==2.25.1
  • google-cloud-aiplatform *
  • google-generativeai *
  • hf-transfer *
  • hjson *
  • httpx *
  • httpx-sse *
  • huggingface_hub [cli]
  • immutabledict *
  • jiter *
  • jmespath *
  • joblib *
  • jsonpickle *
  • langdetect *
  • loguru >=0.7
  • lxml *
  • mistralai *
  • msgpack *
  • mypy_extensions *
  • nltk *
  • numpy *
  • openai *
  • optimum ==1.12.0
  • pandas *
  • patsy *
  • peft *
  • portalocker *
  • prettytable *
  • protobuf *
  • psycopg [binary]
  • psycopg2-binary *
  • py-cpuinfo *
  • python-Levenshtein *
  • python-box *
  • python-dotenv *
  • ray [default]
  • reka-api *
  • requests >=2.28
  • sacrebleu *
  • sagemaker *
  • scikit-learn *
  • scipy *
  • sentence-transformers >=2.2.2
  • sentencepiece *
  • sentry-sdk *
  • shortuuid *
  • sqlalchemy *
  • sqlitedict *
  • tabulate *
  • tenacity *
  • tensorboard *
  • termcolor *
  • threadpoolctl *
  • tiktoken *
  • together *
  • torch *
  • torchvision *
  • tqdm *
  • transformers *
  • tree-sitter-java *
  • tree-sitter-python *
  • tree_sitter *
  • trl *
  • typing_inspect *
  • uvicorn >=0.23.0
  • wandb *
  • websocket *
.github/workflows/black.yaml actions
  • actions/checkout v4 composite
  • psf/black stable composite
eval/chat_benchmarks/BigCodeBench/docker/Dockerfile docker
  • nvcr.io/nvidia/pytorch 24.09-py3 build
eval/chat_benchmarks/MultiPLE/docker/Dockerfile docker
  • nvcr.io/nvidia/pytorch 24.09-py3 build
eval/chat_benchmarks/BigCodeBench/requirements/requirements-eval.txt pypi
  • Django ==4.2.7
  • Faker ==20.1.0
  • Flask-Mail ==0.9.1
  • Levenshtein ==0.25.0
  • Pillow ==10.3.0
  • PyYAML ==6.0.1
  • Requests ==2.31.0
  • WTForms ==3.1.2
  • Werkzeug ==3.0.1
  • beautifulsoup4 ==4.8.2
  • blake3 ==0.4.1
  • chardet ==5.2.0
  • cryptography ==38.0.0
  • datetime ==5.5
  • dnspython ==2.6.1
  • docxtpl ==0.11.5
  • flask ==3.0.3
  • flask_login ==0.6.3
  • flask_restful ==0.3.10
  • flask_wtf ==1.2.1
  • folium ==0.16.0
  • gensim ==4.3.2
  • geopandas ==0.13.2
  • geopy ==2.4.1
  • holidays ==0.29
  • keras ==2.11.0
  • librosa ==0.10.1
  • lxml ==4.9.3
  • matplotlib ==3.7.0
  • mechanize ==0.4.9
  • natsort ==7.1.1
  • networkx ==2.6.3
  • nltk ==3.8
  • numba ==0.55.0
  • numpy ==1.21.2
  • opencv-python-headless ==4.9.0.80
  • openpyxl ==3.1.2
  • pandas ==2.0.3
  • prettytable ==3.10.0
  • psutil ==5.9.5
  • pycryptodome ==3.14.1
  • pyfakefs ==5.4.1
  • pyquery ==1.4.3
  • pytesseract ==0.3.10
  • pytest ==8.2.0
  • python-Levenshtein-wheels *
  • python-dateutil ==2.9.0
  • python-docx ==1.1.0
  • python_http_client ==3.3.7
  • pytz ==2023.3.post1
  • requests ==2.31.0
  • requests_mock ==1.11.0
  • rsa ==4.9
  • scikit-image ==0.18.0
  • scikit-learn ==1.3.1
  • scipy ==1.7.2
  • seaborn ==0.13.2
  • selenium ==4.15
  • sendgrid ==6.11.0
  • shapely ==2.0.4
  • soundfile ==0.12.1
  • statsmodels ==0.14.0
  • sympy ==1.12
  • tensorflow ==2.11.0
  • textblob ==0.18.0
  • texttable ==1.7.0
  • wikipedia ==1.4.0
  • wordcloud ==1.9.3
  • wordninja ==2.0.0
  • xlrd ==2.0.1
  • xlwt ==1.3.0
  • xmltodict ==0.13.0
eval/chat_benchmarks/BigCodeBench/requirements/requirements.txt pypi
  • appdirs >=1.4.4
  • multipledispatch >=0.6.0
  • pqdm >=0.2.0
  • tempdir >=0.7.1
  • tqdm >=4.56.0
  • tree-sitter ==0.21.3
  • tree_sitter_languages >=1.10.2
  • wget >=3.2
eval/chat_benchmarks/LiveBench/livebench/if_runner/instruction_following_eval/requirements.txt pypi
  • absl-py *
  • immutabledict *
  • langdetect *
  • nltk *
eval/chat_benchmarks/LiveBench/pyproject.toml pypi
  • accelerate >=0.21
  • aiohttp *
  • anthropic >=0.3
  • antlr4-python3-runtime ==4.11
  • datasets *
  • fastapi *
  • fschat @git+https://github.com/lm-sys/FastChat#egg=c5223e34babd24c3f9b08205e6751ea6e42c9684
  • httpx *
  • immutabledict *
  • langchain *
  • langdetect *
  • lark *
  • levenshtein >=0.20.4
  • lxml *
  • markdown2 [all]
  • nh3 *
  • nltk *
  • numpy *
  • openai *
  • packaging *
  • pandas ==2.2.2
  • peft *
  • prompt_toolkit >=3.0.0
  • protobuf *
  • psutil *
  • pydantic *
  • pyext ==0.7
  • ray *
  • requests *
  • rich >=10.0.0
  • sentencepiece *
  • shortuuid *
  • sympy >=1.12
  • tenacity *
  • tiktoken *
  • torch *
  • tqdm >=4.62.1
  • transformers >=4.31.0
  • uvicorn *
  • wheel *
eval/chat_benchmarks/HMMT/matharena/pyproject.toml pypi
  • anthropic >=0.49.0
  • antlr4-python3-runtime ==4.11
  • beautifulsoup4 >=4.13.1
  • datasets >= 3.5.0
  • google-genai >=1.11.0
  • json5 >=0.10.0
  • loguru >=0.7.3
  • matplotlib >=3.9.3
  • openai >=1.67.0
  • pandas >=2.2.3
  • python-fasthtml >=0.12.1
  • regex >=2024.11.6
  • requests >=2.32.3
  • seaborn >=0.13.2
  • sympy >=1.13.1
  • together >=1.3.14
requirements.txt pypi
  • aiohttp *
  • antlr4-python3-runtime ==4.11
  • asyncpg *
  • fastapi *
  • psycopg2-binary *
  • python-json-logger *
  • requests *
  • sqlalchemy *
  • uvicorn *