evalchemy

Automatic evals for LLMs

https://github.com/mlfoundations/evalchemy

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization mlfoundations has institutional domain (people.csail.mit.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Automatic evals for LLMs

Basic Info
  • Host: GitHub
  • Owner: mlfoundations
  • Language: HTML
  • Default Branch: main
  • Homepage:
  • Size: 54 MB
Statistics
  • Stars: 522
  • Watchers: 17
  • Forks: 62
  • Open Issues: 28
  • Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme Contributing Citation

README.md

🧪 Evalchemy

A unified and easy-to-use toolkit for evaluating post-trained language models


Evalchemy is developed by the DataComp community and Bespoke Labs and builds on the LM-Eval-Harness.

🎉 What's New

[2025.02.24] New Reasoning Benchmarks

[2025.01.30] API Model Support

```bash
python -m eval.eval \
    --model curator \
    --tasks AIME24,MATH500,GPQADiamond \
    --model_name "gemini/gemini-2.0-flash-thinking-exp-01-21" \
    --apply_chat_template False \
    --model_args 'tokenized_requests=False' \
    --output_path logs
```

[2025.01.29] New Reasoning Benchmarks

  • AIME24, AMC23, MATH500, LiveCodeBench, GPQADiamond, HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have been added to our growing list of available benchmarks, as part of the Open Thoughts project. See our blog post on using Evalchemy to measure reasoning models.

[2025.01.28] New Model Support

  • vLLM models: High-performance inference and serving engine with PagedAttention technology

```bash
python -m eval.eval \
    --model vllm \
    --tasks alpaca_eval \
    --model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
    --batch_size 16 \
    --output_path logs
```

  • OpenAI models: Full support for OpenAI's model lineup

```bash
python -m eval.eval \
    --model openai-chat-completions \
    --tasks alpaca_eval \
    --model_args "model=gpt-4o-mini-2024-07-18,num_concurrent=32" \
    --batch_size 16 \
    --output_path logs
```

Key Features

  • Unified Installation: One-step setup for all benchmarks, eliminating dependency conflicts
  • Parallel Evaluation:
    • Data-Parallel: Distribute evaluations across multiple GPUs for faster results
    • Model-Parallel: Handle large models that don't fit on a single GPU
  • Simplified Usage: Run any benchmark with a consistent command-line interface
  • Results Management:
    • Local results tracking with standardized output format
    • Optional database integration for systematic tracking
    • Leaderboard submission capability (requires database setup)

⚡ Quick Start

Installation

We suggest using conda (installation instructions).

```bash
# Create and activate conda environment
conda create --name evalchemy python=3.10
conda activate evalchemy

# Clone the repo
git clone git@github.com:mlfoundations/evalchemy.git
cd evalchemy

# Install dependencies
pip install -e .
pip install -e eval/chat_benchmarks/alpaca_eval

# Note: On some HPC systems you may need to modify pyproject.toml
# to use absolute paths for the fschat dependency.
# Change: "fschat @ file:eval/chat_benchmarks/MTBench"
# To:     "fschat @ file:///absolute/path/to/evalchemy/eval/chat_benchmarks/MTBench"
# Or remove the entry entirely and separately run:
#   pip install -e eval/chat_benchmarks/MTBench

# Log into HuggingFace for datasets and models
huggingface-cli login
```

📚 Available Tasks

Built-in Benchmarks

Reproduced results for these benchmarks, compared against published numbers, are recorded in reproduced_benchmarks.md.

Basic Usage

If a benchmark requires an LLM judge, make sure OPENAI_API_KEY is set in your environment before running evaluations.

```bash
python -m eval.eval \
    --model hf \
    --tasks HumanEval,mmlu \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs
```

The results are written to output_path. If you have jq installed, you can inspect them easily after evaluation, e.g. jq '.results' logs/Qwen__Qwen2.5-7B-Instruct/results_2024-11-17T17-12-28.668908.json
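
If you prefer Python to jq, here is a minimal sketch for loading a results file; it only assumes the top-level "results" key used in the jq example above:

```python
import json

def load_results(path):
    """Load an Evalchemy results JSON and return the per-task results
    stored under the top-level "results" key (as in `jq '.results'`)."""
    with open(path) as f:
        data = json.load(f)
    return data["results"]
```

You can then iterate over the returned dict, e.g. printing each task name and its metrics.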

Args:

  • --model: Which model type or provider is evaluated (example: hf)
  • --tasks: Comma-separated list of tasks to be evaluated.
  • --model_args: Model path and parameters. Comma-separated list of parameters passed to the model constructor. Accepts a string of the format "arg1=val1,arg2=val2,...". You can find the list of supported arguments here.
  • --batch_size: Batch size for inference
  • --output_path: Directory to save evaluation results
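
Since --model_args accepts a string of the form "arg1=val1,arg2=val2,...", the parsing idea can be sketched as follows (an illustrative helper, not Evalchemy's actual parser):

```python
def parse_model_args(s):
    """Split an "arg1=val1,arg2=val2" string into a dict of strings."""
    args = {}
    for pair in s.split(","):
        if not pair:
            continue  # tolerate empty segments such as trailing commas
        key, _, value = pair.partition("=")
        args[key.strip()] = value.strip()
    return args
```

For example, parse_model_args("pretrained=mistralai/Mistral-7B-Instruct-v0.3,parallelize=True") yields {"pretrained": "mistralai/Mistral-7B-Instruct-v0.3", "parallelize": "True"}.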

Example running multiple benchmarks:

```bash
python -m eval.eval \
    --model hf \
    --tasks MTBench,WildBench,alpaca_eval \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs
```

Config shortcuts:

To reuse commonly used settings without supplying full arguments every time, we support reading eval configs from YAML files. These configs replace the --batch_size, --tasks, and --annotator_model arguments. Example config files can be found in ./configs. To use them, pass the --config flag as shown below:

```bash
python -m eval.eval \
    --model hf \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --output_path logs \
    --config configs/light_gpt4omini0718.yaml
```
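
A hypothetical config sketch is shown below; the key names are assumptions mirroring the flags they replace, so consult the files in ./configs for the actual schema:

```yaml
# Hypothetical example; see ./configs for real config files.
tasks: MTBench,alpaca_eval
batch_size: 2
annotator_model: gpt-4o-mini-2024-07-18
```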

We provide several more command examples in eval/examples to help you get started with Evalchemy.

🔧 Advanced Usage

Support for different models

Through LM-Eval-Harness, we support all HuggingFace models and are currently adding support for all LM-Eval-Harness model types, such as OpenAI and vLLM. For more information on these models, please check out the models page.

To choose a model, simply set pretrained=<model_name> in --model_args, where the model name is either a HuggingFace model name or a path to a local model.

HPC Distributed Evaluation

For even faster evaluation, use full data parallelism and launch a vLLM process for each GPU.

We have also made this easy to do at scale across multiple nodes on HPC (High-Performance Computing) clusters:

```bash
python eval/distributed/launch.py --model_name <model_id> --tasks <task_list> --num_shards <n> --watchdog
```

Key features:
  • Run evaluations in parallel across multiple compute nodes
  • Dramatically reduce wall-clock time for large benchmarks
  • Offline mode support for environments without internet access on GPU nodes
  • Automatic cluster detection and configuration
  • Efficient result collection and scoring

Refer to the distributed README for more details.

NOTE: This is configured for specific HPC clusters, but can easily be adapted. It can also be adapted to a non-HPC setup by using CUDA_VISIBLE_DEVICES instead of SLURM job arrays.
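
To illustrate the non-HPC route, here is a rough sketch (a hypothetical helper, not part of the repo) that pins one worker process per GPU via CUDA_VISIBLE_DEVICES in place of a SLURM job array:

```python
import os
import subprocess

def launch_per_gpu(cmd_template, num_gpus):
    """Start one shard per GPU, each pinned to a single device with
    CUDA_VISIBLE_DEVICES, then wait for all of them to finish.
    cmd_template receives the shard index via {shard}."""
    procs = []
    for gpu in range(num_gpus):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(
            cmd_template.format(shard=gpu), shell=True, env=env))
    return [p.wait() for p in procs]  # one exit code per shard
```

The exact per-shard flags to pass to eval/distributed/launch.py are described in the distributed README.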

Multi-GPU Evaluation

NOTE: this is slower than doing fully data parallel evaluation (see previous section)

```bash
accelerate launch --num-processes <num-gpus> --num-machines <num-nodes> \
    --multi-gpu -m eval.eval \
    --model hf \
    --tasks MTBench,alpaca_eval \
    --model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3' \
    --batch_size 2 \
    --output_path logs
```

Large Model Evaluation

For models that don't fit on a single GPU, use model parallelism:

```bash
python -m eval.eval \
    --model hf \
    --tasks MTBench,alpaca_eval \
    --model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3,parallelize=True' \
    --batch_size 2 \
    --output_path logs
```

💡 Note: While "auto" batch size is supported, we recommend manually tuning the batch size for optimal performance. The optimal batch size depends on the model size, GPU memory, and the specific benchmark. We used a maximum of 32 and a minimum of 4 (for RepoBench) to evaluate Llama-3-8B-Instruct on 8xH100 GPUs.

Output Log Structure

Our generated logs include critical information about each evaluation to help inform your experiments. The most important items are highlighted below.

  • Model Configuration
    • model: Model framework used
    • model_args: Model arguments for the model framework
    • batch_size: Size of processing batches
    • device: Computing device specification
    • annotator_model: Model used for annotation ("gpt-4o-mini-2024-07-18")
  • Seed Configuration
    • random_seed: General random seed
    • numpy_seed: NumPy-specific seed
    • torch_seed: PyTorch-specific seed
    • fewshot_seed: Seed for few-shot examples
  • Model Details

    • model_num_parameters: Number of model parameters
    • model_dtype: Model data type
    • model_revision: Model version
    • model_sha: Model commit hash
  • Version Control

    • git_hash: Repository commit hash
    • date: Unix timestamp of evaluation
    • transformers_version: Hugging Face Transformers version
  • Tokenizer Configuration

    • tokenizer_pad_token: Padding token details
    • tokenizer_eos_token: End of sequence token
    • tokenizer_bos_token: Beginning of sequence token
    • eot_token_id: End of text token ID
    • max_length: Maximum sequence length
  • Model Settings

    • model_source: Model source platform
    • model_name: Full model identifier
    • model_name_sanitized: Sanitized model name for file system usage
    • chat_template: Conversation template
    • chat_template_sha: Template hash
  • Timing Information

    • start_time: Evaluation start timestamp
    • end_time: Evaluation end timestamp
    • total_evaluation_time_seconds: Total duration
  • Hardware Environment

    • PyTorch version and build configuration
    • Operating system details
    • GPU configuration
    • CPU specifications
    • CUDA and driver versions
    • Relevant library versions
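
As an example of the sanitization step, a model id containing a path separator must be rewritten before it can name a log directory (compare the logs/Qwen__Qwen2.5-7B-Instruct path in the jq example earlier). A sketch of the idea; the harness's exact rule may differ:

```python
import re

def sanitize_model_name(name):
    """Replace filesystem-unsafe characters with "__" so a model id
    such as "Qwen/Qwen2.5-7B-Instruct" can be used as a directory name."""
    return re.sub(r"[/\\:\s]", "__", name)
```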

Customizing Evaluation

🤖 Change Annotator Model

As part of Evalchemy, we want to make it easy to swap in different language-model judges for standard benchmarks. Currently, we support two judge settings. The first is the default setting, which uses a benchmark's default judge. To activate it, either do nothing or pass:

```bash
--annotator_model auto
```

In addition to the default assignments, we support using gpt-4o-mini-2024-07-18 as a judge:

```bash
--annotator_model gpt-4o-mini-2024-07-18
```

We are planning on adding support for different judges in the future!

⏱️ Runtime and Cost Analysis

Evalchemy makes running common benchmarks simple, fast, and versatile! Below we list the runtimes and costs we achieve with Evalchemy for each benchmark, using Meta-Llama-3-8B-Instruct on 8xH100 GPUs.

| Benchmark | Runtime (8xH100) | Batch Size | Total Tokens | Default Judge Cost ($) | GPT-4o-mini Judge Cost ($) | Notes |
|-----------|------------------|------------|--------------|------------------------|----------------------------|-------|
| MTBench | 14:00 | 32 | ~196K | 6.40 | 0.05 | |
| WildBench | 38:00 | 32 | ~2.2M | 30.00 | 0.43 | |
| RepoBench | 46:00 | 4 | ~23K | - | - | Lower batch size due to memory |
| MixEval | 13:00 | 32 | ~4-6M | 3.36 | 0.76 | Varies by judge model |
| AlpacaEval | 16:00 | 32 | ~936K | 9.40 | 0.14 | |
| HumanEval | 4:00 | 32 | ~300 | - | - | No API costs |
| IFEval | 1:30 | 32 | ~550 | - | - | No API costs |
| ZeroEval | 1:44:00 | 32 | ~8K | - | - | Longest runtime |
| MBPP | 6:00 | 32 | 500 | - | - | No API costs |
| MMLU | 7:00 | 32 | 500 | - | - | No API costs |
| ARC | 4:00 | 32 | - | - | - | No API costs |
| DROP | 20:00 | 32 | - | - | - | No API costs |

Notes:
  • Runtimes measured using 8x H100 GPUs with the Meta-Llama-3-8B-Instruct model
  • Batch sizes optimized for memory and speed
  • API costs vary based on judge model choice

Cost-Saving Tips:
  • Use the gpt-4o-mini-2024-07-18 judge when possible for significant cost savings
  • Adjust batch size based on available memory
  • Consider using data-parallel evaluation for faster results

🔐 Special Access Requirements

ZeroEval Access

To run ZeroEval benchmarks, you need to:

  1. Request access to the ZebraLogicBench-private dataset on Hugging Face
  2. Accept the terms and conditions
  3. Log in to your Hugging Face account when running evaluations

🛠️ Implementing Custom Evaluations

To add a new evaluation system:

  1. Create a new directory under eval/chat_benchmarks/
  2. Implement eval_instruct.py with two required functions:
    • eval_instruct(model): Takes an LM Eval Model, returns results dict
    • evaluate(results): Takes results dictionary, returns evaluation metrics
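
The two required functions can be sketched as a minimal skeleton; everything in the bodies below is illustrative, and only the file name, function names, and signatures come from the steps above:

```python
# eval/chat_benchmarks/<your_benchmark>/eval_instruct.py (illustrative skeleton)

def eval_instruct(model):
    """Take an LM Eval Model and return a results dict.
    A real implementation would generate model outputs here; this stub
    returns a hard-coded example to show the expected shape."""
    return {"examples": [{"prompt": "2+2?", "output": "4", "target": "4"}]}

def evaluate(results):
    """Take the results dictionary and return evaluation metrics."""
    examples = results["examples"]
    correct = sum(ex["output"] == ex["target"] for ex in examples)
    return {"accuracy": correct / len(examples)}
```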

Adding External Evaluation Repositories

Use git subtree to manage external evaluation code:

```bash
# Add external repository
git subtree add --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Pull updates
git subtree pull --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Push contributions back
git subtree push --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git contribution-branch
```

🔍 Debug Mode

To run evaluations in debug mode, add the --debug flag:

```bash
python -m eval.eval \
    --model hf \
    --tasks MTBench \
    --model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
    --batch_size 2 \
    --output_path logs \
    --debug
```

This is particularly useful when testing new evaluation implementations, debugging model configurations, verifying dataset access, and testing database connectivity.

🚀 Performance Tips

  1. Utilize batch processing for faster evaluation:

```python
all_instances.append(
    Instance(
        "generate_until",
        example,
        (
            inputs,
            {
                "max_new_tokens": 1024,
                "do_sample": False,
            },
        ),
        idx,
    )
)

outputs = self.compute(model, all_instances)
```

  2. Use the LM-eval logger for consistent logging across evaluations

🔧 Troubleshooting

Evalchemy has been tested on CUDA 12.4. If you run into issues like undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12, try updating your CUDA version:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
```

🏆 Leaderboard Integration

To track experiments and evaluations, we support logging results to a PostgreSQL database. Details on the entry schemas and database setup can be found in database/.

Contributing

Thank you to all the contributors for making this project possible! Please follow these instructions on how to contribute.

Citation

If you find Evalchemy useful, please consider citing us!

```bibtex
@software{evalchemy,
  author = {Raoof, Negin and Guha, Etash Kumar and Marten, Ryan and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Bansal, Hritik and Smyrnis, Georgios and Nezhurina, Marianna and Vu, Trung and Sprague, Zayne Rea and Merrill, Mike A and Chen, Liangyu and Choi, Caroline and Khan, Zaid and Grover, Sachin and Feuer, Benjamin and Suvarna, Ashima and Su, Shiye and Zhao, Wanjia and Sharma, Kartik and Ji, Charlie Cheng-Jie and Arora, Kushal and Li, Jeffrey and Gokaslan, Aaron and Pratt, Sarah M and Muennighoff, Niklas and Saad-Falcon, Jon and Yang, John and Aali, Asad and Pimpalgaonkar, Shreyas and Albalak, Alon and Dave, Achal and Pouransari, Hadi and Durrett, Greg and Oh, Sewoong and Hashimoto, Tatsunori and Shankar, Vaishaal and Choi, Yejin and Bansal, Mohit and Hegde, Chinmay and Heckel, Reinhard and Jitsev, Jenia and Sathiamoorthy, Maheswaran and Dimakis, Alex and Schmidt, Ludwig},
  month = jun,
  title = {{Evalchemy: Automatic evals for LLMs}},
  year = {2025}
}
```

Owner

  • Name: mlfoundations
  • Login: mlfoundations
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software in your work, please cite it as below."
title: "Evalchemy"
authors:
  - family-names: "Guha"
    given-names: "Etash"
  - family-names: "Raoff"
    given-names: "Negin"
  - family-names: "Mercat"
    given-names: "Jean"
  - family-names: "Marten"
    given-names: "Ryan"
  - family-names: "Frankel"
    given-names: "Eric"
  - family-names: "Keh"
    given-names: "Sedrick"
  - family-names: "Grover"
    given-names: "Sachin"
  - family-names: "Smyrnis"
    given-names: "George"
  - family-names: "Vu"
    given-names: "Trung"
  - family-names: "Saad-Falcon"
    given-names: "Jon"
  - family-names: "Choi"
    given-names: "Caroline"
  - family-names: "Arora"
    given-names: "Kushal"
  - family-names: "Merrill"
    given-names: "Mike"
  - family-names: "Deng"
    given-names: "Yichuan"
  - family-names: "Suvarna"
    given-names: "Ashima"
  - family-names: "Bansal"
    given-names: "Hritik"
  - family-names: "Nezhurina"
    given-names: "Marianna"
  - family-names: "Heckel"
    given-names: "Reinhard"
  - family-names: "Oh" 
    given-names: "Seewong"
  - family-names: "Hashimoto"
    given-names: "Tatsunori"
  - family-names: "Jitsev"
    given-names: "Jenia"
  - family-names: "Choi"
    given-names: "Yejin"
  - family-names: "Shankar"
    given-names: "Vaishaal"
  - family-names: "Dimakis"
    given-names: "Alex"
  - family-names: "Sathiamoorthy"
    given-names: "Mahesh"
  - family-names: "Schmidt"
    given-names: "Ludwig"
  
date-released: "2024-11-28"
repository: "https://github.com/mlfoundations/evalchemy"
publisher: "GitHub"
type: "software"

GitHub Events

Total
  • Create event: 61
  • Issues event: 32
  • Watch event: 425
  • Delete event: 37
  • Issue comment event: 100
  • Member event: 5
  • Push event: 619
  • Public event: 1
  • Pull request review comment event: 23
  • Pull request review event: 55
  • Pull request event: 148
  • Fork event: 59
Last Year
  • Create event: 61
  • Issues event: 32
  • Watch event: 425
  • Delete event: 37
  • Issue comment event: 100
  • Member event: 5
  • Push event: 619
  • Public event: 1
  • Pull request review comment event: 23
  • Pull request review event: 55
  • Pull request event: 148
  • Fork event: 59

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 24
  • Total pull requests: 163
  • Average time to close issues: 14 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 19
  • Total pull request authors: 23
  • Average comments per issue: 0.38
  • Average comments per pull request: 0.67
  • Merged pull requests: 131
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 24
  • Pull requests: 163
  • Average time to close issues: 14 days
  • Average time to close pull requests: 4 days
  • Issue authors: 19
  • Pull request authors: 23
  • Average comments per issue: 0.38
  • Average comments per pull request: 0.67
  • Merged pull requests: 131
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • RyanMarten (4)
  • penfever (2)
  • sravan500 (2)
  • luckyfan-cs (1)
  • EtashGuha (1)
  • TundeAtSN (1)
  • richardbaihe (1)
  • aashay-sarvam (1)
  • Siki-cloud (1)
  • chanansh (1)
  • marianna13 (1)
  • noobimp (1)
  • BeastyZ (1)
  • juyongjiang (1)
  • slimfrkha (1)
Pull Request Authors
  • neginraoof (32)
  • sedrick-keh-tri (24)
  • RyanMarten (22)
  • jmercat (21)
  • EtashGuha (11)
  • penfever (6)
  • marianna13 (6)
  • ssu53 (6)
  • GeorgiosSmyrnis (4)
  • asad-aali (4)
  • Zayne-sprague (3)
  • Hritikbansal (3)
  • reinhardh (3)
  • jonsaadfalcon (2)
  • younesbelkada (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

eval/chat_benchmarks/MTBench/docker/Dockerfile docker
  • nvidia/cuda 12.2.0-runtime-ubuntu20.04 build
eval/chat_benchmarks/MTBench/docker/docker-compose.yml docker
  • fastchat latest
eval/chat_benchmarks/IFEval/requirements.txt pypi
  • absl *
  • immutabledict *
  • langdetect *
  • nltk *
eval/chat_benchmarks/MTBench/pyproject.toml pypi
  • aiohttp *
  • fastapi *
  • httpx *
  • markdown2 [all]
  • nh3 *
  • numpy *
  • prompt_toolkit >=3.0.0
  • psutil *
  • pydantic <3,>=2.0.0
  • pydantic-settings *
  • requests *
  • rich >=10.0.0
  • shortuuid *
  • tiktoken *
  • uvicorn *
eval/chat_benchmarks/MixEval/setup.py pypi
  • Andere *
  • SentencePiece >=0.2.0
  • accelerate >=0.30.1
  • hf_transfer >=0.1.6
  • httpx >=0.27.0
  • nltk >=3.8.1
  • numpy >=1.26.3
  • openai >=1.30.5
  • pandas >=2.2.2
  • prettytable *
  • python-dotenv >=1.0.1
  • scikit-learn >=1.5.0
  • tiktoken >=0.6.0
  • tqdm >=4.66.4
  • transformers >=4.43.1
eval/chat_benchmarks/RepoBench/requirements.txt pypi
  • codebleu *
  • difflib *
  • fire *
  • fuzzywuzzy *
  • openai *
  • python-Levenshtein *
  • torch *
  • tqdm *
  • transformers *
  • tree-sitter-java *
  • tree-sitter-python *
eval/chat_benchmarks/WildBench/requirements.txt pypi
  • anthropic *
  • cohere *
  • datasets *
  • fire *
  • google-generativeai *
  • jsonlines *
  • mistralai ==0.4.2
  • openai *
  • reka-api *
  • tenacity *
  • together *
eval/chat_benchmarks/alpaca_eval/requirements.txt pypi
  • datasets >=2.20.0
  • fire *
  • openai >=1.5.0
  • pandas *
  • patsy *
  • scikit-learn *
  • scipy *
  • tiktoken >=0.3.2
  • tqdm *
eval/chat_benchmarks/alpaca_eval/setup.py pypi
  • datasets *
  • fire *
  • huggingface_hub *
  • openai >=1.5.0
  • pandas *
  • patsy *
  • python-dotenv *
  • scikit-learn *
  • scipy *
  • tiktoken >=0.3.2
eval/chat_benchmarks/zeroeval/requirements.txt pypi
  • anthropic *
  • cohere *
  • datasets *
  • fire *
  • google-generativeai *
  • jsonlines *
  • mistralai <1.0
  • openai *
  • reka-api *
  • tenacity *
  • together *
pyproject.toml pypi
  • accelerate *
  • aiofiles *
  • aiohttp [speedups]>=3.8
  • anthropic *
  • asttokens *
  • backoff >=2.2
  • bitsandbytes *
  • black *
  • boto3 *
  • botocore *
  • bs4 *
  • codebleu *
  • cohere *
  • colorama *
  • dashscope *
  • datasets *
  • distro *
  • docker-pycreds *
  • evaluate *
  • faiss-cpu ==1.7.4
  • fastapi >=0.101.0
  • fastavro *
  • fasttext-wheel *
  • fire *
  • fschat *
  • fsspec ==2024.6.1
  • fuzzywuzzy *
  • gcsfs *
  • google-auth ==2.25.1
  • google-cloud-aiplatform *
  • google-generativeai *
  • hf-transfer *
  • hjson *
  • httpx *
  • httpx-sse *
  • huggingface_hub [cli]
  • immutabledict *
  • jiter *
  • jmespath *
  • joblib *
  • jsonpickle *
  • langdetect *
  • loguru >=0.7
  • lxml *
  • mistralai *
  • msgpack *
  • mypy_extensions *
  • nltk *
  • numpy *
  • openai *
  • optimum ==1.12.0
  • pandas *
  • patsy *
  • peft *
  • portalocker *
  • prettytable *
  • protobuf *
  • psycopg [binary]
  • psycopg2-binary *
  • py-cpuinfo *
  • python-Levenshtein *
  • python-box *
  • python-dotenv *
  • ray [default]
  • reka-api *
  • requests >=2.28
  • sacrebleu *
  • sagemaker *
  • scikit-learn *
  • scipy *
  • sentence-transformers >=2.2.2
  • sentencepiece *
  • sentry-sdk *
  • shortuuid *
  • sqlalchemy *
  • sqlitedict *
  • tabulate *
  • tenacity *
  • tensorboard *
  • termcolor *
  • threadpoolctl *
  • tiktoken *
  • together *
  • torch *
  • torchvision *
  • tqdm *
  • transformers *
  • tree-sitter-java *
  • tree-sitter-python *
  • tree_sitter *
  • trl *
  • typing_inspect *
  • uvicorn >=0.23.0
  • wandb *
  • websocket *
.github/workflows/black.yaml actions
  • actions/checkout v4 composite
  • psf/black stable composite
eval/chat_benchmarks/BigCodeBench/docker/Dockerfile docker
  • nvcr.io/nvidia/pytorch 24.09-py3 build
eval/chat_benchmarks/MultiPLE/docker/Dockerfile docker
  • nvcr.io/nvidia/pytorch 24.09-py3 build
eval/chat_benchmarks/BigCodeBench/requirements/requirements-eval.txt pypi
  • Django ==4.2.7
  • Faker ==20.1.0
  • Flask-Mail ==0.9.1
  • Levenshtein ==0.25.0
  • Pillow ==10.3.0
  • PyYAML ==6.0.1
  • Requests ==2.31.0
  • WTForms ==3.1.2
  • Werkzeug ==3.0.1
  • beautifulsoup4 ==4.8.2
  • blake3 ==0.4.1
  • chardet ==5.2.0
  • cryptography ==38.0.0
  • datetime ==5.5
  • dnspython ==2.6.1
  • docxtpl ==0.11.5
  • flask ==3.0.3
  • flask_login ==0.6.3
  • flask_restful ==0.3.10
  • flask_wtf ==1.2.1
  • folium ==0.16.0
  • gensim ==4.3.2
  • geopandas ==0.13.2
  • geopy ==2.4.1
  • holidays ==0.29
  • keras ==2.11.0
  • librosa ==0.10.1
  • lxml ==4.9.3
  • matplotlib ==3.7.0
  • mechanize ==0.4.9
  • natsort ==7.1.1
  • networkx ==2.6.3
  • nltk ==3.8
  • numba ==0.55.0
  • numpy ==1.21.2
  • opencv-python-headless ==4.9.0.80
  • openpyxl ==3.1.2
  • pandas ==2.0.3
  • prettytable ==3.10.0
  • psutil ==5.9.5
  • pycryptodome ==3.14.1
  • pyfakefs ==5.4.1
  • pyquery ==1.4.3
  • pytesseract ==0.3.10
  • pytest ==8.2.0
  • python-Levenshtein-wheels *
  • python-dateutil ==2.9.0
  • python-docx ==1.1.0
  • python_http_client ==3.3.7
  • pytz ==2023.3.post1
  • requests ==2.31.0
  • requests_mock ==1.11.0
  • rsa ==4.9
  • scikit-image ==0.18.0
  • scikit-learn ==1.3.1
  • scipy ==1.7.2
  • seaborn ==0.13.2
  • selenium ==4.15
  • sendgrid ==6.11.0
  • shapely ==2.0.4
  • soundfile ==0.12.1
  • statsmodels ==0.14.0
  • sympy ==1.12
  • tensorflow ==2.11.0
  • textblob ==0.18.0
  • texttable ==1.7.0
  • wikipedia ==1.4.0
  • wordcloud ==1.9.3
  • wordninja ==2.0.0
  • xlrd ==2.0.1
  • xlwt ==1.3.0
  • xmltodict ==0.13.0
eval/chat_benchmarks/BigCodeBench/requirements/requirements.txt pypi
  • appdirs >=1.4.4
  • multipledispatch >=0.6.0
  • pqdm >=0.2.0
  • tempdir >=0.7.1
  • tqdm >=4.56.0
  • tree-sitter ==0.21.3
  • tree_sitter_languages >=1.10.2
  • wget >=3.2
eval/chat_benchmarks/LiveBench/livebench/if_runner/instruction_following_eval/requirements.txt pypi
  • absl-py *
  • immutabledict *
  • langdetect *
  • nltk *
eval/chat_benchmarks/LiveBench/pyproject.toml pypi
  • accelerate >=0.21
  • aiohttp *
  • anthropic >=0.3
  • antlr4-python3-runtime ==4.11
  • datasets *
  • fastapi *
  • fschat @git+https://github.com/lm-sys/FastChat#egg=c5223e34babd24c3f9b08205e6751ea6e42c9684
  • httpx *
  • immutabledict *
  • langchain *
  • langdetect *
  • lark *
  • levenshtein >=0.20.4
  • lxml *
  • markdown2 [all]
  • nh3 *
  • nltk *
  • numpy *
  • openai *
  • packaging *
  • pandas ==2.2.2
  • peft *
  • prompt_toolkit >=3.0.0
  • protobuf *
  • psutil *
  • pydantic *
  • pyext ==0.7
  • ray *
  • requests *
  • rich >=10.0.0
  • sentencepiece *
  • shortuuid *
  • sympy >=1.12
  • tenacity *
  • tiktoken *
  • torch *
  • tqdm >=4.62.1
  • transformers >=4.31.0
  • uvicorn *
  • wheel *
eval/chat_benchmarks/HMMT/matharena/pyproject.toml pypi
  • anthropic >=0.49.0
  • antlr4-python3-runtime ==4.11
  • beautifulsoup4 >=4.13.1
  • datasets >= 3.5.0
  • google-genai >=1.11.0
  • json5 >=0.10.0
  • loguru >=0.7.3
  • matplotlib >=3.9.3
  • openai >=1.67.0
  • pandas >=2.2.3
  • python-fasthtml >=0.12.1
  • regex >=2024.11.6
  • requests >=2.32.3
  • seaborn >=0.13.2
  • sympy >=1.13.1
  • together >=1.3.14
requirements.txt pypi
  • aiohttp *
  • antlr4-python3-runtime ==4.11
  • asyncpg *
  • fastapi *
  • psycopg2-binary *
  • python-json-logger *
  • requests *
  • sqlalchemy *
  • uvicorn *