Science Score: 62.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ✓ Institutional organization owner: organization mlfoundations has institutional domain (people.csail.mit.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.2%) to scientific vocabulary
Repository
Automatic evals for LLMs
Basic Info
Statistics
- Stars: 522
- Watchers: 17
- Forks: 62
- Open Issues: 28
- Releases: 0
Metadata Files
README.md
🧪 Evalchemy
A unified and easy-to-use toolkit for evaluating post-trained language models

Evalchemy is developed by the DataComp community and Bespoke Labs and builds on the LM-Eval-Harness.
🎉 What's New
[2025.02.24] New Reasoning Benchmarks
- AIME25 and Alice in Wonderland have been added to available benchmarks.
[2025.01.30] API Model Support
- API models via Curator: with --model curator you can now evaluate even more API-based models via Curator, including all those supported by LiteLLM:
python -m eval.eval \
--model curator \
--tasks AIME24,MATH500,GPQADiamond \
--model_name "gemini/gemini-2.0-flash-thinking-exp-01-21" \
--apply_chat_template False \
--model_args 'tokenized_requests=False' \
--output_path logs
[2025.01.29] New Reasoning Benchmarks
- AIME24, AMC23, MATH500, LiveCodeBench, GPQADiamond, HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have been added to our growing list of available benchmarks. This is part of the effort in the Open Thoughts project. See our blog post on using Evalchemy for measuring reasoning models.
[2025.01.28] New Model Support
- vLLM models: High-performance inference and serving engine with PagedAttention technology
- vLLM models: High-performance inference and serving engine with PagedAttention technology
bash
python -m eval.eval \
--model vllm \
--tasks alpaca_eval \
--model_args "pretrained=meta-llama/Meta-Llama-3-8B-Instruct" \
--batch_size 16 \
--output_path logs
- OpenAI models: Full support for OpenAI's model lineup
bash
python -m eval.eval \
--model openai-chat-completions \
--tasks alpaca_eval \
--model_args "model=gpt-4o-mini-2024-07-18,num_concurrent=32" \
--batch_size 16 \
--output_path logs
Key Features
- Unified Installation: One-step setup for all benchmarks, eliminating dependency conflicts
- Parallel Evaluation:
- Data-Parallel: Distribute evaluations across multiple GPUs for faster results
- Model-Parallel: Handle large models that don't fit on a single GPU
- Simplified Usage: Run any benchmark with a consistent command-line interface
- Results Management:
- Local results tracking with standardized output format
- Optional database integration for systematic tracking
- Leaderboard submission capability (requires database setup)
⚡ Quick Start
Installation
We suggest using conda (installation instructions).
```bash
# Create and activate conda environment
conda create --name evalchemy python=3.10
conda activate evalchemy

# Clone the repo
git clone git@github.com:mlfoundations/evalchemy.git
cd evalchemy

# Install dependencies
pip install -e .
pip install -e eval/chat_benchmarks/alpaca_eval

# Note: On some HPC systems you may need to modify pyproject.toml
# to use absolute paths for the fschat dependency:
# Change: "fschat @ file:eval/chat_benchmarks/MTBench"
# To:     "fschat @ file:///absolute/path/to/evalchemy/eval/chat_benchmarks/MTBench"
# Or remove it entirely and separately run:
# pip install -e eval/chat_benchmarks/MTBench

# Log into HuggingFace for datasets and models
huggingface-cli login
```
📚 Available Tasks
Built-in Benchmarks
- All tasks from LM Evaluation Harness
- Custom instruction-based tasks (found in eval/chat_benchmarks/):
- MTBench: Multi-turn dialogue evaluation benchmark
- WildBench: Real-world task evaluation
- RepoBench: Code understanding and repository-level tasks
- MixEval: Comprehensive evaluation across domains
- IFEval: Instruction following capability evaluation
- AlpacaEval: Instruction following evaluation
- HumanEval: Code generation and problem solving
- HumanEvalPlus: HumanEval with more test cases
- ZeroEval: Logical reasoning and problem solving
- MBPP: Python programming benchmark
- MBPPPlus: MBPP with more test cases
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
🚨 Warning: for BigCodeBench evaluation, we strongly recommend using a Docker container, since executing LLM-generated code on your machine can lead to destructive outcomes. More info is here.
- MultiPL-E: Multi-Programming Language Evaluation of Large Language Models of Code
- CRUXEval: Code Reasoning, Understanding, and Execution Evaluation
- AIME24: Math Reasoning Dataset
- AIME25: Math Reasoning Dataset
- AMC23: Math Reasoning Dataset
- MATH500: Math Reasoning Dataset split from Let's Verify Step by Step
- LiveCodeBench: Benchmark of LLMs for code
- LiveBench: A benchmark for LLMs designed with test set contamination and objective evaluation in mind
- GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in LLMs
- Arena-Hard-Auto (Coming soon): Automatic evaluation tool for instruction-tuned LLMs
- SWE-Bench (Coming soon): Evaluating large language models on real-world software issues
- SafetyBench (Coming soon): Evaluating the safety of LLMs
- SciCode Bench (Coming soon): Evaluate language models in generating code for solving realistic scientific research problems
- Berkeley Function Calling Leaderboard (Coming soon): Evaluating the ability of LLMs to use APIs
We have recorded reproduced results against published numbers for these benchmarks in reproduced_benchmarks.md.
Basic Usage
Make sure your OPENAI_API_KEY is set in your environment before running evaluations, if an LLM judge is required.
bash
python -m eval.eval \
--model hf \
--tasks HumanEval,mmlu \
--model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
--batch_size 2 \
--output_path logs
The results will be written out in output_path. If you have jq installed, you can view the results easily after evaluation. Example: jq '.results' logs/Qwen__Qwen2.5-7B-Instruct/results_2024-11-17T17-12-28.668908.json
Args:
- --model: Which model type or provider is evaluated (example: hf)
- --tasks: Comma-separated list of tasks to be evaluated
- --model_args: Model path and parameters. Comma-separated list of parameters passed to the model constructor. Accepts a string of the format "arg1=val1,arg2=val2,...". You can find the list of supported arguments here.
- --batch_size: Batch size for inference
- --output_path: Directory to save evaluation results
Example running multiple benchmarks:
bash
python -m eval.eval \
--model hf \
--tasks MTBench,WildBench,alpaca_eval \
--model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
--batch_size 2 \
--output_path logs
Config shortcuts:
To reuse commonly used settings without manually supplying the full arguments every time, we support reading eval configs from YAML files. These configs replace the --batch_size, --tasks, and --annotator_model arguments. Some example config files can be found in ./configs. To use them, pass the --config flag as shown below:
bash
python -m eval.eval \
--model hf \
--model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
--output_path logs \
--config configs/light_gpt4omini0718.yaml
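As a sketch, such a config file might look like the following; the key names are inferred from the flags they replace, and the real schema should be taken from the examples in ./configs:

```yaml
# Hypothetical config sketch; see ./configs for the actual schema.
tasks: MTBench,alpaca_eval
batch_size: 2
annotator_model: gpt-4o-mini-2024-07-18
```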
We provide several more example commands in eval/examples to help you get started with Evalchemy.
🔧 Advanced Usage
Support for different models
Through LM-Eval-Harness, we support all HuggingFace models and are currently adding support for all LM-Eval-Harness models, such as OpenAI and vLLM. For more information on these models, please check out the models page.
To choose a model, simply set `pretrained=<name of the HuggingFace model>` in --model_args.
HPC Distributed Evaluation
For even faster evaluation, use full data parallelism and launch a vLLM process for each GPU.
We have also made this easy to do at scale across multiple nodes on HPC (High-Performance Computing) clusters:
bash
python eval/distributed/launch.py --model_name <model_id> --tasks <task_list> --num_shards <n> --watchdog
Key features:
- Run evaluations in parallel across multiple compute nodes
- Dramatically reduce wall-clock time for large benchmarks
- Offline mode support for environments without internet access on GPU nodes
- Automatic cluster detection and configuration
- Efficient result collection and scoring
Refer to the distributed README for more details.
NOTE: This is configured for specific HPC clusters, but can easily be adapted. Furthermore, it can be adapted for a non-HPC setup by using CUDA_VISIBLE_DEVICES instead of SLURM job arrays.
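As a rough sketch of that non-HPC adaptation (a dry run that only prints the per-GPU commands; the flags shown are illustrative placeholders, not launch.py's actual interface):

```shell
#!/usr/bin/env bash
# Dry-run sketch: one evaluation shard per GPU, each pinned via CUDA_VISIBLE_DEVICES.
NUM_GPUS=8
CMDS=()
for i in $(seq 0 $((NUM_GPUS - 1))); do
  # Flags below are illustrative placeholders, not launch.py's real interface.
  CMDS+=("CUDA_VISIBLE_DEVICES=$i python -m eval.eval --model vllm --output_path logs/shard_$i")
done
printf '%s\n' "${CMDS[@]}"
```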
Multi-GPU Evaluation
NOTE: this is slower than doing fully data parallel evaluation (see previous section)
bash
accelerate launch --num-processes <num-gpus> --num-machines <num-nodes> \
--multi-gpu -m eval.eval \
--model hf \
--tasks MTBench,alpaca_eval \
--model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3' \
--batch_size 2 \
--output_path logs
Large Model Evaluation
For models that don't fit on a single GPU, use model parallelism:
bash
python -m eval.eval \
--model hf \
--tasks MTBench,alpaca_eval \
--model_args 'pretrained=mistralai/Mistral-7B-Instruct-v0.3,parallelize=True' \
--batch_size 2 \
--output_path logs
💡 Note: While "auto" batch size is supported, we recommend manually tuning the batch size for optimal performance. The optimal batch size depends on the model size, GPU memory, and the specific benchmark. We used a maximum of 32 and a minimum of 4 (for RepoBench) to evaluate Llama-3-8B-Instruct on 8xH100 GPUs.
Output Log Structure
Our generated logs include critical information about each evaluation to help inform your experiments. The most important items are highlighted below.
Model Configuration
- model: Model framework used
- model_args: Model arguments for the model framework
- batch_size: Size of processing batches
- device: Computing device specification
- annotator_model: Model used for annotation ("gpt-4o-mini-2024-07-18")
Seed Configuration
- random_seed: General random seed
- numpy_seed: NumPy-specific seed
- torch_seed: PyTorch-specific seed
- fewshot_seed: Seed for few-shot examples
Model Details
- model_num_parameters: Number of model parameters
- model_dtype: Model data type
- model_revision: Model version
- model_sha: Model commit hash
Version Control
- git_hash: Repository commit hash
- date: Unix timestamp of evaluation
- transformers_version: Hugging Face Transformers version
Tokenizer Configuration
- tokenizer_pad_token: Padding token details
- tokenizer_eos_token: End-of-sequence token
- tokenizer_bos_token: Beginning-of-sequence token
- eot_token_id: End-of-text token ID
- max_length: Maximum sequence length
Model Settings
- model_source: Model source platform
- model_name: Full model identifier
- model_name_sanitized: Sanitized model name for file-system usage
- chat_template: Conversation template
- chat_template_sha: Template hash
Timing Information
- start_time: Evaluation start timestamp
- end_time: Evaluation end timestamp
- total_evaluation_time_seconds: Total duration
Hardware Environment
- PyTorch version and build configuration
- Operating system details
- GPU configuration
- CPU specifications
- CUDA and driver versions
- Relevant library versions
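As a sketch of how these fields might be consumed, the snippet below pulls a few highlighted items out of a results dict. The JSON layout here is an assumption based on the field names listed above, not the exact Evalchemy schema:

```python
# Hypothetical results-file layout, assumed from the documented field names.
sample = {
    "config": {
        "model": "hf",
        "model_args": "pretrained=mistralai/Mistral-7B-Instruct-v0.3",
        "batch_size": 2,
        "annotator_model": "gpt-4o-mini-2024-07-18",
    },
    "git_hash": "abc1234",
    "total_evaluation_time_seconds": 812.0,
}

def summarize(log: dict) -> str:
    """One-line summary of an evaluation log."""
    cfg = log["config"]
    return (f"{cfg['model']} | {cfg['model_args']} | "
            f"{log['total_evaluation_time_seconds']:.0f}s @ {log['git_hash']}")

print(summarize(sample))
```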
Customizing Evaluation
🤖 Change Annotator Model
As part of Evalchemy, we want to make swapping in different Language Model Judges for standard benchmarks easy. Currently, we support two judge settings. The first is the default setting, where we use a benchmark's default judge. To activate this, you can either do nothing or pass in
bash
--annotator_model auto
In addition to the default assignments, we support using gpt-4o-mini-2024-07-18 as a judge:
bash
--annotator_model gpt-4o-mini-2024-07-18
We are planning on adding support for different judges in the future!
⏱️ Runtime and Cost Analysis
Evalchemy makes running common benchmarks simple, fast, and versatile! We list the speeds and costs for each benchmark we achieve with Evalchemy for Meta-Llama-3-8B-Instruct on 8xH100 GPUs.
| Benchmark | Runtime (8xH100) | Batch Size | Total Tokens | Default Judge Cost ($) | GPT-4o-mini Judge Cost ($) | Notes |
|-----------|------------------|------------|--------------|------------------------|----------------------------|-------|
| MTBench | 14:00 | 32 | ~196K | 6.40 | 0.05 | |
| WildBench | 38:00 | 32 | ~2.2M | 30.00 | 0.43 | |
| RepoBench | 46:00 | 4 | ~23K | - | - | Lower batch size due to memory |
| MixEval | 13:00 | 32 | ~4-6M | 3.36 | 0.76 | Varies by judge model |
| AlpacaEval | 16:00 | 32 | ~936K | 9.40 | 0.14 | |
| HumanEval | 4:00 | 32 | ~300 | - | - | No API costs |
| IFEval | 1:30 | 32 | ~550 | - | - | No API costs |
| ZeroEval | 1:44:00 | 32 | ~8K | - | - | Longest runtime |
| MBPP | 6:00 | 32 | 500 | - | - | No API costs |
| MMLU | 7:00 | 32 | 500 | - | - | No API costs |
| ARC | 4:00 | 32 | - | - | - | No API costs |
| DROP | 20:00 | 32 | - | - | - | No API costs |
Notes:
- Runtimes measured using 8x H100 GPUs with the Meta-Llama-3-8B-Instruct model
- Batch sizes optimized for memory and speed
- API costs vary based on judge model choice

Cost-Saving Tips:
- Use the gpt-4o-mini-2024-07-18 judge when possible for significant cost savings
- Adjust batch size based on available memory
- Consider using data-parallel evaluation for faster results
🔐 Special Access Requirements
ZeroEval Access
To run ZeroEval benchmarks, you need to:
- Request access to the ZebraLogicBench-private dataset on Hugging Face
- Accept the terms and conditions
- Log in to your Hugging Face account when running evaluations
🛠️ Implementing Custom Evaluations
To add a new evaluation system:
- Create a new directory under eval/chat_benchmarks/
- Implement eval_instruct.py with two required functions:
  - eval_instruct(model): Takes an LM Eval Model, returns a results dict
  - evaluate(results): Takes a results dictionary, returns evaluation metrics
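A minimal sketch of such a module follows. The two function names match the contract above; the prompt list, the model's generate call, and the metric are placeholder assumptions, not Evalchemy's actual model API:

```python
# Hypothetical sketch of eval/chat_benchmarks/<your_benchmark>/eval_instruct.py.

PROMPTS = ["What is 2 + 2?", "Name a prime number."]  # placeholder prompts

def eval_instruct(model):
    """Take an LM Eval Model, return a results dict."""
    # Assumption: the model object exposes a simple generate(prompt) call.
    outputs = [model.generate(p) for p in PROMPTS]
    return {"prompts": PROMPTS, "outputs": outputs}

def evaluate(results):
    """Take the results dict, return evaluation metrics."""
    non_empty = sum(1 for o in results["outputs"] if o.strip())
    return {"non_empty_rate": non_empty / len(results["outputs"])}
```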
Adding External Evaluation Repositories
Use git subtree to manage external evaluation code:
```bash
# Add external repository
git subtree add --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Pull updates
git subtree pull --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git main --squash

# Push contributions back
git subtree push --prefix=eval/chat_benchmarks/new_eval https://github.com/original/repo.git contribution-branch
```
🔍 Debug Mode
To run evaluations in debug mode, add the --debug flag:
bash
python -m eval.eval \
--model hf \
--tasks MTBench \
--model_args "pretrained=mistralai/Mistral-7B-Instruct-v0.3" \
--batch_size 2 \
--output_path logs \
--debug
This is particularly useful when testing new evaluation implementations, debugging model configurations, verifying dataset access, and testing database connectivity.
🚀 Performance Tips
- Utilize batch processing for faster evaluation:
```python
all_instances.append(
    Instance(
        "generate_until",
        example,
        (
            inputs,
            {
                "max_new_tokens": 1024,
                "do_sample": False,
            },
        ),
        idx,
    )
)

outputs = self.compute(model, all_instances)
```
- Use the LM-eval logger for consistent logging across evaluations
🔧 Troubleshooting
Evalchemy has been tested on CUDA 12.4. If you run into issues like this: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12, try updating your CUDA version:
bash
wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
🏆 Leaderboard Integration
To track experiments and evaluations, we support logging results to a PostgreSQL database. Details on the entry schemas and database setup can be found in database/.
Contributing
Thank you to all the contributors for making this project possible! Please follow these instructions on how to contribute.
Citation
If you find Evalchemy useful, please consider citing us!
@software{evalchemy,
  author = {Raoof, Negin and Guha, Etash Kumar and Marten, Ryan and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Bansal, Hritik and Smyrnis, Georgios and Nezhurina, Marianna and Vu, Trung and Sprague, Zayne Rea and Merrill, Mike A and Chen, Liangyu and Choi, Caroline and Khan, Zaid and Grover, Sachin and Feuer, Benjamin and Suvarna, Ashima and Su, Shiye and Zhao, Wanjia and Sharma, Kartik and Ji, Charlie Cheng-Jie and Arora, Kushal and Li, Jeffrey and Gokaslan, Aaron and Pratt, Sarah M and Muennighoff, Niklas and Saad-Falcon, Jon and Yang, John and Aali, Asad and Pimpalgaonkar, Shreyas and Albalak, Alon and Dave, Achal and Pouransari, Hadi and Durrett, Greg and Oh, Sewoong and Hashimoto, Tatsunori and Shankar, Vaishaal and Choi, Yejin and Bansal, Mohit and Hegde, Chinmay and Heckel, Reinhard and Jitsev, Jenia and Sathiamoorthy, Maheswaran and Dimakis, Alex and Schmidt, Ludwig},
  month = jun,
  title = {{Evalchemy: Automatic evals for LLMs}},
  year = {2025}
}
Owner
- Name: mlfoundations
- Login: mlfoundations
- Kind: organization
- Website: https://people.csail.mit.edu/ludwigs/
- Repositories: 12
- Profile: https://github.com/mlfoundations
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software in your work, please cite it as below."
title: "Evalchemy"
authors:
- family-names: "Guha"
given-names: "Etash"
- family-names: "Raoff"
given-names: "Negin"
- family-names: "Mercat"
given-names: "Jean"
- family-names: "Marten"
given-names: "Ryan"
- family-names: "Frankel"
given-names: "Eric"
- family-names: "Keh"
given-names: "Sedrick"
- family-names: "Grover"
given-names: "Sachin"
- family-names: "Smyrnis"
given-names: "George"
- family-names: "Vu"
given-names: "Trung"
- family-names: "Saad-Falcon"
given-names: "Jon"
- family-names: "Choi"
given-names: "Caroline"
- family-names: "Arora"
given-names: "Kushal"
- family-names: "Merrill"
given-names: "Mike"
- family-names: "Deng"
given-names: "Yichuan"
- family-names: "Suvarna"
given-names: "Ashima"
- family-names: "Bansal"
given-names: "Hritik"
- family-names: "Nezhurina"
given-names: "Marianna"
- family-names: "Heckel"
given-names: "Reinhard"
- family-names: "Oh"
given-names: "Seewong"
- family-names: "Hashimoto"
given-names: "Tatsunori"
- family-names: "Jitsev"
given-names: "Jenia"
- family-names: "Choi"
given-names: "Yejin"
- family-names: "Shankar"
given-names: "Vaishaal"
- family-names: "Dimakis"
given-names: "Alex"
- family-names: "Sathiamoorthy"
given-names: "Mahesh"
- family-names: "Schmidt"
given-names: "Ludwig"
date-released: "2024-11-28"
repository: "https://github.com/mlfoundations/evalchemy"
publisher: "GitHub"
type: "software"
GitHub Events
Total
- Create event: 61
- Issues event: 32
- Watch event: 425
- Delete event: 37
- Issue comment event: 100
- Member event: 5
- Push event: 619
- Public event: 1
- Pull request review comment event: 23
- Pull request review event: 55
- Pull request event: 148
- Fork event: 59
Last Year
- Create event: 61
- Issues event: 32
- Watch event: 425
- Delete event: 37
- Issue comment event: 100
- Member event: 5
- Push event: 619
- Public event: 1
- Pull request review comment event: 23
- Pull request review event: 55
- Pull request event: 148
- Fork event: 59
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 24
- Total pull requests: 163
- Average time to close issues: 14 days
- Average time to close pull requests: 4 days
- Total issue authors: 19
- Total pull request authors: 23
- Average comments per issue: 0.38
- Average comments per pull request: 0.67
- Merged pull requests: 131
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 24
- Pull requests: 163
- Average time to close issues: 14 days
- Average time to close pull requests: 4 days
- Issue authors: 19
- Pull request authors: 23
- Average comments per issue: 0.38
- Average comments per pull request: 0.67
- Merged pull requests: 131
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- RyanMarten (4)
- penfever (2)
- sravan500 (2)
- luckyfan-cs (1)
- EtashGuha (1)
- TundeAtSN (1)
- richardbaihe (1)
- aashay-sarvam (1)
- Siki-cloud (1)
- chanansh (1)
- marianna13 (1)
- noobimp (1)
- BeastyZ (1)
- juyongjiang (1)
- slimfrkha (1)
Pull Request Authors
- neginraoof (32)
- sedrick-keh-tri (24)
- RyanMarten (22)
- jmercat (21)
- EtashGuha (11)
- penfever (6)
- marianna13 (6)
- ssu53 (6)
- GeorgiosSmyrnis (4)
- asad-aali (4)
- Zayne-sprague (3)
- Hritikbansal (3)
- reinhardh (3)
- jonsaadfalcon (2)
- younesbelkada (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- nvidia/cuda 12.2.0-runtime-ubuntu20.04 build
- fastchat latest
- absl *
- immutabledict *
- langdetect *
- nltk *
- aiohttp *
- fastapi *
- httpx *
- markdown2 [all]
- nh3 *
- numpy *
- prompt_toolkit >=3.0.0
- psutil *
- pydantic <3,>=2.0.0
- pydantic-settings *
- requests *
- rich >=10.0.0
- shortuuid *
- tiktoken *
- uvicorn *
- SentencePiece >=0.2.0
- accelerate >=0.30.1
- hf_transfer >=0.1.6
- httpx >=0.27.0
- nltk >=3.8.1
- numpy >=1.26.3
- openai >=1.30.5
- pandas >=2.2.2
- prettytable *
- python-dotenv >=1.0.1
- scikit-learn >=1.5.0
- tiktoken >=0.6.0
- tqdm >=4.66.4
- transformers >=4.43.1
- codebleu *
- difflib *
- fire *
- fuzzywuzzy *
- openai *
- python-Levenshtein *
- torch *
- tqdm *
- transformers *
- tree-sitter-java *
- tree-sitter-python *
- anthropic *
- cohere *
- datasets *
- fire *
- google-generativeai *
- jsonlines *
- mistralai ==0.4.2
- openai *
- reka-api *
- tenacity *
- together *
- datasets >=2.20.0
- fire *
- openai >=1.5.0
- pandas *
- patsy *
- scikit-learn *
- scipy *
- tiktoken >=0.3.2
- tqdm *
- datasets *
- fire *
- huggingface_hub *
- openai >=1.5.0
- pandas *
- patsy *
- python-dotenv *
- scikit-learn *
- scipy *
- tiktoken >=0.3.2
- anthropic *
- cohere *
- datasets *
- fire *
- google-generativeai *
- jsonlines *
- mistralai <1.0
- openai *
- reka-api *
- tenacity *
- together *
- accelerate *
- aiofiles *
- aiohttp [speedups]>=3.8
- anthropic *
- asttokens *
- backoff >=2.2
- bitsandbytes *
- black *
- boto3 *
- botocore *
- bs4 *
- codebleu *
- cohere *
- colorama *
- dashscope *
- datasets *
- distro *
- docker-pycreds *
- evaluate *
- faiss-cpu ==1.7.4
- fastapi >=0.101.0
- fastavro *
- fasttext-wheel *
- fire *
- fschat *
- fsspec ==2024.6.1
- fuzzywuzzy *
- gcsfs *
- google-auth ==2.25.1
- google-cloud-aiplatform *
- google-generativeai *
- hf-transfer *
- hjson *
- httpx *
- httpx-sse *
- huggingface_hub [cli]
- immutabledict *
- jiter *
- jmespath *
- joblib *
- jsonpickle *
- langdetect *
- loguru >=0.7
- lxml *
- mistralai *
- msgpack *
- mypy_extensions *
- nltk *
- numpy *
- openai *
- optimum ==1.12.0
- pandas *
- patsy *
- peft *
- portalocker *
- prettytable *
- protobuf *
- psycopg [binary]
- psycopg2-binary *
- py-cpuinfo *
- python-Levenshtein *
- python-box *
- python-dotenv *
- ray [default]
- reka-api *
- requests >=2.28
- sacrebleu *
- sagemaker *
- scikit-learn *
- scipy *
- sentence-transformers >=2.2.2
- sentencepiece *
- sentry-sdk *
- shortuuid *
- sqlalchemy *
- sqlitedict *
- tabulate *
- tenacity *
- tensorboard *
- termcolor *
- threadpoolctl *
- tiktoken *
- together *
- torch *
- torchvision *
- tqdm *
- transformers *
- tree-sitter-java *
- tree-sitter-python *
- tree_sitter *
- trl *
- typing_inspect *
- uvicorn >=0.23.0
- wandb *
- websocket *
- actions/checkout v4 composite
- psf/black stable composite
- nvcr.io/nvidia/pytorch 24.09-py3 build
- nvcr.io/nvidia/pytorch 24.09-py3 build
- Django ==4.2.7
- Faker ==20.1.0
- Flask-Mail ==0.9.1
- Levenshtein ==0.25.0
- Pillow ==10.3.0
- PyYAML ==6.0.1
- Requests ==2.31.0
- WTForms ==3.1.2
- Werkzeug ==3.0.1
- beautifulsoup4 ==4.8.2
- blake3 ==0.4.1
- chardet ==5.2.0
- cryptography ==38.0.0
- datetime ==5.5
- dnspython ==2.6.1
- docxtpl ==0.11.5
- flask ==3.0.3
- flask_login ==0.6.3
- flask_restful ==0.3.10
- flask_wtf ==1.2.1
- folium ==0.16.0
- gensim ==4.3.2
- geopandas ==0.13.2
- geopy ==2.4.1
- holidays ==0.29
- keras ==2.11.0
- librosa ==0.10.1
- lxml ==4.9.3
- matplotlib ==3.7.0
- mechanize ==0.4.9
- natsort ==7.1.1
- networkx ==2.6.3
- nltk ==3.8
- numba ==0.55.0
- numpy ==1.21.2
- opencv-python-headless ==4.9.0.80
- openpyxl ==3.1.2
- pandas ==2.0.3
- prettytable ==3.10.0
- psutil ==5.9.5
- pycryptodome ==3.14.1
- pyfakefs ==5.4.1
- pyquery ==1.4.3
- pytesseract ==0.3.10
- pytest ==8.2.0
- python-Levenshtein-wheels *
- python-dateutil ==2.9.0
- python-docx ==1.1.0
- python_http_client ==3.3.7
- pytz ==2023.3.post1
- requests ==2.31.0
- requests_mock ==1.11.0
- rsa ==4.9
- scikit-image ==0.18.0
- scikit-learn ==1.3.1
- scipy ==1.7.2
- seaborn ==0.13.2
- selenium ==4.15
- sendgrid ==6.11.0
- shapely ==2.0.4
- soundfile ==0.12.1
- statsmodels ==0.14.0
- sympy ==1.12
- tensorflow ==2.11.0
- textblob ==0.18.0
- texttable ==1.7.0
- wikipedia ==1.4.0
- wordcloud ==1.9.3
- wordninja ==2.0.0
- xlrd ==2.0.1
- xlwt ==1.3.0
- xmltodict ==0.13.0
- appdirs >=1.4.4
- multipledispatch >=0.6.0
- pqdm >=0.2.0
- tempdir >=0.7.1
- tqdm >=4.56.0
- tree-sitter ==0.21.3
- tree_sitter_languages >=1.10.2
- wget >=3.2
- absl-py *
- immutabledict *
- langdetect *
- nltk *
- accelerate >=0.21
- aiohttp *
- anthropic >=0.3
- antlr4-python3-runtime ==4.11
- datasets *
- fastapi *
- fschat @git+https://github.com/lm-sys/FastChat#egg=c5223e34babd24c3f9b08205e6751ea6e42c9684
- httpx *
- immutabledict *
- langchain *
- langdetect *
- lark *
- levenshtein >=0.20.4
- lxml *
- markdown2 [all]
- nh3 *
- nltk *
- numpy *
- openai *
- packaging *
- pandas ==2.2.2
- peft *
- prompt_toolkit >=3.0.0
- protobuf *
- psutil *
- pydantic *
- pyext ==0.7
- ray *
- requests *
- rich >=10.0.0
- sentencepiece *
- shortuuid *
- sympy >=1.12
- tenacity *
- tiktoken *
- torch *
- tqdm >=4.62.1
- transformers >=4.31.0
- uvicorn *
- wheel *
- anthropic >=0.49.0
- antlr4-python3-runtime ==4.11
- beautifulsoup4 >=4.13.1
- datasets >= 3.5.0
- google-genai >=1.11.0
- json5 >=0.10.0
- loguru >=0.7.3
- matplotlib >=3.9.3
- openai >=1.67.0
- pandas >=2.2.3
- python-fasthtml >=0.12.1
- regex >=2024.11.6
- requests >=2.32.3
- seaborn >=0.13.2
- sympy >=1.13.1
- together >=1.3.14
- aiohttp *
- antlr4-python3-runtime ==4.11
- asyncpg *
- fastapi *
- psycopg2-binary *
- python-json-logger *
- requests *
- sqlalchemy *
- uvicorn *