evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.6%) to scientific vocabulary
Keywords
Repository
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Basic Info
- Host: GitHub
- Owner: evalplus
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: https://evalplus.github.io
- Size: 5.3 MB
Statistics
- Stars: 1,557
- Watchers: 10
- Forks: 172
- Open Issues: 50
- Releases: 11
Topics
Metadata Files
README.md
EvalPlus(📖) => 📚
📙About • 🔥Quick Start • 🚀LLM Backends • 📚Documents • 📜Citation • 🙏Acknowledgement
📢 News
Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams, including:
- Meta Llama 3.1 and 3.3
- Allen AI TÜLU 1/2/3
- Qwen2.5-Coder
- CodeQwen 1.5
- DeepSeek-Coder V2
- Qwen2
- Snowflake Arctic
- StarCoder2
- Magicoder
- WizardCoder
Notable updates of EvalPlus are tracked below:
- [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Highlights: (i) code efficiency evaluation via EvalPerf, (ii) one command to run all: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini & Anthropic, etc.
- [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
Earlier news :: click to expand ::
📙 About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ EvalPerf: evaluating the efficiency of LLM-generated code!
- ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
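Both plus datasets can also be loaded programmatically via the `evalplus.data` helpers. Below is a minimal sketch; the helper names come from the project documentation, while the printed problem field (`"prompt"`) is an assumption based on the usual HumanEval task format.

```python
# Minimal sketch: load HumanEval+ and MBPP+ and peek at one task.
from evalplus.data import get_human_eval_plus, get_mbpp_plus

humaneval_plus = get_human_eval_plus()  # dict: task_id -> problem dict
mbpp_plus = get_mbpp_plus()

print(f"{len(humaneval_plus)} HumanEval+ tasks, {len(mbpp_plus)} MBPP+ tasks")

task_id, problem = next(iter(humaneval_plus.items()))
print(task_id)
print(problem["prompt"])  # the prompt handed to the model (assumed field name)
```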
Why EvalPlus?
- ✨ Precise evaluation: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: Compare the scores before & after applying the EvalPlus tests! A smaller drop means the model generates more rigorous code, while a bigger drop means the generated code tends to be fragile.
- ✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
Want to know more details? Read our papers & materials!
- EvalPlus: NeurIPS'23 paper, Slides, Poster, Leaderboard
- EvalPerf: COLM'24 paper, Poster, Documentation, Leaderboard
🔥 Quick Start
Code Correctness Evaluation: HumanEval(+) or MBPP(+)
```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```
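If you prefer to generate completions with your own pipeline instead of a built-in backend, a hedged sketch of the bring-your-own-samples flow is shown below; `generate_one_completion` is a hypothetical placeholder for your model call, and the `--samples` flag follows the usage described in the project documentation.

```python
# Sketch: write model completions to a JSONL file that evalplus can score.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Hypothetical placeholder: replace with your own model call.
    return "    pass\n"

samples = [
    dict(task_id=task_id, solution=generate_one_completion(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```

The file can then be scored with, e.g., `evalplus.evaluate --dataset humaneval --samples samples.jsonl`.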
🛡️ Safe code execution within Docker :: click to expand ::
Code Efficiency Evaluation: EvalPerf (*nix only)
```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'  # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
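Before launching EvalPerf it can help to confirm that the kernel actually picked up the `perf_event_paranoid` setting; the tiny check below simply reads the proc file and is not part of EvalPlus itself.

```python
# Read the current perf_event_paranoid level; EvalPerf's instructions ask for 0.
from pathlib import Path

level = int(Path("/proc/sys/kernel/perf_event_paranoid").read_text().strip())
print("perf_event_paranoid =", level,
      "(OK)" if level <= 0 else "(rerun the sysctl command above with sudo)")
```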
🛡️ Safe code execution within Docker :: click to expand ::
🚀 LLM Backends
HuggingFace models
`transformers` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```
> [!Note]
> EvalPlus uses different prompts for base and chat models. By default, the mode is detected via `tokenizer.chat_template` when using `hf`/`vllm` as the backend. For other backends, only chat mode is allowed. Therefore, if your base model comes with a `tokenizer.chat_template`, please add `--force-base-prompt` to avoid being evaluated in chat mode.
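To see ahead of time which mode a model will be evaluated in, you can inspect the tokenizer yourself; a minimal sketch, assuming `transformers` is installed:

```python
# Check whether the model ships a chat template, which is what the hf/vllm
# backends use to decide between base and chat prompting.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ise-uiuc/Magicoder-S-DS-6.7B")
if tok.chat_template:
    print("chat template found -> chat prompting unless --force-base-prompt is passed")
else:
    print("no chat template -> base-model prompting")
```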
Enable Flash Attention 2 :: click to expand ::
`vllm` backend:
```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```
`openai`-compatible servers (e.g., vLLM):
```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --base-url http://localhost:8000/v1 \
                  --backend openai --greedy

# GPTQModel
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp] \
                  --backend gptqmodel --greedy
```
OpenAI models
- Access OpenAI APIs from OpenAI Console
```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
                  --dataset [humaneval|mbpp] \
                  --backend openai \
                  --greedy
```
Anthropic models
- Access Anthropic APIs from Anthropic Console
```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp] \
                  --backend anthropic \
                  --greedy
```
Google Gemini models
- Access Gemini APIs from Google AI Studio
```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
                  --dataset [humaneval|mbpp] \
                  --backend google \
                  --greedy
```
Amazon Bedrock models
```bash
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp] \
                  --backend bedrock \
                  --greedy
```
Ollama backend
```bash
evalplus.evaluate --model "mistral:7b" \
                  --dataset [humaneval|mbpp] \
                  --backend ollama \
                  --base-url http://localhost:11434/v1 \
                  --greedy
```
You can check out the generations and results at `evalplus_results/[humaneval|mbpp]/`.
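To poke around the run artifacts, a small sketch that simply walks the results directory is given below; the file names and JSON layout are assumptions, not documented guarantees.

```python
# List result files and show the top-level keys of any JSON artifacts.
import json
from pathlib import Path

results_dir = Path("evalplus_results/humaneval")  # or evalplus_results/mbpp
for path in sorted(results_dir.rglob("*.json")):
    print(path)
    try:
        data = json.loads(path.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue
    if isinstance(data, dict):
        print("  keys:", list(data)[:10])
```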
⏬ Using EvalPlus as a local repo? :: click to expand ::
📚 Documents
To learn more about how to use EvalPlus, please refer to the project documentation (https://evalplus.readthedocs.io/).
📜 Citation
```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```
🙏 Acknowledgement
Owner
- Name: evalplus
- Login: evalplus
- Kind: organization
- Repositories: 1
- Profile: https://github.com/evalplus
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: EvalPlus
authors:
  - family-names: EvalPlus Team
url: https://github.com/evalplus/evalplus
doi: https://doi.org/10.48550/arXiv.2305.01210
date-released: 2023-05-01
license: Apache-2.0
preferred-citation:
  type: article
  title: "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation"
  authors:
    - family-names: Liu
      given-names: Jiawei
    - family-names: Xia
      given-names: Chunqiu Steven
    - family-names: Wang
      given-names: Yuyao
    - family-names: Zhang
      given-names: Lingming
  year: 2023
  journal: "arXiv preprint arXiv:2305.01210"
  doi: https://doi.org/10.48550/arXiv.2305.01210
  url: https://arxiv.org/abs/2305.01210
```
GitHub Events
Total
- Create event: 4
- Release event: 1
- Issues event: 37
- Watch event: 336
- Delete event: 1
- Issue comment event: 79
- Push event: 63
- Pull request event: 39
- Pull request review comment event: 7
- Pull request review event: 16
- Fork event: 64
Last Year
- Create event: 4
- Release event: 1
- Issues event: 37
- Watch event: 336
- Delete event: 1
- Issue comment event: 79
- Push event: 63
- Pull request event: 39
- Pull request review comment event: 7
- Pull request review event: 16
- Fork event: 64
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 194
- Total pull requests: 126
- Average time to close issues: 14 days
- Average time to close pull requests: 5 days
- Total issue authors: 121
- Total pull request authors: 34
- Average comments per issue: 1.93
- Average comments per pull request: 0.77
- Merged pull requests: 94
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 30
- Pull requests: 39
- Average time to close issues: 14 days
- Average time to close pull requests: 11 days
- Issue authors: 29
- Pull request authors: 15
- Average comments per issue: 1.3
- Average comments per pull request: 0.74
- Merged pull requests: 25
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ganler (14)
- ethanc8 (12)
- ajinkya123-robo (9)
- uukuguy (8)
- mrigankpawagi (5)
- Romainsauvestre (4)
- marcusm117 (4)
- sabagithub (3)
- Shlok-crypto (3)
- Nondzu (3)
- soryxie (3)
- davyzhu (2)
- zhimin-z (2)
- nanowell (2)
- RoacherM (2)
Pull Request Authors
- soryxie (19)
- FatPigeorz (14)
- CL-ModelCloud (10)
- Co1lin (10)
- UniverseFly (9)
- aksakalmustafa (6)
- Kristoff-starling (6)
- terryyz (4)
- nalinabrol (4)
- ganler (4)
- AnitaLiu98 (3)
- edgan8 (2)
- hwaking (2)
- jasonzliang (2)
- younesbelkada (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: 5,715 last month (pypi)
- Total docker downloads: 94
- Total dependent packages: 1 (may contain duplicates)
- Total dependent repositories: 4 (may contain duplicates)
- Total versions: 25
- Total maintainers: 1
proxy.golang.org: github.com/evalplus/evalplus
- Documentation: https://pkg.go.dev/github.com/evalplus/evalplus#section-documentation
- License: apache-2.0
- Latest release: v0.3.1 (published over 1 year ago)
Rankings
pypi.org: evalplus
"EvalPlus for rigourous evaluation of LLM-synthesized code"
- Homepage: https://github.com/evalplus/evalplus
- Documentation: https://evalplus.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.3.1 (published over 1 year ago)
Rankings
Maintainers (1)
Dependencies
- python 3.8-slim-buster build
- fschat *
- openai *
- rich *
- matplotlib *
- numpy *
- rich *
- tempdir *
- termcolor *
- tqdm *
- appdirs *
- multipledispatch *
- numpy *
- tempdir *
- tqdm *
- wget *