evalplus

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

https://github.com/evalplus/evalplus

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

benchmark chatgpt efficiency gpt-4 large-language-models program-synthesis testing
Last synced: 6 months ago

Repository

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

Basic Info
  • Host: GitHub
  • Owner: evalplus
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage: https://evalplus.github.io
  • Size: 5.3 MB
Statistics
  • Stars: 1,557
  • Watchers: 10
  • Forks: 172
  • Open Issues: 50
  • Releases: 11
Topics
benchmark chatgpt efficiency gpt-4 large-language-models program-synthesis testing
Created almost 3 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

EvalPlus(📖) => 📚

📙 About · 🔥 Quick Start · 🚀 LLM Backends · 📚 Documents · 📜 Citation · 🙏 Acknowledgement

📢 News

Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams.

Below tracks the notable updates of EvalPlus:

  • [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Highlights: (i) Code efficiency evaluation via EvalPerf, (ii) one command to run all: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini & Anthropic, etc.
  • [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
  • [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). ~4pp pass@1 improvement could be expected.
Earlier news :: click to expand ::
  • ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
  • ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
  • ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6).
  • ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
  • ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
  • ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
  • ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!

📙 About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • HumanEval+: 80x more tests than the original HumanEval!
  • MBPP+: 35x more tests than the original MBPP!
  • EvalPerf: evaluating the efficiency of LLM-generated code!
  • Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?

  • Precise evaluation: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
  • Coding rigorousness: Compare the score differences before & after applying the EvalPlus tests! A smaller drop means the model generates more rigorous code, while a bigger drop means the generated code tends to be fragile.
  • Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.

Want to know more details? Read our papers & materials!

🔥 Quick Start

Code Correctness Evaluation: HumanEval(+) or MBPP(+)

```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```
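You can also score solutions generated outside of EvalPlus by pointing the evaluator at an existing samples file instead of a model. A minimal sketch, assuming a hypothetical `samples.jsonl` in the JSONL layout EvalPlus expects (one record per sample with `task_id` and `solution` fields):

```bash
# Evaluate pre-generated solutions against HumanEval+ (the samples path is illustrative)
evalplus.evaluate --dataset humaneval \
                  --samples samples.jsonl
```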

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
                             --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```

Code Efficiency Evaluation: EvalPerf (*nix only)

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
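Note that the `perf_event_paranoid` setting above only lasts until the next reboot. As a sketch using standard Linux tooling (not an EvalPlus command), you can check the current value and relax it via `sysctl` instead:

```bash
# Show the current perf_event_paranoid level; EvalPerf's profiling needs a permissive value
cat /proc/sys/kernel/perf_event_paranoid

# Equivalent, also non-persistent, way to allow perf events for this boot
sudo sysctl -w kernel.perf_event_paranoid=0
```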

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```

🚀 LLM Backends

HuggingFace models

  • transformers backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```

> [!Note]
>
> EvalPlus uses different prompts for base and chat models. By default, the model type is detected via `tokenizer.chat_template` when using the `hf`/`vllm` backends. For other backends, only chat mode is allowed.
>
> Therefore, if your base model comes with a `tokenizer.chat_template`, please add `--force-base-prompt` to avoid it being evaluated in chat mode.
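For example, a hedged sketch of forcing the base-style prompt (the model name is only an illustration of a base model that ships a chat template; substitute your own):

```bash
# Evaluate a base model with the base prompt even though its tokenizer
# defines a chat_template (otherwise it would be evaluated in chat mode)
evalplus.evaluate --model "deepseek-ai/deepseek-coder-6.7b-base" \
                  --dataset humaneval \
                  --backend vllm \
                  --force-base-prompt \
                  --greedy
```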

Enable Flash Attention 2 :: click to expand ::
```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
```
  • vllm backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```

  • OpenAI-compatible servers (e.g., vLLM):

```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --base-url http://localhost:8000/v1 \
                  --backend openai --greedy

# GPTQModel
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp] \
                  --backend gptqmodel --greedy
```
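For the vLLM-server route above, one way to bring up a local OpenAI-compatible endpoint is sketched below; this uses vLLM's own CLI rather than an EvalPlus command, and the model and port are assumptions matching the `--base-url http://localhost:8000/v1` shown above:

```bash
# Launch an OpenAI-compatible vLLM server on port 8000 (requires a local vLLM install;
# see the vLLM docs link above for the Docker-based alternative)
vllm serve "ise-uiuc/Magicoder-S-DS-6.7B" --port 8000
```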

OpenAI models

```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
                  --dataset [humaneval|mbpp] \
                  --backend openai \
                  --greedy
```

Anthropic models

```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp] \
                  --backend anthropic \
                  --greedy
```

Google Gemini models

```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
                  --dataset [humaneval|mbpp] \
                  --backend google \
                  --greedy
```

Amazon Bedrock models

```bash
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp] \
                  --backend bedrock \
                  --greedy
```

Ollama backend

```bash
evalplus.evaluate --model "mistral:7b" \
                  --dataset [humaneval|mbpp] \
                  --backend ollama \
                  --base-url http://localhost:11434/v1 \
                  --greedy
```

You can check out the generated samples and evaluation results under `evalplus_results/[humaneval|mbpp]/`.
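As a rough sketch of what to expect there (exact file names depend on the model, backend, and sampling settings, so the path below is illustrative):

```bash
# Generated samples (*.jsonl) and evaluation results (*.json) are written per dataset
ls -l evalplus_results/humaneval/
```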

⏬ Using EvalPlus as a local repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

📚 Documents

To learn more about how to use EvalPlus, please refer to the project documentation.

📜 Citation

```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```

🙏 Acknowledgement

Owner

  • Name: evalplus
  • Login: evalplus
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: EvalPlus
authors:
  - family-names: EvalPlus Team
url: https://github.com/evalplus/evalplus
doi: https://doi.org/10.48550/arXiv.2305.01210
date-released: 2023-05-01
license: Apache-2.0
preferred-citation:
  type: article
  title: "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation"
  authors:
    - family-names: Liu
      given-names: Jiawei
    - family-names: Xia
      given-names: Chunqiu Steven
    - family-names: Wang
      given-names: Yuyao
    - family-names: Zhang
      given-names: Lingming
  year: 2023
  journal: "arXiv preprint arXiv:2305.01210"
  doi: https://doi.org/10.48550/arXiv.2305.01210
  url: https://arxiv.org/abs/2305.01210

GitHub Events

Total
  • Create event: 4
  • Release event: 1
  • Issues event: 37
  • Watch event: 336
  • Delete event: 1
  • Issue comment event: 79
  • Push event: 63
  • Pull request event: 39
  • Pull request review comment event: 7
  • Pull request review event: 16
  • Fork event: 64
Last Year
  • Create event: 4
  • Release event: 1
  • Issues event: 37
  • Watch event: 336
  • Delete event: 1
  • Issue comment event: 79
  • Push event: 63
  • Pull request event: 39
  • Pull request review comment event: 7
  • Pull request review event: 16
  • Fork event: 64

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 194
  • Total pull requests: 126
  • Average time to close issues: 14 days
  • Average time to close pull requests: 5 days
  • Total issue authors: 121
  • Total pull request authors: 34
  • Average comments per issue: 1.93
  • Average comments per pull request: 0.77
  • Merged pull requests: 94
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 30
  • Pull requests: 39
  • Average time to close issues: 14 days
  • Average time to close pull requests: 11 days
  • Issue authors: 29
  • Pull request authors: 15
  • Average comments per issue: 1.3
  • Average comments per pull request: 0.74
  • Merged pull requests: 25
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ganler (14)
  • ethanc8 (12)
  • ajinkya123-robo (9)
  • uukuguy (8)
  • mrigankpawagi (5)
  • Romainsauvestre (4)
  • marcusm117 (4)
  • sabagithub (3)
  • Shlok-crypto (3)
  • Nondzu (3)
  • soryxie (3)
  • davyzhu (2)
  • zhimin-z (2)
  • nanowell (2)
  • RoacherM (2)
Pull Request Authors
  • soryxie (19)
  • FatPigeorz (14)
  • CL-ModelCloud (10)
  • Co1lin (10)
  • UniverseFly (9)
  • aksakalmustafa (6)
  • Kristoff-starling (6)
  • terryyz (4)
  • nalinabrol (4)
  • ganler (4)
  • AnitaLiu98 (3)
  • edgan8 (2)
  • hwaking (2)
  • jasonzliang (2)
  • younesbelkada (2)
Top Labels
Issue Labels
model eval (90) bug (28) program contract (13) enhancement (13) good first issue (11) high priority (6) help wanted (6) question (5) actionable (4) incomplete (2) invalid (2) oracle (2) incorrect groundtruth (1) new model (1)
Pull Request Labels
model eval (2)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 5,715 last-month
  • Total docker downloads: 94
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 4
    (may contain duplicates)
  • Total versions: 25
  • Total maintainers: 1
proxy.golang.org: github.com/evalplus/evalplus
  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.5%
Average: 6.7%
Dependent repos count: 7.0%
Last synced: 6 months ago
pypi.org: evalplus

"EvalPlus for rigourous evaluation of LLM-synthesized code"

  • Versions: 12
  • Dependent Packages: 1
  • Dependent Repositories: 4
  • Downloads: 5,715 Last month
  • Docker Downloads: 94
Rankings
Stargazers count: 2.9%
Docker downloads count: 3.3%
Forks count: 6.4%
Average: 7.1%
Dependent repos count: 7.5%
Dependent packages count: 10.1%
Downloads: 12.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

Dockerfile docker
  • python 3.8-slim-buster build
pyproject.toml pypi
requirements-llm.txt pypi
  • fschat *
  • openai *
  • rich *
requirements-tools.txt pypi
  • matplotlib *
  • numpy *
  • rich *
  • tempdir *
  • termcolor *
  • tqdm *
requirements.txt pypi
  • appdirs *
  • multipledispatch *
  • numpy *
  • tempdir *
  • tqdm *
  • wget *