evalplus

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

https://github.com/evalplus/evalplus

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

benchmark chatgpt efficiency gpt-4 large-language-models program-synthesis testing
Last synced: 6 months ago

Repository

Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024

Basic Info
  • Host: GitHub
  • Owner: evalplus
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage: https://evalplus.github.io
  • Size: 5.3 MB
Statistics
  • Stars: 1,557
  • Watchers: 10
  • Forks: 172
  • Open Issues: 50
  • Releases: 11
Topics
benchmark chatgpt efficiency gpt-4 large-language-models program-synthesis testing
Created almost 3 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

EvalPlus(📖) => 📚

📙 About · 🔥 Quick Start · 🚀 LLM Backends · 📚 Documents · 📜 Citation · 🙏 Acknowledgement

📢 News

Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams.

Below tracks the notable updates of EvalPlus:

  • [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Highlights: (i) Code efficiency evaluation via EvalPerf, (ii) one command to run all: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini & Anthropic, etc.
  • [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
  • [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). ~4pp pass@1 improvement could be expected.
Earlier news :: click to expand ::
  • ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
  • ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
  • ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6).
  • ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
  • ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
  • ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
  • ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!

📙 About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • HumanEval+: 80x more tests than the original HumanEval!
  • MBPP+: 35x more tests than the original MBPP!
  • EvalPerf: evaluating the efficiency of LLM-generated code!
  • Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?

  • Precise evaluation: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
  • Coding rigorousness: Compare the score differences before & after applying the EvalPlus tests! A smaller drop means the model generates more rigorous code, while a bigger drop means the generated code tends to be fragile.
  • Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.

Want to know more details? Read our papers & materials!

🔥 Quick Start

Code Correctness Evaluation: HumanEval(+) or MBPP(+)

```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```
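You can also score solutions generated outside of EvalPlus by pointing the evaluator at an existing samples file instead of a model. A minimal sketch, assuming a hypothetical `samples.jsonl` in the JSONL layout EvalPlus expects (one record per sample with `task_id` and `solution` fields):

```bash
# Evaluate pre-generated solutions against HumanEval+ (the samples path is illustrative)
evalplus.evaluate --dataset humaneval \
                  --samples samples.jsonl
```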

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
                             --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```

Code Efficiency Evaluation: EvalPerf (*nix only)

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```
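Note that the `perf_event_paranoid` setting above only lasts until the next reboot. As a sketch using standard Linux tooling (not an EvalPlus command), you can check the current value and relax it via `sysctl` instead:

```bash
# Show the current perf_event_paranoid level; EvalPerf's profiling needs a permissive value
cat /proc/sys/kernel/perf_event_paranoid

# Equivalent, also non-persistent, way to allow perf events for this boot
sudo sysctl -w kernel.perf_event_paranoid=0
```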

🛡️ Safe code execution within Docker :: click to expand ::
```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```

🚀 LLM Backends

HuggingFace models

  • transformers backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```

> [!Note]
>
> EvalPlus uses different prompts for base and chat models. By default, the model type is detected via `tokenizer.chat_template` when using the `hf`/`vllm` backends. For other backends, only chat mode is allowed.
>
> Therefore, if your base model comes with a `tokenizer.chat_template`, please add `--force-base-prompt` to avoid it being evaluated in chat mode.
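For example, a hedged sketch of forcing the base-style prompt (the model name is only an illustration of a base model that ships a chat template; substitute your own):

```bash
# Evaluate a base model with the base prompt even though its tokenizer
# defines a chat_template (otherwise it would be evaluated in chat mode)
evalplus.evaluate --model "deepseek-ai/deepseek-coder-6.7b-base" \
                  --dataset humaneval \
                  --backend vllm \
                  --force-base-prompt \
                  --greedy
```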

Enable Flash Attention 2 :: click to expand ::
```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
```
  • vllm backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```

  • OpenAI-compatible servers (e.g., vLLM):

```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --base-url http://localhost:8000/v1 \
                  --backend openai --greedy

# GPTQModel
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp] \
                  --backend gptqmodel --greedy
```
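For the vLLM-server route above, one way to bring up a local OpenAI-compatible endpoint is sketched below; this uses vLLM's own CLI rather than an EvalPlus command, and the model and port are assumptions matching the `--base-url http://localhost:8000/v1` shown above:

```bash
# Launch an OpenAI-compatible vLLM server on port 8000 (requires a local vLLM install;
# see the vLLM docs link above for the Docker-based alternative)
vllm serve "ise-uiuc/Magicoder-S-DS-6.7B" --port 8000
```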

OpenAI models

```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
                  --dataset [humaneval|mbpp] \
                  --backend openai \
                  --greedy
```

Anthropic models

```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp] \
                  --backend anthropic \
                  --greedy
```

Google Gemini models

```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
                  --dataset [humaneval|mbpp] \
                  --backend google \
                  --greedy
```

Amazon Bedrock models

```bash
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp] \
                  --backend bedrock \
                  --greedy
```

Ollama backend

```bash
evalplus.evaluate --model "mistral:7b" \
                  --dataset [humaneval|mbpp] \
                  --backend ollama \
                  --base-url http://localhost:11434/v1 \
                  --greedy
```

You can check out the generated samples and evaluation results under `evalplus_results/[humaneval|mbpp]/`.
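As a rough sketch of what to expect there (exact file names depend on the model, backend, and sampling settings, so the path below is illustrative):

```bash
# Generated samples (*.jsonl) and evaluation results (*.json) are written per dataset
ls -l evalplus_results/humaneval/
```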

⏬ Using EvalPlus as a local repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

📚 Documents

To learn more about how to use EvalPlus, please refer to the project documentation.

📜 Citation

```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```

🙏 Acknowledgement

Owner

  • Name: evalplus
  • Login: evalplus
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: EvalPlus
authors:
  - family-names: EvalPlus Team
url: https://github.com/evalplus/evalplus
doi: https://doi.org/10.48550/arXiv.2305.01210
date-released: 2023-05-01
license: Apache-2.0
preferred-citation:
  type: article
  title: "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation"
  authors:
    - family-names: Liu
      given-names: Jiawei
    - family-names: Xia
      given-names: Chunqiu Steven
    - family-names: Wang
      given-names: Yuyao
    - family-names: Zhang
      given-names: Lingming
  year: 2023
  journal: "arXiv preprint arXiv:2305.01210"
  doi: https://doi.org/10.48550/arXiv.2305.01210
  url: https://arxiv.org/abs/2305.01210

GitHub Events

Total
  • Create event: 4
  • Release event: 1
  • Issues event: 37
  • Watch event: 336
  • Delete event: 1
  • Issue comment event: 79
  • Push event: 63
  • Pull request event: 39
  • Pull request review comment event: 7
  • Pull request review event: 16
  • Fork event: 64
Last Year
  • Create event: 4
  • Release event: 1
  • Issues event: 37
  • Watch event: 336
  • Delete event: 1
  • Issue comment event: 79
  • Push event: 63
  • Pull request event: 39
  • Pull request review comment event: 7
  • Pull request review event: 16
  • Fork event: 64

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 194
  • Total pull requests: 126
  • Average time to close issues: 14 days
  • Average time to close pull requests: 5 days
  • Total issue authors: 121
  • Total pull request authors: 34
  • Average comments per issue: 1.93
  • Average comments per pull request: 0.77
  • Merged pull requests: 94
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 30
  • Pull requests: 39
  • Average time to close issues: 14 days
  • Average time to close pull requests: 11 days
  • Issue authors: 29
  • Pull request authors: 15
  • Average comments per issue: 1.3
  • Average comments per pull request: 0.74
  • Merged pull requests: 25
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ganler (14)
  • ethanc8 (12)
  • ajinkya123-robo (9)
  • uukuguy (8)
  • mrigankpawagi (5)
  • Romainsauvestre (4)
  • marcusm117 (4)
  • sabagithub (3)
  • Shlok-crypto (3)
  • Nondzu (3)
  • soryxie (3)
  • davyzhu (2)
  • zhimin-z (2)
  • nanowell (2)
  • RoacherM (2)
Pull Request Authors
  • soryxie (19)
  • FatPigeorz (14)
  • CL-ModelCloud (10)
  • Co1lin (10)
  • UniverseFly (9)
  • aksakalmustafa (6)
  • Kristoff-starling (6)
  • terryyz (4)
  • nalinabrol (4)
  • ganler (4)
  • AnitaLiu98 (3)
  • edgan8 (2)
  • hwaking (2)
  • jasonzliang (2)
  • younesbelkada (2)
Top Labels
Issue Labels
model eval (90) bug (28) program contract (13) enhancement (13) good first issue (11) high priority (6) help wanted (6) question (5) actionable (4) incomplete (2) invalid (2) oracle (2) incorrect groundtruth (1) new model (1)
Pull Request Labels
model eval (2)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 5,715 last-month
  • Total docker downloads: 94
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 4
    (may contain duplicates)
  • Total versions: 25
  • Total maintainers: 1
proxy.golang.org: github.com/evalplus/evalplus
  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.5%
Average: 6.7%
Dependent repos count: 7.0%
Last synced: 6 months ago
pypi.org: evalplus

"EvalPlus for rigourous evaluation of LLM-synthesized code"

  • Versions: 12
  • Dependent Packages: 1
  • Dependent Repositories: 4
  • Downloads: 5,715 Last month
  • Docker Downloads: 94
Rankings
Stargazers count: 2.9%
Docker downloads count: 3.3%
Forks count: 6.4%
Average: 7.1%
Dependent repos count: 7.5%
Dependent packages count: 10.1%
Downloads: 12.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

Dockerfile docker
  • python 3.8-slim-buster build
pyproject.toml pypi
requirements-llm.txt pypi
  • fschat *
  • openai *
  • rich *
requirements-tools.txt pypi
  • matplotlib *
  • numpy *
  • rich *
  • tempdir *
  • termcolor *
  • tqdm *
requirements.txt pypi
  • appdirs *
  • multipledispatch *
  • numpy *
  • tempdir *
  • tqdm *
  • wget *