evalplus

evalplus for DataLeaderboard

https://github.com/opendataarena/evalplus

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.6%) to scientific vocabulary

Last synced: 11 months ago

Repository

evalplus for DataLeaderboard

Basic Info

Host: GitHub
Owner: OpenDataArena
License: apache-2.0
Language: Python
Default Branch: main
Size: 4.13 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 12 months ago · Last pushed 12 months ago

Metadata Files

Readme License Citation

`EvalPlus(📖) => 📚`

📙About • 🔥Quick Start • 🚀LLM Backends • 📚Documents • 📜Citation • 🙏Acknowledgement

📢 News

Who's using EvalPlus datasets? EvalPlus has been used by various LLM teams.

Notable updates to EvalPlus:

- [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Highlights: (i) code efficiency evaluation via EvalPerf, (ii) one command to run everything: generation + post-processing + evaluation, (iii) support for more inference backends such as Google Gemini and Anthropic.
- [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
- [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). An improvement of about 4 percentage points in pass@1 can be expected.

Earlier news

- ([`v0.2.1`](https://github.com/evalplus/evalplus/releases/tag/v0.2.1)) You can use EvalPlus datasets via [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness)! HumanEval+ oracle fixes (32).
- ([`v0.2.0`](https://github.com/evalplus/evalplus/releases/tag/v0.2.0)) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
- ([`v0.1.7`](https://github.com/evalplus/evalplus/releases/tag/v0.1.7)) [Leaderboard](https://evalplus.github.io/leaderboard.html) release; HumanEval+ contract and input fixes (32/166/126/6).
- ([`v0.1.6`](https://github.com/evalplus/evalplus/releases/tag/v0.1.6)) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140).
- ([`v0.1.5`](https://github.com/evalplus/evalplus/releases/tag/v0.1.5)) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
- ([`v0.1.1`](https://github.com/evalplus/evalplus/releases/tag/v0.1.1)) Optimizing user experiences: evaluation speed, PyPI package, Docker, etc.
- ([`v0.1.0`](https://github.com/evalplus/evalplus/releases/tag/v0.1.0)) HumanEval+ is released!

📙 About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

✨ HumanEval+: 80x more tests than the original HumanEval!
✨ MBPP+: 35x more tests than the original MBPP!
✨ EvalPerf: evaluating the efficiency of LLM-generated code!
✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?

✨ Precise evaluation: See our leaderboard for the latest LLM rankings before & after rigorous evaluation.
✨ Coding rigor: Compare a model's scores before & after applying the EvalPlus tests. A smaller drop indicates more rigorous code generation, while a larger drop means the generated code tends to be fragile.
✨ Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.
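
The before/after comparison described above can be made concrete with a small calculation. The following is a minimal sketch, not part of EvalPlus: the helper name `rigor_drop` is hypothetical and the scores in the usage example are made-up placeholders rather than leaderboard numbers.

```python
def rigor_drop(base_pass1: float, plus_pass1: float) -> dict:
    """Compare pass@1 on the original tests with pass@1 on the EvalPlus tests.

    Both inputs are percentages in [0, 100]. Hypothetical helper for
    illustration; it is not an EvalPlus API.
    """
    for score in (base_pass1, plus_pass1):
        if not 0.0 <= score <= 100.0:
            raise ValueError("scores must be percentages in [0, 100]")
    absolute = base_pass1 - plus_pass1  # drop in percentage points
    relative = absolute / base_pass1 if base_pass1 else 0.0  # share of base score lost
    return {"absolute_pp": absolute, "relative": relative}


if __name__ == "__main__":
    # Placeholder scores, chosen only to show the arithmetic.
    report = rigor_drop(base_pass1=80.0, plus_pass1=72.0)
    print(f"drop: {report['absolute_pp']:.1f} pp ({report['relative']:.1%} relative)")
```

A model whose score barely moves between the two test suites yields values near zero for both fields; a large `absolute_pp` suggests solutions that pass the original tests but break on the extra inputs.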

Want to know more details? Read our papers & materials!

EvalPlus: NeurIPS'23 paper, Slides, Poster, Leaderboard
EvalPerf: COLM'24 paper, Poster, Documentation, Leaderboard

🔥 Quick Start

Code Correctness Evaluation: HumanEval(+) or MBPP(+)

```bash
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --greedy
```

🛡️ Safe code execution within Docker

```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval \
                 --backend vllm \
                 --greedy

# Code execution within Docker
docker run --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
```
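
The `--samples` path passed to the container follows a pattern visible in the Docker examples of this README: the `/` in the model name becomes `--`, followed by the backend and the sampling temperature. Below is a minimal sketch of that pattern, inferred from those examples only. The helper name `samples_path` is hypothetical, and EvalPlus's own naming logic may differ in cases the examples do not cover.

```python
from pathlib import PurePosixPath


def samples_path(root: str, dataset: str, model: str, backend: str,
                 temperature: float) -> str:
    """Build the sample-file path as it appears in the README examples.

    Hypothetical helper: the naming rule is inferred from the examples,
    not taken from the EvalPlus source code.
    """
    model_id = model.strip("/").replace("/", "--")  # "org/name" -> "org--name"
    filename = f"{model_id}_{backend}_temp_{float(temperature)}.jsonl"
    return str(PurePosixPath(root) / dataset / filename)


if __name__ == "__main__":
    print(samples_path("/app", "humaneval",
                       "ise-uiuc/Magicoder-S-DS-6.7B", "vllm", 0.0))
```

Because `$(pwd)/evalplus_results` is mounted at `/app`, a file written locally under `evalplus_results/humaneval/` is visible inside the container under `/app/humaneval/`.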

Code Efficiency Evaluation: EvalPerf (*nix only)

```bash
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
```

🛡️ Safe code execution within Docker

```bash
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf \
                 --backend vllm \
                 --temperature 1.0 \
                 --n-samples 100

# Code execution within Docker
sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
docker run --cap-add PERFMON --rm --pull=always -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/evalperf/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl
```

🚀 LLM Backends

HuggingFace models

transformers backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --greedy
```

> [!Note]
>
> EvalPlus uses different prompts for base and chat models. By default, the mode is detected from `tokenizer.chat_template` when using `hf`/`vllm` as the backend. For other backends, only chat mode is allowed.
>
> Therefore, if your base model comes with a `tokenizer.chat_template`, add `--force-base-prompt` to avoid it being evaluated in chat mode.
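
The rule in the note above can be written as a small decision function. This is a sketch of the behavior as described in the note, not EvalPlus's actual implementation: the function name `use_chat_prompt` is hypothetical, and the tokenizer is modeled as any object that may carry a `chat_template` attribute.

```python
from types import SimpleNamespace


def use_chat_prompt(tokenizer, backend: str, force_base_prompt: bool = False) -> bool:
    """Return True when the chat prompt would be used, per the note above.

    Hypothetical helper; it only mirrors the documented behavior.
    """
    if backend not in ("hf", "vllm"):
        return True  # other backends: only chat mode is allowed
    if force_base_prompt:
        return False  # corresponds to passing --force-base-prompt
    # Default: chat mode if and only if the tokenizer ships a chat template.
    return getattr(tokenizer, "chat_template", None) is not None


if __name__ == "__main__":
    # Stand-in for a base model whose tokenizer still carries a chat template.
    base_with_template = SimpleNamespace(chat_template="{{ messages }}")
    print(use_chat_prompt(base_with_template, "hf"))
    print(use_chat_prompt(base_with_template, "hf", force_base_prompt=True))
```

The second call shows the case the note warns about: without the flag, a base model that happens to ship a chat template would be prompted as a chat model.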

Enable Flash Attention 2

```bash
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend hf \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
```

vllm backend:

```bash
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --backend vllm \
                  --tp [TENSOR_PARALLEL_SIZE] \
                  --greedy
```

OpenAI-compatible servers (e.g., vLLM):

```bash
# OpenAI models
export OPENAI_API_KEY="{KEY}" # https://platform.openai.com/settings/organization/api-keys
evalplus.evaluate --model "gpt-4o-2024-08-06" \
                  --dataset [humaneval|mbpp] \
                  --backend openai --greedy

# DeepSeek
export OPENAI_API_KEY="{KEY}" # https://platform.deepseek.com/api_keys
evalplus.evaluate --model "deepseek-chat" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.deepseek.com \
                  --backend openai --greedy

# Grok
export OPENAI_API_KEY="{KEY}" # https://console.x.ai/
evalplus.evaluate --model "grok-beta" \
                  --dataset [humaneval|mbpp] \
                  --base-url https://api.x.ai/v1 \
                  --backend openai --greedy

# vLLM server
# First, launch a vLLM server: https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp] \
                  --base-url http://localhost:8000/v1 \
                  --backend openai --greedy

# GPTQModel
evalplus.evaluate --model "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1" \
                  --dataset [humaneval|mbpp] \
                  --backend gptqmodel --greedy
```
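
The OpenAI-compatible examples above differ only in the model name and the `--base-url` value. A minimal sketch of assembling those commands programmatically follows. The helper `evaluate_argv` and the provider keys of `BASE_URLS` are hypothetical names introduced for illustration; the flags and URLs are copied from the examples above.

```python
import shlex

# Base URLs copied from the examples above; None means no --base-url flag,
# as in the plain OpenAI example. The dictionary keys are illustrative labels.
BASE_URLS = {
    "openai": None,
    "deepseek": "https://api.deepseek.com",
    "grok": "https://api.x.ai/v1",
    "vllm-server": "http://localhost:8000/v1",
}


def evaluate_argv(model: str, dataset: str, provider: str) -> list:
    """Return the argv list for one `evalplus.evaluate` run (hypothetical helper)."""
    argv = ["evalplus.evaluate", "--model", model, "--dataset", dataset]
    base_url = BASE_URLS[provider]
    if base_url is not None:
        argv += ["--base-url", base_url]
    argv += ["--backend", "openai", "--greedy"]
    return argv


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) once OPENAI_API_KEY is exported.
    print(shlex.join(evaluate_argv("deepseek-chat", "humaneval", "deepseek")))
```

Building the command as a list rather than a single string avoids shell-quoting problems when model names contain unusual characters.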

OpenAI models

Access OpenAI APIs from OpenAI Console

```bash
export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o" \
                  --dataset [humaneval|mbpp] \
                  --backend openai \
                  --greedy
```

Anthropic models

Access Anthropic APIs from Anthropic Console

```bash
export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp] \
                  --backend anthropic \
                  --greedy
```

Google Gemini models

Access Gemini APIs from Google AI Studio

```bash
export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro" \
                  --dataset [humaneval|mbpp] \
                  --backend google \
                  --greedy
```

Amazon Bedrock models

Amazon Bedrock

```bash
export BEDROCK_ROLE_ARN="[BEDROCK_ROLE_ARN]"
evalplus.evaluate --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
                  --dataset [humaneval|mbpp] \
                  --backend bedrock \
                  --greedy
```

You can check out the generations and results at `evalplus_results/[humaneval|mbpp]/`.
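
To get a quick overview of what a run produced, the `.jsonl` files in that directory can be listed and their records counted. The sketch below assumes only that each non-empty line of a `.jsonl` file is one JSON record; the helper name `summarize_results` is hypothetical, and the record fields are deliberately not inspected because their schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def summarize_results(root: str = "evalplus_results", dataset: str = "humaneval") -> dict:
    """Map each .jsonl file under root/dataset to its number of JSON records.

    Hypothetical helper, not an EvalPlus API.
    """
    directory = Path(root, dataset)
    if not directory.is_dir():
        return {}
    summary = {}
    for path in sorted(directory.glob("*.jsonl")):
        count = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed record
                    count += 1
        summary[path.name] = count
    return summary


if __name__ == "__main__":
    # Demo on a throwaway directory so the sketch runs without real results.
    with tempfile.TemporaryDirectory() as tmp:
        demo_dir = Path(tmp, "humaneval")
        demo_dir.mkdir()
        Path(demo_dir, "demo.jsonl").write_text('{"k": 1}\n{"k": 2}\n', encoding="utf-8")
        print(summarize_results(tmp, "humaneval"))
```

Pointing `root` at the real `evalplus_results` directory after a run gives one entry per generated sample file.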

⏬ Using EvalPlus as a local repo?

```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```

📚 Documents

To learn more about how to use EvalPlus, please refer to:

📜 Citation

```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```

🙏 Acknowledgement

Owner

Login: OpenDataArena
Kind: user

Repositories: 1
Profile: https://github.com/OpenDataArena

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: EvalPlus
authors:
  - family-names: EvalPlus Team
url: https://github.com/evalplus/evalplus
doi: https://doi.org/10.48550/arXiv.2305.01210
date-released: 2023-05-01
license: Apache-2.0
preferred-citation:
  type: article
  title: "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation"
  authors:
    - family-names: Liu
      given-names: Jiawei
    - family-names: Xia
      given-names: Chunqiu Steven
    - family-names: Wang
      given-names: Yuyao
    - family-names: Zhang
      given-names: Lingming
  year: 2023
  journal: "arXiv preprint arXiv:2305.01210"
  doi: https://doi.org/10.48550/arXiv.2305.01210
  url: https://arxiv.org/abs/2305.01210

GitHub Events

Total

Push event: 1
Create event: 1

Last Year

Push event: 1
Create event: 1

Dependencies

Dockerfile docker

python 3.11-slim build

pyproject.toml pypi

requirements-evalperf.txt pypi

Pympler *
cirron *

requirements.txt pypi

anthropic *
appdirs *
boto3 *
datasets ==3.6.0
fire *
google-generativeai *
multipledispatch *
numpy *
openai *
psutil *
rich *
tempdir *
termcolor *
tqdm *
transformers *
tree-sitter ==0.21.3
tree-sitter-python *
vllm *
wget *

tests/requirements.txt pypi

pytest * test

tools/requirements.txt pypi

astor *
black *
matplotlib *
numpy *
rich *
tempdir *
termcolor *
tqdm *

tools/tsr/requirements.txt pypi

coverage *
mutmut ==2.1.0
rich *
