bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Science Score: 64.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ✓ Committers with academic emails: 1 of 23 committers (4.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Keywords
Repository
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
Basic Info
- Host: GitHub
- Owner: bigcode-project
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://bigcode-bench.github.io/
- Size: 6.52 MB
Statistics
- Stars: 416
- Watchers: 7
- Forks: 52
- Open Issues: 22
- Releases: 14
Topics
Metadata Files
README.md
BigCodeBench
💥 Impact • 📰 News • 🔥 Quick Start • 🚀 Remote Evaluation • 💻 LLM-generated Code • 🧑 Advanced Usage • 📰 Result Submission • 📜 Citation
🎉 Check out our latest work!
🌟 SWE Arena 🌟
🚀 Open Evaluation Platform on AI for Software Engineering 🚀
✨ 100% free to use the latest frontier models! ✨
💥 Impact
BigCodeBench has been trusted by many LLM teams, including:
- Zhipu AI
- Alibaba Qwen
- DeepSeek
- Amazon AWS AI
- Snowflake AI Research
- ServiceNow Research
- Meta AI
- Cohere AI
- Sakana AI
- Allen Institute for Artificial Intelligence (AI2)
📰 News
- [2025-01-22] We are releasing bigcodebench==v0.2.2.dev2, with 163 models evaluated!
- [2024-10-06] We are releasing bigcodebench==v0.2.0!
- [2024-10-05] We create a public code execution API on the Hugging Face space.
- [2024-10-01] We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the leaderboard!
- [2024-08-19] To make the evaluation fully reproducible, we add a real-time code execution session to the leaderboard. It can be viewed here.
- [2024-08-02] We release bigcodebench==v0.1.9.
🌸 About
BigCodeBench
BigCodeBench is an easy-to-use benchmark for solving practical and challenging tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.
There are two splits in BigCodeBench:
- Complete: This split is designed for code completion based on comprehensive docstrings.
- Instruct: This split works for instruction-tuned and chat models only; the models are asked to generate a code snippet based on natural language instructions. The instructions contain only the necessary information and require more complex reasoning. (A sketch for inspecting the splits follows this list.)
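If you want to inspect the benchmark data directly, it is distributed as a dataset on the Hugging Face Hub. A minimal sketch, assuming the dataset id bigcode/bigcodebench (the exact id and split names may differ across releases):
```python
# Minimal sketch, not part of the official tooling: the dataset id below is an
# assumption, and the available splits/columns are printed rather than assumed.
from datasets import load_dataset

ds = load_dataset("bigcode/bigcodebench")  # assumed dataset id
for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)
```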
Why BigCodeBench?
BigCodeBench focuses on task automation via code generation with diverse function calls and complex instructions, with:
- ✨ Precise evaluation & ranking: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
- ✨ Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!
🔥 Quick Start
To get started, please first set up the environment:
```bash
# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade

# We suggest using flash-attn for generating code samples.
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problems, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
```
⏬ Install nightly version
🚀 Remote Evaluation
We use greedy decoding as an example to show how to evaluate the generated code samples via the remote API.
[!Warning]
To ease generation, we use batch inference by default. However, batch inference results can vary across batch sizes and versions, at least for the vLLM backend. If you want more deterministic results for greedy decoding, please set --bs to 1.

[!Note]
The gradio backend typically takes 6-7 minutes on BigCodeBench-Full and 4-5 minutes on BigCodeBench-Hard. The e2b backend with the default machine typically takes 25-30 minutes on BigCodeBench-Full and 15-20 minutes on BigCodeBench-Hard.
```bash
bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution [e2b|gradio|local] \
  --split [complete|instruct] \
  --subset [full|hard] \
  --backend [vllm|openai|anthropic|google|mistral|hf|hf-inference]
```
- All the resulting files will be stored in a folder named bcb_results.
- The generated code samples will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl.
- The evaluation results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json.
- The pass@k results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json. (A sketch of how pass@k is typically computed follows this list.)
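For reference, pass@k is typically estimated with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); BigCodeBench's own implementation may differ in details. A minimal numpy sketch:
```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# n: total samples generated per task, c: samples that pass all tests, k: budget.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions passes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 10 samples per task, 3 passing -> pass@1 estimate of 0.3
print(round(pass_at_k(n=10, c=3, k=1), 3))
```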
[!Note]
The gradio backend is hosted on the Hugging Face space by default. The default space can sometimes be slow, so we recommend using the gradio backend with a cloned bigcodebench-evaluator endpoint for faster evaluation. Otherwise, you can also use the e2b sandbox for evaluation, which is also fairly slow on the default machine.

[!Note]
BigCodeBench uses different prompts for base and chat models. By default, the prompt type is detected via tokenizer.chat_template when using hf/vllm as the backend. For other backends, only chat mode is allowed. Therefore, if your base model comes with a tokenizer.chat_template, please add --direct_completion to avoid it being evaluated in chat mode.
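As a quick way to check whether your base model ships a chat template (and therefore needs --direct_completion), you can inspect the tokenizer directly. A minimal sketch, with an illustrative model id:
```python
# Minimal sketch (the model id is illustrative): check for a chat template on a
# Hugging Face tokenizer to decide whether --direct_completion is needed.
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # hypothetical base-model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

if getattr(tokenizer, "chat_template", None):
    print("chat_template present: pass --direct_completion to evaluate as a base model")
else:
    print("no chat_template: the model will be evaluated in completion mode")
```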
To use E2B, you need to set up an account and get an API key from E2B.
```bash
export E2B_API_KEY=<your_e2b_api_key>
```
Access OpenAI APIs from OpenAI Console
```bash
export OPENAI_API_KEY=<your_openai_api_key>
```
Access Anthropic APIs from Anthropic Console
```bash
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
```
Access Mistral APIs from Mistral Console
```bash
export MISTRAL_API_KEY=<your_mistral_api_key>
```
Access Gemini APIs from Google AI Studio
```bash
export GOOGLE_API_KEY=<your_google_api_key>
```
Access the Hugging Face Serverless Inference API
```bash
export HF_INFERENCE_API_KEY=<your_hf_api_key>
```
Please make sure your HF access token has the "Make calls to inference providers" permission.
💻 LLM-generated Code
We share pre-generated code samples from the LLMs we have evaluated on the full set:
- See the attachment of our v0.2.4 release. We include sanitized_samples_calibrated.zip for your convenience (a sketch for loading such a file follows).
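Each pre-generated sample file is a .jsonl file, so every line should be one JSON record. A minimal sketch for loading one locally (the filename is illustrative, built from the naming pattern described above):
```python
# Minimal sketch: load a pre-generated sample file. The filename is illustrative
# and follows the naming pattern described earlier in this README.
import json

path = "meta-llama--Meta-Llama-3.1-8B-Instruct--bigcodebench-instruct--vllm-0-1-sanitized_calibrated.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records; fields: {sorted(records[0].keys())}")
```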
🧑 Advanced Usage
Please refer to the ADVANCED USAGE for more details.
📰 Result Submission
Please email both the generated code samples and the execution results to terry.zhuo@monash.edu if you would like to contribute your model to the leaderboard. Note that the file names should be in the format of `[model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl` and `[model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json`. You can file an issue to remind us if we do not respond to your email within 3 days.
📜 Citation
```bibtex
@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}
```
🙏 Acknowledgement
Owner
- Name: BigCode Project
- Login: bigcode-project
- Kind: organization
- Email: contact@bigcode-project.org
- Website: https://www.bigcode-project.org/
- Twitter: BigCodeProject
- Repositories: 26
- Profile: https://github.com/bigcode-project
BigCode Project is an open scientific collaboration run by Hugging Face and ServiceNow Research, focused on open and responsible development of LLMs for code.
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: BigCodeBench
authors:
  - family-names: BigCodeBench Team
url: https://github.com/bigcode-project/bigcodebench
doi:
date-released: 2024-06-18
license: Apache-2.0
preferred-citation:
  type: article
  title: "Benchmarking Code Generation with Diverse Function Calls and Complex Instructions"
  authors:
    - family-names: BigCodeBench Team
  year: 2024
  journal:
  doi:
  url:
```
GitHub Events
Total
- Create event: 24
- Release event: 4
- Issues event: 70
- Watch event: 186
- Issue comment event: 123
- Push event: 115
- Pull request event: 23
- Fork event: 27
Last Year
- Create event: 24
- Release event: 4
- Issues event: 70
- Watch event: 186
- Issue comment event: 123
- Push event: 115
- Pull request event: 23
- Fork event: 27
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Terry Zhuo | t****5@g****m | 471 |
| ganler | j****u@g****m | 332 |
| Yuyao Wang | y****f@o****m | 250 |
| steven | s****0@g****m | 68 |
| Roy Hvaara | r****y@l****o | 11 |
| Junhao Wang | 7****z | 9 |
| Indraneil Paul | i****l@I****l | 7 |
| Songrun Xie | 7****e | 5 |
| marianna13 | m****0@g****m | 4 |
| Yuxiang Wei | y****i@g****m | 4 |
| Nalin Abrol | 3****l | 2 |
| AnitaLiu98 | 4****8 | 1 |
| Denis Akhiyarov | d****v@g****m | 1 |
| Jiale Tom Tian | 4****n | 1 |
| LRL | l****l@l****v | 1 |
| Naman Jain | n****n@g****m | 1 |
| Pepijn | 9****s | 1 |
| Yuyao Wang | y****6@o****m | 1 |
| Terry Zhuo (Monash University) | t****1@m****u | 1 |
| Sanjay Krishna Gouda | s****a@a****m | 1 |
| Jiawei Liu | j****6@i****u | 1 |
| Chengyu Dong | c****d@n****m | 1 |
| fly_dust | f****8@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 79
- Total pull requests: 49
- Average time to close issues: 11 days
- Average time to close pull requests: 8 days
- Total issue authors: 46
- Total pull request authors: 16
- Average comments per issue: 2.01
- Average comments per pull request: 0.61
- Merged pull requests: 37
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 54
- Pull requests: 25
- Average time to close issues: 12 days
- Average time to close pull requests: 15 days
- Issue authors: 35
- Pull request authors: 10
- Average comments per issue: 1.52
- Average comments per pull request: 0.52
- Merged pull requests: 17
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ethanc8 (10)
- terryyz (9)
- BradKML (5)
- hvaara (4)
- dmelcer9 (3)
- TomLucidor (3)
- blazgocompany (2)
- RitwikGupta (2)
- BenPortner (2)
- arsh-team (2)
- JohnLins (2)
- pengzhangzhi (1)
- tzurtutjuzrtzurtzurtz (1)
- wakamex (1)
- shwinshaker (1)
Pull Request Authors
- terryyz (16)
- hvaara (10)
- zhangchen-xu (2)
- lapidshay (2)
- LRL-ModelCloud (2)
- Devy99 (2)
- marianna13 (2)
- shwinshaker (2)
- egor-bogomolov (2)
- kanishkg (2)
- sk-g (2)
- alexazhou (1)
- iNeil77 (1)
- imamnurby (1)
- KMasaki0210 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: pypi 456 last-month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 58
- Total maintainers: 1
proxy.golang.org: github.com/bigcode-project/bigcodebench
- Documentation: https://pkg.go.dev/github.com/bigcode-project/bigcodebench#section-documentation
- License: apache-2.0
- Latest release: v0.2.5 (published 9 months ago)
Rankings
pypi.org: bigcodebench
"Evaluation package for BigCodeBench"
- Homepage: https://github.com/bigcode-project/bigcodebench
- Documentation: https://bigcodebench.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.2.5 (published 9 months ago)
Rankings
Maintainers (1)
Dependencies
- python 3.10-slim build
- accelerate *
- anthropic *
- appdirs *
- fire *
- mistralai *
- multipledispatch *
- numpy *
- openai *
- rich *
- stop-sequencer *
- tempdir *
- termcolor *
- tqdm *
- tree_sitter_languages *
- vllm *
- wget *
- pytest * test