Recent Releases of bigcodebench
bigcodebench - Release BigCodeBench v0.2.5
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.4...v0.2.5
- Python
Published by terryyz about 1 year ago
bigcodebench - Release BigCodeBench v0.2.4
What's Changed
- fix make_raw_chat_prompt when prefill is disabled by @zhangchen-xu in https://github.com/bigcode-project/bigcodebench/pull/75
- Specify a unique cache directory before each code execution by @shwinshaker in https://github.com/bigcode-project/bigcodebench/pull/77
- fix E2b execution debug by @terryyz in https://github.com/bigcode-project/bigcodebench/pull/79
- fix e2b by @terryyz in https://github.com/bigcode-project/bigcodebench/pull/80
- Add support for Hugging Face Serverless Inference by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/85
- Reintroduce progress checker from #48 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/86
- Fixes for tasks 211 and 215 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/49
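The "unique cache directory before each code execution" item above can be illustrated with a small sketch. This is not the code from the pull request: the helper name isolated_cache_dir and the choice of environment variable are assumptions made for illustration only. The idea is simply that each execution gets a fresh directory, so concurrent runs cannot clobber each other's cached files.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_cache_dir(env_var="MPLCONFIGDIR"):
    """Point `env_var` at a fresh temporary directory for one execution.

    Both the helper and the default variable name are hypothetical; the
    release notes do not say which cache location the fix targets.
    """
    previous = os.environ.get(env_var)
    with tempfile.TemporaryDirectory(prefix="bcb_cache_") as cache_dir:
        os.environ[env_var] = cache_dir
        try:
            yield cache_dir
        finally:
            # Restore the caller's environment before the directory is deleted.
            if previous is None:
                os.environ.pop(env_var, None)
            else:
                os.environ[env_var] = previous
```

A caller would wrap each execution in `with isolated_cache_dir(): ...`; the directory is removed automatically when the block exits.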
New Contributors
- @zhangchen-xu made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/75
- @shwinshaker made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/77
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.3...v0.2.4
- Python
Published by terryyz about 1 year ago
bigcodebench - Release BigCodeBench v0.2.3.post1
What's Changed
- Fix Docker image and its dependencies
- Support more models with reasoning effort
- Optional chat prefilling
- E2B, Gradio, and Local code execution
Evaluated LLMs (173 models)
- o3-mini
- DeepSeek R1
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post7...v0.2.3.post1
- Python
Published by terryyz over 1 year ago
bigcodebench - v0.2.1.post7
What's Changed
- Fix Docker image and its dependencies
- Fix o1 concurrent generation output collection
- Update the code sanitization
Evaluated LLMs (157 models)
- o1-2024-12-17
- Gemini-2.0 series
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post3...v0.2.1.post7
- Python
Published by terryyz over 1 year ago
bigcodebench - BigCodeBench v0.2.1.post2
What's Changed
- Fix calibration setting in the code evaluation.
- Add --no_execute argument for code evaluation.
- Support concurrent API inference for o1 and deepseek-chat.
- Fix API inference for Google Gemini.
- Add --instruction_prefix and --response_prefix arguments for code generation.
- Change --id_range input type.
- Add --revision arguments for code generation.
Evaluated LLMs (144 models)
- Qwen2.5-Coder-32B-Instruct
- grok-beta
- claude-3-5-haiku-20241022
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.0...v0.2.1.post2
- Python
Published by terryyz over 1 year ago
bigcodebench - Release BigCodeBench v0.2.0
Breaking Change
- No more waiting! The evaluation now fully supports batch inference!
- No more environment configs! The code execution is done by a remote API endpoint by default, and can be customized.
- No more multiple commands! bigcodebench.evaluate will be good enough to handle most cases.
What's Changed
- add multiprocessing support for sanitization step by @sk-g in https://github.com/bigcode-project/bigcodebench/pull/37
- Remove extra period in task BigCodeBench/16 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/38
- Await futures in progress checker by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/48
- A few args have been added to this version, including --direct_completion and --local_execute. See Advanced Usage for the details.
Dataset maintenance
- The benchmark data has been bumped to v0.1.2. You can load the dataset with from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")
- BigCodeBench/16: removed period
- BigCodeBench/37: added pandas requirement
- BigCodeBench/178: removed urllib requirement
- BigCodeBench/241: added required plot title
- BigCodeBench/267: added required plot title
- BigCodeBench/760: changed the import of datetime
- BigCodeBench/1006: replaced test links due to the potential connection block
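As a minimal sketch of working with the bumped dataset, the snippet below wraps the Hugging Face load_dataset call and picks out the tasks touched in this release. The helper names load_bigcodebench and select_tasks are hypothetical, and the task_id field name is an assumption about the dataset schema rather than something stated in these notes.

```python
def load_bigcodebench(version="v0.1.2"):
    """Load one released version of the benchmark from the Hugging Face Hub.

    Requires network access and the `datasets` package.
    """
    # Imported lazily so select_tasks() stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("bigcode/bigcodebench", split=version)


def select_tasks(records, task_ids):
    """Return the records whose `task_id` is in `task_ids`, in original order.

    Assumes each record is a mapping with a `task_id` key such as
    "BigCodeBench/16" (an assumption about the schema).
    """
    wanted = set(task_ids)
    return [record for record in records if record["task_id"] in wanted]


# Example (commented out because it downloads data):
# ds = load_bigcodebench()
# changed = select_tasks(ds, ["BigCodeBench/16", "BigCodeBench/37"])
```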
New Contributors
- @sk-g made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/37
- @hvaara made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/38
Evaluated LLMs (139 models)
- o1-Preview-2024-09-12 (temperature=1)
- Gemini-1.5-Pro-002
- Llama-3.1 models
- DeepSeek-V2.5
- Qwen-2.5 models
- Qwen-2.5-Coder models
- and more
PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.9...v0.2.0.post3
- Python
Published by terryyz over 1 year ago
bigcodebench - Release BigCodeBench v0.1.9
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.9
- Python
Published by terryyz almost 2 years ago
bigcodebench - Release BigCodeBench v0.1.8
Features:
- Support BigCodeBench-Hard subset: https://github.com/bigcode-project/bigcodebench/pull/17
- Identify and fix tokenizer setup: https://github.com/bigcode-project/bigcodebench/issues/21
- Customize the tokenizer: https://github.com/bigcode-project/bigcodebench/pull/20
- Add the pass rate result log: https://github.com/bigcode-project/bigcodebench/pull/20
Contributors:
- @marianna13: https://github.com/bigcode-project/bigcodebench/pull/20
Models:
- A total of 96 models at the time of the release
Acknowledgement:
- @ethanc8
- @takkyu2
- @imamnurby
Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.8
- Python
Published by terryyz almost 2 years ago
bigcodebench - Release v0.1.7.post2
- Enhanced the calculation of ground truth pass rate, and addressed the issue mentioned in https://github.com/bigcode-project/bigcodebench/pull/12#issuecomment-2199186199.
- Update the README docs.
- Python
Published by terryyz almost 2 years ago
bigcodebench - Release BigCodeBench v0.1.7
Fix some identified issues:
- The ground truth pass rate was not previously computed in the correct way.
- Passed RAM limits would raise errors, as they were set as float type.
- User permission was not correctly set up in the Evaluate Docker.
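The RAM-limit issue can be illustrated with a short sketch. It assumes the limits are applied through Python's resource.setrlimit, which is documented to take a tuple of two integers; the notes do not say which mechanism BigCodeBench uses, and the helper names here are hypothetical.

```python
def gib_to_bytes(limit_gib):
    """Convert a (possibly fractional) GiB value to an integer byte count."""
    if limit_gib <= 0:
        raise ValueError("memory limit must be positive")
    # Truncate to int: limit values must be integers, not floats.
    return int(limit_gib * 1024 ** 3)


def apply_address_space_limit(limit_gib):
    """Cap this process's address space (Unix only).

    Illustrative only: the benchmark may set limits differently.
    """
    import resource  # not available on Windows

    soft = gib_to_bytes(limit_gib)
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed the existing hard limit.
        soft = min(soft, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```

Converting once at the boundary keeps command-line values such as 1.5 usable while guaranteeing the limit call only ever sees integers.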
Features:
- --check-gt-only will print out the pass rate when finishing.
- Python
Published by terryyz almost 2 years ago
bigcodebench - Release BigCodeBench v0.1.6
New features:
- The RAM setup is now adjustable via specific arguments.
- Parallel ground truth checking is supported. Potentially failed checks are skipped during execution. A warning will be issued if the ground truth pass rate falls below 0.95.
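The ground truth check described above can be sketched as follows. This is an illustration, not the project's implementation: the function name, the result encoding (True, False, or None for a skipped check), and the choice to exclude skipped checks from the denominator are all assumptions. Only the 0.95 threshold comes from the release note.

```python
import warnings


def ground_truth_pass_rate(results, threshold=0.95):
    """Compute the ground truth pass rate and warn if it is too low.

    `results` maps a task id to True (passed), False (failed), or
    None (check skipped). Skipped checks are ignored, which is an
    assumption about how "skipped" should be counted.
    """
    completed = [passed for passed in results.values() if passed is not None]
    if not completed:
        raise ValueError("no ground truth checks were completed")
    pass_rate = sum(completed) / len(completed)
    if pass_rate < threshold:
        warnings.warn(
            f"ground truth pass rate {pass_rate:.3f} is below {threshold}",
            RuntimeWarning,
        )
    return pass_rate
```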
- Python
Published by terryyz almost 2 years ago
bigcodebench - Release BigCodeBench v0.1.5
New features:
- The data is downloaded from HF hub by default.
- Data formats have been unified for the one on HF and the one on GitHub.
- Python
Published by terryyz almost 2 years ago