Recent Releases of bigcodebench

bigcodebench - Release BigCodeBench v0.2.5

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.4...v0.2.5

- Python
Published by terryyz about 1 year ago

bigcodebench - Release BigCodeBench v0.2.4

What's Changed

  • fix make_raw_chat_prompt when prefill is disabled by @zhangchen-xu in https://github.com/bigcode-project/bigcodebench/pull/75
  • Specify a unique cache directory before each code execution by @shwinshaker in https://github.com/bigcode-project/bigcodebench/pull/77
  • fix E2b execution debug by @terryyz in https://github.com/bigcode-project/bigcodebench/pull/79
  • fix e2b by @terryyz in https://github.com/bigcode-project/bigcodebench/pull/80
  • Add support for Hugging Face Serverless Inference by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/85
  • Reintroduce progress checker from #48 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/86
  • Fixes for tasks 211 and 215 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/49

New Contributors

  • @zhangchen-xu made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/75
  • @shwinshaker made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/77

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.3...v0.2.4

- Python
Published by terryyz about 1 year ago

bigcodebench - Release BigCodeBench v0.2.3.post1

What's Changed

  • Fix Docker image and its dependencies
  • Support more models with reasoning effort
  • Optional chat prefilling
  • E2B, Gradio, and Local code execution

Evaluated LLMs (173 models)

  • o3-mini
  • DeepSeek R1

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post7...v0.2.3.post1

- Python
Published by terryyz over 1 year ago

bigcodebench - v0.2.1.post7

What's Changed

  • Fix Docker image and its dependencies
  • Fix o1 concurrent generation output collection
  • Update the code sanitization

Evaluated LLMs (157 models)

  • o1-2024-12-17
  • Gemini-2.0 series

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post3...v0.2.1.post7

- Python
Published by terryyz over 1 year ago

bigcodebench - BigCodeBench v0.2.1.post2

What's Changed

  • Fix calibration setting in the code evaluation.
  • Add --no_execute argument for code evaluation.
  • Support concurrent API inference for o1 and deepseek-chat.
  • Fix API inference for Google Gemini.
  • Add --instruction_prefix and --response_prefix arguments for code generation.
  • Change --id_range input type.
  • Add --revision argument for code generation.

Evaluated LLMs (144 models)

  • Qwen2.5-Coder-32B-Instruct
  • grok-beta
  • claude-3-5-haiku-20241022

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.0...v0.2.1.post2

- Python
Published by terryyz over 1 year ago

bigcodebench - Release BigCodeBench v0.2.0

Breaking Change

  • No more waiting! The evaluation now fully supports batch inference!
  • No more environment configs! The code execution is done by a remote API endpoint by default, and can be customized.
  • No more multiple commands! bigcodebench.evaluate will be good enough to handle most cases.

What's Changed

  • add multiprocessing support for sanitization step by @sk-g in https://github.com/bigcode-project/bigcodebench/pull/37
  • Remove extra period in task BigCodeBench/16 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/38
  • Await futures in progress checker by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/48
  • A few args have been added to this version, including --direct_completion and --local_execute. See Advanced Usage for the details.

Dataset maintenance

  • The benchmark data has been bumped to v0.1.2. You can load the dataset with from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")
  • BigCodeBench/16: removed period
  • BigCodeBench/37: added pandas requirement
  • BigCodeBench/178: removed urllib requirement
  • BigCodeBench/241: added required plot title
  • BigCodeBench/267: added required plot title
  • BigCodeBench/760: changed the import of datetime
  • BigCodeBench/1006: replaced test links due to the potential connection block

New Contributors

  • @sk-g made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/37
  • @hvaara made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/38

Evaluated LLMs (139 models)

  • o1-Preview-2024-09-12 (temperature=1)
  • Gemini-1.5-Pro-002
  • Llama-3.1 models
  • DeepSeek-V2.5
  • Qwen-2.5 models
  • Qwen-2.5-Coder models
  • and more

PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.9...v0.2.0.post3

- Python
Published by terryyz over 1 year ago

bigcodebench - Release BigCodeBench v0.1.9

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.9

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.8

Features:

  • Support BigCodeBench-Hard subset: https://github.com/bigcode-project/bigcodebench/pull/17
  • Identify and fix tokenizer setup: https://github.com/bigcode-project/bigcodebench/issues/21
  • Customize the tokenizer: https://github.com/bigcode-project/bigcodebench/pull/20
  • Add the pass rate result log: https://github.com/bigcode-project/bigcodebench/pull/20

Contributors:

  • @marianna13: https://github.com/bigcode-project/bigcodebench/pull/20

Models:

  • A total of 96 models at the time of the release

Acknowledgement:

  • @ethanc8
  • @takkyu2
  • @imamnurby

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.8

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release v0.1.7.post2

  • Enhanced the calculation of ground truth pass rate, and addressed the issue mentioned in https://github.com/bigcode-project/bigcodebench/pull/12#issuecomment-2199186199.
  • Update the README docs.

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.7

Fixes for identified issues:

  • The ground truth pass rate was previously computed incorrectly.
  • RAM limits passed as arguments raised errors because they were set as float type.
  • User permissions were not set up correctly in the Evaluate Docker.
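
The float-typed RAM limit issue can be illustrated with a minimal sketch. It assumes the limit is given in GiB and is enforced as an integer byte count through Python's resource module; neither detail is confirmed by these notes, and the helper names ram_limit_bytes and apply_address_space_limit are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def ram_limit_bytes(limit_gib):
    """Convert a RAM limit in GiB (possibly a float) to an integer byte count."""
    if limit_gib <= 0:
        raise ValueError("RAM limit must be positive")
    # Resource limits are integer byte counts, so cast explicitly.
    return int(limit_gib * GIB)


def apply_address_space_limit(limit_gib):
    """Cap this process's address space (Unix only, hypothetical helper)."""
    import resource  # imported lazily because the module is Unix-only

    limit = ram_limit_bytes(limit_gib)
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```

Casting once at the boundary keeps the rest of the code free to accept fractional values such as 1.5 from the command line.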

Features:

  • --check-gt-only will print out the pass rate when finishing.

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.6

New features:

  • The RAM setup is now adjustable via specific arguments.
  • Parallel ground truth checking is supported. Potentially failed checks are skipped during execution. A warning will be issued if the ground truth pass rate falls below 0.95.
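
The parallel check and the 0.95 warning threshold could look roughly like the sketch below. It is not the project's implementation: check_ground_truth and run_check are hypothetical names, a thread pool stands in for whatever parallelism the tool uses, and a check that raises is counted as a failure here, which is only one reading of "skipped".

```python
import warnings
from concurrent.futures import ThreadPoolExecutor

GT_PASS_RATE_THRESHOLD = 0.95  # threshold quoted in the release note


def check_ground_truth(tasks, run_check, max_workers=4):
    """Run run_check on every task in parallel and return the pass rate.

    run_check is a hypothetical callable that returns True when a task's
    reference solution passes its tests.
    """

    def safe_check(task):
        # Assumption: a check that raises counts as failed instead of
        # aborting the whole run.
        try:
            return bool(run_check(task))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_check, tasks))

    pass_rate = sum(results) / len(results) if results else 0.0
    if pass_rate < GT_PASS_RATE_THRESHOLD:
        warnings.warn(
            f"Ground truth pass rate {pass_rate:.3f} is below "
            f"{GT_PASS_RATE_THRESHOLD}"
        )
    return pass_rate
```

A caller would pass its own task list and checker, for example check_ground_truth(task_ids, my_checker), and treat the warning as a signal that the execution environment may be misconfigured.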

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.5

New features:

  • The data is downloaded from HF hub by default.
  • The data formats on HF and on GitHub have been unified.

- Python
Published by terryyz almost 2 years ago

bigcodebench - BigCodeBench v0.1.2

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release v0.1.0

- Python
Published by terryyz about 2 years ago