bigcodebench - Release BigCodeBench v0.2.5

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.4...v0.2.5

- Python
Published by terryyz about 1 year ago

bigcodebench - Release BigCodeBench v0.2.4

What's Changed

fix make_raw_chat_prompt when prefill is disabled by @zhangchen-xu in https://github.com/bigcode-project/bigcodebench/pull/75
Specify a unique cache directory before each code execution by @shwinshaker in https://github.com/bigcode-project/bigcodebench/pull/77
fix E2b execution debug by @terryyz in https://github.com/bigcode-project/bigcodebench/pull/79
fix e2b by @terryyz in https://github.com/bigcode-project/bigcodebench/pull/80
Add support for Hugging Face Serverless Inference by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/85
Reintroduce progress checker from #48 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/86
Fixes for tasks 211 and 215 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/49
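
The "unique cache directory before each code execution" item can be pictured with the minimal sketch below. It is not the code from pull request 77: the `isolated_cache_dir` helper and the `EXAMPLE_CACHE_DIR` environment variable are hypothetical names chosen for illustration, and the sketch only assumes that the executed code reads its cache location from an environment variable.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_cache_dir(env_var="EXAMPLE_CACHE_DIR"):
    """Point env_var at a fresh temporary directory for the duration of the block.

    Both this helper and the EXAMPLE_CACHE_DIR name are hypothetical.
    """
    previous = os.environ.get(env_var)
    with tempfile.TemporaryDirectory(prefix="bcb_cache_") as cache_dir:
        os.environ[env_var] = cache_dir
        try:
            yield cache_dir
        finally:
            # Restore the earlier value so one run does not leak into the next.
            if previous is None:
                os.environ.pop(env_var, None)
            else:
                os.environ[env_var] = previous


# Two executions receive two different directories.
with isolated_cache_dir() as first:
    pass
with isolated_cache_dir() as second:
    pass
```

Giving every execution its own directory means that cache files written by one task cannot be read or clobbered by another task running at the same time.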

New Contributors

@zhangchen-xu made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/75
@shwinshaker made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/77

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.3...v0.2.4

- Python
Published by terryyz about 1 year ago

bigcodebench - Release BigCodeBench v0.2.3.post1

What's Changed

Fix Docker image and its dependencies
Support more models with reasoning effort
Optional chat prefilling
E2B, Gradio, and Local code execution

Evaluated LLMs (173 models)

o3-mini
DeepSeek R1

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post7...v0.2.3.post1

- Python
Published by terryyz over 1 year ago

bigcodebench - v0.2.1.post7

What's Changed

Fix Docker image and its dependencies
Fix o1 concurrent generation output collection
Update the code sanitization

Evaluated LLMs (157 models)

o1-2024-12-17
Gemini-2.0 series

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.1.post3...v0.2.1.post7

- Python
Published by terryyz over 1 year ago

bigcodebench - BigCodeBench v0.2.1.post2

What's Changed

Fix calibration setting in the code evaluation.
Add --no_execute argument for code evaluation.
Support concurrent API inference for o1 and deepseek-chat.
Fix API inference for Google Gemini.
Add --instruction_prefix and --response_prefix arguments for code generation.
Change --id_range input type.
Add --revision argument for code generation.
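
As a rough illustration of concurrent API inference, the sketch below sends several prompts in parallel and collects the outputs in prompt order. It is an assumption-laden example, not BigCodeBench's implementation: `fake_generate` is a hypothetical stand-in for a real API call, and `generate_concurrently` is a name invented here.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt):
    """Hypothetical stand-in for a network call to a model API."""
    return f"completion for: {prompt}"


def generate_concurrently(prompts, max_workers=4):
    """Send prompts in parallel and return outputs in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whichever call finishes first.
        return list(pool.map(fake_generate, prompts))


outputs = generate_concurrently(["task 0", "task 1", "task 2"])
```

Threads suit this kind of workload because each call spends most of its time waiting on the network, and `Executor.map` keeps each output aligned with its prompt.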

Evaluated LLMs (144 models)

Qwen2.5-Coder-32B-Instruct
grok-beta
claude-3-5-haiku-20241022

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.2.0...v0.2.1.post2

- Python
Published by terryyz over 1 year ago

bigcodebench - Release BigCodeBench v0.2.0

Breaking Change

No more waiting! The evaluation now fully supports batch inference!
No more environment configs! The code execution is done by a remote API endpoint by default, and can be customized.
No more multiple commands! bigcodebench.evaluate will be good enough to handle most cases.

What's Changed

add multiprocessing support for sanitization step by @sk-g in https://github.com/bigcode-project/bigcodebench/pull/37
Remove extra period in task BigCodeBench/16 by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/38
Await futures in progress checker by @hvaara in https://github.com/bigcode-project/bigcodebench/pull/48
A few args have been added to this version, including --direct_completion and --local_execute. See Advanced Usage for the details.

Dataset maintenance

The benchmark data has been bumped to v0.1.2. You can load the dataset with from datasets import load_dataset; ds = load_dataset("bigcode/bigcodebench", split="v0.1.2")
BigCodeBench/16: removed period
BigCodeBench/37: added pandas requirement
BigCodeBench/178: removed urllib requirement
BigCodeBench/241: added required plot title
BigCodeBench/267: added required plot title
BigCodeBench/760: changed the import of datetime
BigCodeBench/1006: replaced test links due to the potential connection block
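
A minimal sketch of loading a specific benchmark version, assuming the Hugging Face `datasets` package is installed and the Hub is reachable. The repository id and split name come from the note above; the two helper names are hypothetical.

```python
def bigcodebench_load_args(version="v0.1.2"):
    """Build the keyword arguments for datasets.load_dataset (hypothetical helper)."""
    return {"path": "bigcode/bigcodebench", "split": version}


def load_bigcodebench(version="v0.1.2"):
    """Download the requested benchmark version from the Hugging Face Hub."""
    # Imported lazily so the helper above works without the datasets package.
    from datasets import load_dataset

    return load_dataset(**bigcodebench_load_args(version))
```

Calling `ds = load_bigcodebench()` performs the download. Pinning the split to a version string keeps results comparable when the task data is revised later.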

New Contributors

@sk-g made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/37
@hvaara made their first contribution in https://github.com/bigcode-project/bigcodebench/pull/38

Evaluated LLMs (139 models)

o1-Preview-2024-09-12 (temperature=1)
Gemini-1.5-Pro-002
Llama-3.1 models
DeepSeek-V2.5
Qwen-2.5 models
Qwen-2.5-Coder models
and more

PyPI: https://pypi.org/project/bigcodebench/0.2.0.post3/

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.9...v0.2.0.post3

- Python
Published by terryyz over 1 year ago

bigcodebench - Release BigCodeBench v0.1.9

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.9

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.8

Features:
- Support BigCodeBench-Hard subset: https://github.com/bigcode-project/bigcodebench/pull/17
- Identify and fix tokenizer setup: https://github.com/bigcode-project/bigcodebench/issues/21
- Customize the tokenizer: https://github.com/bigcode-project/bigcodebench/pull/20
- Add the pass rate result log: https://github.com/bigcode-project/bigcodebench/pull/20

Contributors:
- @marianna13: https://github.com/bigcode-project/bigcodebench/pull/20

Models:
- A total of 96 models at the time of the release

Acknowledgement:
- @ethanc8
- @takkyu2
- @imamnurby

Full Changelog: https://github.com/bigcode-project/bigcodebench/compare/v0.1.8...v0.1.8

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release v0.1.7.post2

Enhanced the calculation of ground truth pass rate, and addressed the issue mentioned in https://github.com/bigcode-project/bigcodebench/pull/12#issuecomment-2199186199.
Updated the README docs.

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.7

Fix some identified issues:
- The ground truth pass rate was not previously computed in the correct way.
- Passed RAM limits would raise errors, as they were set as float type.
- User permission was not correctly set up in the Evaluate Docker.

Features:
- --check-gt-only will print out the pass rate when finishing.

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.6

New features:

The RAM setup is now adjustable via specific arguments.
Parallel ground truth checking is supported. Potentially failed checks are skipped during execution. A warning will be issued if the ground truth pass rate falls below 0.95.
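
The pass-rate warning can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `ground_truth_pass_rate` function and the task-id-to-boolean result mapping are hypothetical, and only the 0.95 threshold is taken from the note above.

```python
import warnings

GT_PASS_RATE_THRESHOLD = 0.95  # threshold named in the release note


def ground_truth_pass_rate(results):
    """Return the fraction of ground-truth checks that passed.

    results maps a task id to True/False; this shape is an assumption.
    """
    if not results:
        raise ValueError("no ground-truth results to summarize")
    rate = sum(results.values()) / len(results)
    if rate < GT_PASS_RATE_THRESHOLD:
        # Warn instead of failing so the evaluation can still proceed.
        warnings.warn(
            f"ground truth pass rate is {rate:.3f}, below {GT_PASS_RATE_THRESHOLD}"
        )
    return rate
```

A low ground-truth pass rate usually points at the execution environment, such as a missing dependency, and not at the model being evaluated, which is why a warning is a reasonable response.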

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release BigCodeBench v0.1.5

New features:

The data is downloaded from HF hub by default.
Data formats have been unified for the one on HF and the one on GitHub.

- Python
Published by terryyz almost 2 years ago

bigcodebench - BigCodeBench v0.1.2

- Python
Published by terryyz almost 2 years ago

bigcodebench - Release v0.1.0

- Python
Published by terryyz about 2 years ago
