https://github.com/amazon-science/llm-code-preference

Training and Benchmarking LLMs for Code Preference.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary

Keywords

code-generation llm-evaluation llm-training llms-benchmarking
Last synced: 5 months ago

Repository

Training and Benchmarking LLMs for Code Preference.

Basic Info
Statistics
  • Stars: 33
  • Watchers: 3
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Topics
code-generation llm-evaluation llm-training llms-benchmarking
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

Learning Code Preference via Synthetic Evolution

Web · arXiv

📰 TL;DR · 🔎 Evaluation · 🧪 Training · 🔮 Synthetic Data Generation · 📜 Citation · 🙏 Acknowledgement

📰 TL;DR

How to effectively and efficiently obtain code preferences and judgements is an important yet under-studied topic!

To this end, our work provides:

  • CodeFavor: an open recipe for training code preference models from synthetic, from-scratch data (see the sketch after this list)!
    • Commit-Instruct: code commits -> code preference pairs
    • Critic-Evol: code critique & revision -> code preference pairs
  • CodePrefBench: 1364 code preference tasks covering both verifiable and human objectives:
    • Code Correctness
    • Code Efficiency
    • Code Security
    • Human Preference
  • Study: our paper provides comprehensive studies!
    • Human studies: quantifying the cost and performance of human preference labeling with 18 developers
    • Case studies: the appendix presents case studies of model preferences on code correctness, efficiency, and security
    • Controlled experiments: the impact of data, comments, criteria, modeling choices, etc. on training preference models
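
To make the Commit-Instruct idea concrete, below is a minimal, hypothetical sketch of how an improving code commit can be turned into a single preference record. The field names (`instruction`, `chosen`, `rejected`) are illustrative assumptions, not the repository's actual JSONL schema.

```python
# Hypothetical sketch of the Commit-Instruct idea: an improving commit provides a
# pre-commit ("rejected") and post-commit ("chosen") version of the same code.
# Field names are illustrative; the repo's real decompose.jsonl schema may differ.

def commit_to_preference(instruction: str, code_before: str, code_after: str) -> dict:
    """Turn one improving commit into a code-preference record."""
    return {
        "instruction": instruction,   # task description distilled from the commit context
        "chosen": code_after,         # post-commit code (preferred)
        "rejected": code_before,      # pre-commit code (dispreferred)
        "source": "commit_instruct",
    }

example = commit_to_preference(
    "Read a file and return its lines without trailing newlines.",
    "def read_lines(p):\n    return open(p).readlines()",
    "def read_lines(p):\n    with open(p) as f:\n        return [line.rstrip('\\n') for line in f]",
)
```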

🔎 Evaluation

Environment

  • Python 3.10 or higher.

```bash
conda create -n codefavor python=3.10 -y
conda activate codefavor
pip install -r requirements.txt
```

CodePrefBench

```bash
# OpenAI server
python codefavor/evaluate.py --model-id "gpt-4o-2024-05-13" --model-type openai --concurrency 80

# Other OpenAI-compatible servers (vLLM, DeepSeek APIs, etc.)
python codefavor/evaluate.py --model-id "google/gemma-2-27b-it" --model-type openai --concurrency 80 --model-url http://localhost:8000/v1

# Claude models via Bedrock
python codefavor/evaluate.py --model-id "anthropic.claude-3-sonnet-20240229-v1:0" --model-type bedrock --concurrency 10

# Pairwise RM
python codefavor/evaluate.py --model-id ./models/mix-cls-mistral-7b-itbs32ep1_lr5e-6-l3-70b/checkpoint-688 --model-type pair-rm
```

  • Supported --model-type: huggingface, openai, bedrock, pair-rm, and google
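
For readers who want to probe a judge outside the harness, the sketch below shows one way to query a pairwise code preference from any OpenAI-compatible endpoint (e.g., a local vLLM server at `http://localhost:8000/v1`), asking in both candidate orders to reduce position bias. The prompt wording and aggregation are assumptions for illustration only; `codefavor/evaluate.py` implements its own protocol.

```python
# Minimal sketch: pairwise code-preference query against an OpenAI-compatible server.
# The prompt format and both-order aggregation are illustrative assumptions, not the
# exact protocol used by codefavor/evaluate.py.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g., a local vLLM server

def judge(task: str, code_a: str, code_b: str, model: str = "google/gemma-2-27b-it") -> str:
    prompt = (
        "Which candidate better solves the task? Answer with a single letter, A or B.\n\n"
        f"Task:\n{task}\n\nCandidate A:\n{code_a}\n\nCandidate B:\n{code_b}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], max_tokens=4
    )
    return resp.choices[0].message.content.strip()

task = "Return the maximum value in a non-empty list."
a = "def mx(xs):\n    return sorted(xs)[-1]"
b = "def mx(xs):\n    return max(xs)"

# Ask in both orders so the verdict is not driven by candidate position.
votes_for_b = (judge(task, a, b) == "B") + (judge(task, b, a) == "A")
print("B preferred in", votes_for_b, "of 2 orderings")
```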

🧪 Training

Environment

```bash
git clone https://github.com/axolotl-ai-cloud/axolotl.git axolotl-dep
cd axolotl-dep

pip install torch==2.3.0
pip install packaging ninja wandb
pip install -e '.[flash-attn,deepspeed]'
```

Use existing dataset

```bash
python scripts/axolotl/prepare_data.py \
  --decomposed-dataset datasets/train/editpackft-Llama-3-70B-Instruct.commit_instruct.decompose.jsonl \
  --judge-type classification --both-order
python scripts/axolotl/prepare_data.py \
  --decomposed-dataset datasets/train/Llama-3-8B-Instruct-SOSS.teacher.Llama-3-70B-Instruct.critic_evol.decompose.jsonl \
  --judge-type classification --both-order
```
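
As a rough mental model of what `--judge-type classification --both-order` implies, each preference pair can be expanded into two classification examples with the candidates presented in both orders, so the judge cannot rely on positional cues. The record layout below is an assumption for illustration; the actual output of `prepare_data.py` may differ.

```python
# Illustrative sketch only: expand one preference record into two classification
# examples, one per candidate order. Field names are assumptions, not the real format.

def to_classification_examples(record: dict) -> list[dict]:
    chosen, rejected = record["chosen"], record["rejected"]
    return [
        {"instruction": record["instruction"], "candidate_a": chosen, "candidate_b": rejected, "label": "A"},
        {"instruction": record["instruction"], "candidate_a": rejected, "candidate_b": chosen, "label": "B"},
    ]
```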

Train models using Axolotl

```bash
accelerate launch -m axolotl.cli.train \
  scripts/axolotl/recipe/gemma/cls-commit-instruct-from-llama3-70b.yaml \
  --deepspeed scripts/axolotl/zero3.json

# or use torchrun if your accelerate is complaining
torchrun --nproc_per_node 8 -m axolotl.cli.train \
  scripts/axolotl/recipe/gemma/cls-commit-instruct-from-llama3-70b.yaml \
  --deepspeed scripts/axolotl/zero3.json
```

🔮 Synthetic Data Generation

Commit-Instruct from Scratch

```bash
# Supports the OpenAI and Bedrock interfaces

# OpenAI interface
python codefavor/prompt/commit_instruct.py --model-id "deepseek-chat" --model-type "openai" --concurrency 256 --dataset editpackft --model-url "https://api.deepseek.com/v1"

# Bedrock interface
python codefavor/prompt/commit_instruct.py --model-id "meta.llama3-1-405b-instruct-v1:0" --model-type "bedrock" --concurrency 10 --dataset editpackft
```

Critic-Evol from Scratch

```bash
python codefavor/prompt/critic_evol.py --weak-dataset ./datasets/train/Llama-3-8B-Instruct-SOSS.jsonl \
  --model-id "deepseek-coder" --model-url "https://api.deepseek.com/v1"
python codefavor/prompt/critic_evol.py --weak-dataset ./datasets/train/Llama-3-8B-Instruct-SOSS.jsonl \
  --model-id "meta.llama3-1-405b-instruct-v1:0" --concurrency 10
```
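
Conceptually, Critic-Evol asks a stronger model to critique and revise a weaker model's solution, then keeps the revision and the original as a (chosen, rejected) pair. The sketch below illustrates that loop with the OpenAI Python client; the prompt and output handling are assumptions, and `codefavor/prompt/critic_evol.py` is the authoritative implementation.

```python
# Illustrative sketch of the Critic-Evol idea (not the repo's exact prompts or schema).
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com/v1", api_key=os.environ["DEEPSEEK_API_KEY"])

def critic_evol(task: str, weak_solution: str, model: str = "deepseek-coder") -> dict:
    """Ask a stronger critic model to critique and revise a weak solution,
    then keep (revision, original) as a (chosen, rejected) preference pair."""
    prompt = (
        f"Task:\n{task}\n\nCandidate solution:\n{weak_solution}\n\n"
        "Critique the candidate briefly, then provide a corrected implementation."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    revised = resp.choices[0].message.content  # a real pipeline would extract just the code
    return {"instruction": task, "chosen": revised, "rejected": weak_solution}
```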

  • Pairwise training code is partially adapted from https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/pair-pm

📜 Citation

```bibtex
@article{codefavor,
  title   = {Learning Code Preference via Synthetic Evolution},
  author  = {Liu, Jiawei and Nguyen, Thanh and Shang, Mingyue and Ding, Hantian and Li, Xiaopeng and Yu, Yu and Kumar, Varun and Wang, Zijian},
  journal = {arXiv preprint arXiv:2410.03837},
  year    = {2024},
}
```

🙏 Acknowledgement

🎓 Research Use Only

This source code is released solely for academic and scientific reproducibility, in support of the methods and findings described in the associated publication. Pull requests are not accepted, in order to keep the code exactly as it was used in the paper; interested parties are encouraged to open an issue requesting open-source community development.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 32
  • Member event: 1
  • Push event: 8
  • Public event: 1
  • Fork event: 1
Last Year
  • Watch event: 32
  • Member event: 1
  • Push event: 8
  • Public event: 1
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 10
  • Total Committers: 5
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.4
Past Year
  • Commits: 10
  • Committers: 5
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.4
Top Committers
Name Email Commits
Thanh Nguyen m****h@a****m 6
Zijian Wang 2****g 1
Thanh Nguyen f****l@i****m 1
Jiawei Liu j****u@g****m 1
Amazon GitHub Automation 5****o 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels