https://github.com/amazon-science/buggy-aware-codelm


Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: apache-2.0
Language: Python
Default Branch: main
Size: 478 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed almost 2 years ago


Fine-tuning Language Models for Joint Rewriting and Completion of Code with Potential Bugs

This is the experiment code for our ACL 2024 (Findings) paper "Fine-tuning Language Models for Joint Rewriting and Completion of Code with Potential Bugs".

Our constructed testing datasets are located in Benchmarks.

Buggy Dataset Generation

We provide a configuration file, meta.json, in which you specify the path to the original data source and the bug injection methods you want to use:

```json
{
  "datapath": {
    "codecontests": "/path/to/Buggy-Aware-CodeLM/datasets/raw/codecontests.jsonl"
  },
  "method": [
    "OperatorChangeNodeVisitor",
    "NumericValueChangeNodeVisitor",
    "VariableRenamingNodeVisitor",
    "KeywordRemovalTransformer",
    "ConditionRemovalNodeVisitor",
    "BranchRemovalNodeVisitor",
    "WhileToIfTransformer"
  ],
  "savepath": "/path/to/Buggy-Aware-CodeLM/datasets/buggy"
}
```
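The method names in the configuration suggest transformations over a Python syntax tree. As a rough illustration only, and not the repository's implementation, the sketch below shows how an operator-change bug could be injected with the standard `ast` module. The class `SwapAddSub` and the function `inject_operator_bug` are hypothetical names introduced for this example, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class SwapAddSub(ast.NodeTransformer):
    """Hypothetical example: flip the first '+' to '-' (or '-' to '+')."""

    def __init__(self) -> None:
        self.changed = False

    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        # Visit child expressions first so nested operators are reached.
        self.generic_visit(node)
        if not self.changed:
            if isinstance(node.op, ast.Add):
                node.op = ast.Sub()
                self.changed = True
            elif isinstance(node.op, ast.Sub):
                node.op = ast.Add()
                self.changed = True
        return node


def inject_operator_bug(source: str) -> str:
    """Parse source code, change one arithmetic operator, and unparse it."""
    tree = ast.parse(source)
    tree = SwapAddSub().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


clean = "def total(a, b):\n    return a + b\n"
buggy = inject_operator_bug(clean)
print(buggy)
```

Changing a single operator keeps the program syntactically valid while altering its behavior, which is the kind of "potential bug" a completion model would then have to rewrite.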

Then, you can run the following command to generate the buggy datasets, which will be saved to the savepath specified in meta.json.

```bash
python generate.py
```

To prepare the dataset for the second phase of training, first fine-tune a model on the training dataset constructed above, then run inference with that fine-tuned model. Assuming the inference results are saved in a file named result_1.jsonl, you can generate the new training dataset by executing the following command:

```bash
python truncate.py \
    --original_file /path/to/Buggy-Aware-CodeLM/datasets/benchmarks/s_humaneval.jsonl \
    --completion_file infill_line_completion_100_0-1894.jsonl \
    --prefix iteration1 \
    --save_path D_1.jsonl
```

Afterwards, you can run inference again with the fine-tuned model on the newly constructed dataset saved at D_1.jsonl (the --save_path above). This yields additional training data.

......

Finally, you can combine all newly-constructed datasets and continue with the process of fine-tuning the model.
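The final combining step can be sketched as a plain concatenation of JSONL files. This is a minimal sketch under assumptions: the repository may combine datasets differently, and the function `merge_jsonl`, the file name `D_2.jsonl`, and the record fields below are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def merge_jsonl(input_paths: Iterable[Path], output_path: Path) -> int:
    """Concatenate JSONL files into one file and return the record count."""
    count = 0
    with output_path.open("w", encoding="utf-8") as out:
        for path in input_paths:
            with path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    json.loads(line)  # fail early on malformed records
                    out.write(line + "\n")
                    count += 1
    return count


# Demo with throwaway files; the names and fields are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    d1 = tmp_dir / "D_1.jsonl"
    d2 = tmp_dir / "D_2.jsonl"
    d1.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    d2.write_text('{"id": 3}\n', encoding="utf-8")
    combined = tmp_dir / "combined.jsonl"
    total = merge_jsonl([d1, d2], combined)
    merged_lines = combined.read_text(encoding="utf-8").splitlines()
```

Validating each line while copying means a corrupted inference output is caught before it reaches training.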

Fine-tuning with DeepSpeed

```bash
deepspeed train_deepspeed.py \
    --checkpoint <base_model_path or the Hugging Face model name> \
    --deepspeed_checkpoint_dir <deepspeed_save_path_dir> \
    --train_files <training_datafile_path>
```

The --train_files argument supports passing multiple training file paths. More argument choices can be found in train_deepspeed.py and ds_config.json.

Inference with multiple GPUs

We provide a bash script that makes full use of the available GPUs to speed up inference. The main idea is to split the inference data equally across all available GPUs and then combine the inference results. The script takes the following positional arguments:

1. baseline or finetune
2. the path to the dataset to be evaluated
3. the checkpoint path
4. the directory for storing the results
5. the number of samples to generate for each instance
6. the number of GPUs to use
7. the dataset name, e.g. buggy_humaneval, buggy_fixeval
8. the sep token
9. whether to use the header

You can run the following command, supplying the positional arguments in the order above:

```bash
bash inference.sh \
    finetune \
    /path/to/Buggy-Aware-CodeLM/datasets/benchmarks/fixeval_large_instances.jsonl \
    /path/to/Buggy-Aware-CodeLM/checkpoints/hugginface_checkpoints/codegen-350M-mono \
    /path/to/Buggy-Aware-CodeLM/results/finetune/codegen-350M-mono \
    1 \
    4 \
    buggy_fixeval \
    '[SEP]' \
    1
```
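The split-and-combine idea behind the script can be sketched in Python as follows. This is an illustration under assumptions, not the logic of inference.sh itself: the helper names `split_evenly` and `merge_results` and the `task_id` field are hypothetical, and the per-GPU inference step is omitted.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split items into contiguous shards whose sizes differ by at most one."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    base, extra = divmod(len(items), num_shards)
    shards: List[List[T]] = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional item each.
        size = base + (1 if i < extra else 0)
        shards.append(list(items[start:start + size]))
        start += size
    return shards


def merge_results(shard_results: Sequence[Sequence[T]]) -> List[T]:
    """Concatenate per-shard results back into the original order."""
    merged: List[T] = []
    for results in shard_results:
        merged.extend(results)
    return merged


# Ten placeholder instances spread over four GPUs.
instances = [{"task_id": i} for i in range(10)]
shards = split_evenly(instances, 4)
shard_sizes = [len(s) for s in shards]
merged = merge_results(shards)
```

Contiguous shards make the merge a simple concatenation that preserves the original instance order, so results line up with the input file.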

License

The code in this package is subject to the Apache-2.0 License.

The testing datasets in this repo are subject to different licenses:

buggy-HumanEval files (datasets/benchmarks/b_humaneval*.jsonl) are released under the MIT License.
buggy-MBPP files (datasets/benchmarks/b_mbpp*.jsonl) are released under the CC-BY-4.0 license.

Citation

You are more than welcome to cite our paper:

```bibtex
@inproceedings{wang2024fine,
  title={Fine-tuning Language Models for Joint Rewriting and Completion of Code with Potential Bugs},
  author={Wang, Dingmin and Zhao, Jinman and Pei, Hengzhi and Tan, Samson and Zha, Sheng},
  booktitle={Findings of the Association for Computational Linguistics ACL 2024},
  pages={15854--15868},
  year={2024}
}
```

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science


Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Dependencies

generator/meta.json cpan

requirements.txt pypi

datasets *
deepspeed *
fire *
mpi4py *
transformers *
wandb *
