https://github.com/breandan/bifi

[ICML 2021] Break-It-Fix-It: Unsupervised Learning for Program Repair

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
○
codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (13.0%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

[ICML 2021] Break-It-Fix-It: Unsupervised Learning for Program Repair

Basic Info

Host: GitHub
Owner: breandan
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 18.9 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Fork of michiyasunaga/BIFI

Created over 2 years ago · Last pushed 12 months ago

https://github.com/breandan/BIFI/blob/main/

# Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data This repo provides the source code & data of our paper: [Break-It-Fix-It: Unsupervised Learning for Program Repair](http://arxiv.org/abs/2106.06600) (ICML 2021). ```bib @InProceedings{yasunaga2021break, author = {Michihiro Yasunaga and Percy Liang}, title = {Break-It-Fix-It: Unsupervised Learning for Program Repair}, year = {2021}, booktitle = {International Conference on Machine Learning (ICML)}, } ``` **Problem: Repair Task**

**Our approach: BIFI**

## 0. Dependencies Run the following commands to create a conda environment (assuming CUDA10.1): ```bash conda create -n BIFI python=3.11 conda activate BIFI pip install tqdm pip install torch==2.2.0 torchvision rm -rf utils/fairseq cd utils git clone https://github.com/VarunGumma/fairseq.git cd fairseq pip install -e . pip install numpy editdistance ``` Alternatively, you can use the Dockerfile in the `docker` folder of this repo to set up the environment. ## 1. Download Data Download all the data from [here (`data.zip`)](https://nlp.stanford.edu/projects/myasu/BIFI/data.zip) and unzip it (note: 67GB when compressed, 400GB when decompressed). This includes the GitHub-Python dataset, and all the processed training data and trained models associated with BIFI. If you only want the original GitHub-Python dataset, you can download it from [here (`data_minimal.zip`; 1GB)](https://nlp.stanford.edu/projects/myasu/BIFI/data_minimal.zip). After unzipping the `data.zip`, the resulting file structure will look like: ```plain . README.md data/ orig_bad_code/ (GitHub-Python dataset's bad code) orig_good_code/ (GitHub-Python dataset's good code) round0/ data_paired (paired data used to train fixer in round0) model-fixer (fixer trained in round0) round1-BIFI-part1/ data_paired (paired data used to train breaker in BIFI round1) model-breaker (breaker trained in BIFI round1) round1-BIFI-part2/ data_paired (paired data used to train fixer in BIFI round1) model-fixer (fixer trained in BIFI round1) ... ``` ### About the GitHub-Python dataset We collected 3 million Python3 snippets from GitHub. Using the critic (Python AST parser), the code snippets are split into a set of bad code (with AST parse errors) and a set of good code (with no errors). The set of bad code is located at `data/orig_bad_code/orig.bad.json` and good code at `data/orig_good_code/orig.good.json`. Each entry of `orig.bad.json` or `orig.good.json` is a dictionary consisting of - **"code_string"**: raw code in the string format - **"code_toks_joined"**: the raw code is split into tokens by Python tokenizer, anonymized (string/number is replaced with special tokens ``/``), and then joined by whitespace. The tokenization was done by `utils/code_utils.py: tokenize_python_code()` - **"anonymize_dict"**: mapping betweens raw string/number and ``/`` so that "code_string" can be recovered from "code_toks_joined". This recovery can be done by `utils/code_utils.py: code_toks_to_code_string()` - **"err_obj"**: type of the error caught by the critic (e.g. unbalanced parentheses, indentation error). This is only applicable to `orig.bad.json`. The bad code snippets in `orig.bad.json` are split into 5 chunks (`orig.0.bad` to `orig.4.bad` in `data/orig_bad_code/`), where 3,4 is heldout as the test set and 0,1,2 is made available for BIFI training. This splitting was done by `scripts/split_orig_bad_and_good.py` ## 2. Training and Evaluation First, train the initial fixer by running commands in `src/run-round0.py` one by one. We then consider three training algorithms on top of it: **BIFI** (our proposed method), **FixerOnly** (BIFI without breaker), and **BackTranslation** (BT; our baseline). For each algorithm, - **BIFI**: run commands in `src/run-BIFI.py` one by one - **FixerOnly**: run commands in `src/run-FixerOnly.py` one by one - **BT**: run commands in `src/run-BT.py` one by one Below is an illustration for the case of BIFI. **run-round0.sh** ```bash export PYTHONPATH=. #Train initial fixer on synthetic paired data python src/c001__train_fixer.py --round_name round0 --gpu_id 0 --max_epoch 2 #Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic python src/c003__run_fixer.py --round_name round0 --gpu_ids '0,1,2,3,4' #Evaluate the fixer outputs on the test set (chunk 3,4) python src/c005__eval_fixer.py --round_name round0 ``` **run-BIFI.sh** (round 1) ```bash #Use the fixer outputs on the bad code (chunk 0,1,2) to get new paired data (Equation 6 in the paper) python src/c006__generate_paired_data_from_fixer.py --round_name round0 --out_round_name round1-BIFI-part1 #Train breaker on the new paired data (Equation 7 in the paper) python src/c002__train_breaker.py --round_name round1-BIFI-part1 --gpu_id 0 --max_epoch 3 #Run the trained breaker on the good code and get new paired data (Equation 8 in the paper) python src/c004__run_breaker.py --round_name round1-BIFI-part1 --gpu_ids '0,1,2,3,4' python src/c007__generate_paired_data_from_breaker.py --round_name round1-BIFI-part1 --out_round_name round1-BIFI-part2 #Train fixer on the new paired data (Equation 9 in the paper) python src/c001__train_fixer.py --round_name round1-BIFI-part2 --gpu_id 0 --max_epoch 2 --continue_from 'data/round0/model-fixer/checkpoint.pt' #Run the trained fixer on the bad code (chunk 0-4) and check the outputs by critic python src/c003__run_fixer.py --round_name round1-BIFI-part2 --gpu_ids '0,1,2,3,4' #Evaluate the fixer outputs on the test set (chunk 3,4) python src/c005__eval_fixer.py --round_name round1-BIFI-part2 ``` This is repeated similarly for round 2. # Evaluation Command ```bash fairseq-interactive data/round2-BIFI-part2/orig_bad/fairseq_preprocess__orig_bad.0 \ --path data/round2-BIFI-part2/model-fixer/checkpoint.pt \ --beam 10 ```

Owner

Name: breandan
Login: breandan
Kind: user

Website: http://brea.ndan.co
Twitter: breandan
Repositories: 185
Profile: https://github.com/breandan

GitHub Events

Total

Push event: 1

Last Year

Push event: 1

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/breandan/bifi

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/breandan/BIFI/blob/main/

Owner

GitHub Events

Total

Last Year