https://github.com/amazon-science/repoformer

Repoformer: Selective Retrieval for Repository-Level Code Completion (ICML 2024)

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://repoformer.github.io
Size: 106 MB

Statistics

Stars: 55
Watchers: 3
Forks: 10
Open Issues: 7
Releases: 0

Created about 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License Code of conduct

Repoformer

This repository contains the data and inference code of the ICML 2024 paper "Repoformer: Selective Retrieval for Repository-Level Code Completion."

Work done by Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, Xiaofei Ma.

Requirements

Install all dependencies: pip install -r requirements.txt
Build tree sitter: bash scripts/build_treesitter.sh
Prepare RepoEval data:
- cd repo_eval/data
- bash download.sh
- bash prepare.sh
- cd ../cfc_retrieval
- bash run.sh (Use Jaccard similarity by default. Uncomment the other lines to use other retrievers.)
Prepare CrossCodeEval data:
- cd cceval/data
- bash prepare_data.sh
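
For readers unfamiliar with the default retriever, the sketch below shows one common way to rank candidate code chunks by Jaccard similarity over token sets. It is an illustration only: the regex tokenizer and the names `tokenize`, `jaccard_similarity`, and `rank_chunks` are assumptions made for this example and are not taken from the scripts in `cfc_retrieval`.

```python
import re


def tokenize(code):
    """Split code into a set of tokens (illustrative tokenizer, an assumption)."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| over the token sets of two code strings."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty inputs
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def rank_chunks(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query, best first."""
    ranked = sorted(chunks, key=lambda c: jaccard_similarity(query, c), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    candidates = ["def add(a, b)", "import os"]
    print(rank_chunks("def add(x, y)", candidates, top_k=1))
```

Set-based similarity ignores token order and frequency, which keeps it cheap to compute; the other retrievers mentioned in `run.sh` trade that simplicity for richer matching.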

Training data creation

We start from preprocessed repositories from The Stack. To reproduce our data creation strategy, you can prepare the repositories in the following format:

    {
        "repo_name": "...",
        "stars_count": 100,
        "files": [
            {
                "filepath": "...",
                "content": "",
                "metadata": {
                    "size": 100,
                    "lang": "Python",
                    "ext": "py",
                    "hexsha": "...",
                    "avg_line_length": 22,
                    "max_line_length": 47,
                    "line_count": 19,
                    "non_empty_line_count": 16,
                    "imports": [
                        "..." # all import names
                    ],
                    "local_imports": [
                        "..." # local import names
                    ]
                }
            },
            ... # one entry per file
        ],
        "repo_size": {
            "number_of_files": 30,
            "lines_of_code": 900
        }
    }
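
As a rough illustration of the format above, the following sketch builds one repository record from in-memory file contents. The helper names (`make_file_entry`, `make_repo_record`) are hypothetical, and the way each metadata field is computed here (for example, `size` as the UTF-8 byte length) is an assumption; the original preprocessing may define these fields differently. The `hexsha` value is left as a placeholder.

```python
import json


def make_file_entry(filepath, content, imports=(), local_imports=()):
    """Build one entry of the "files" list (field computations are assumptions)."""
    lines = content.splitlines()
    lengths = [len(line) for line in lines]
    return {
        "filepath": filepath,
        "content": content,
        "metadata": {
            "size": len(content.encode("utf-8")),
            "lang": "Python",
            "ext": "py",
            "hexsha": "...",  # placeholder, not computed in this sketch
            "avg_line_length": sum(lengths) / len(lengths) if lengths else 0,
            "max_line_length": max(lengths, default=0),
            "line_count": len(lines),
            "non_empty_line_count": sum(1 for line in lines if line.strip()),
            "imports": list(imports),
            "local_imports": list(local_imports),
        },
    }


def make_repo_record(repo_name, stars_count, files):
    """Wrap file entries into one repository record."""
    return {
        "repo_name": repo_name,
        "stars_count": stars_count,
        "files": files,
        "repo_size": {
            "number_of_files": len(files),
            "lines_of_code": sum(f["metadata"]["line_count"] for f in files),
        },
    }


if __name__ == "__main__":
    entry = make_file_entry("pkg/main.py", "import os\n\nprint(1)\n", imports=["os"])
    record = make_repo_record("example/repo", 100, [entry])
    # Assumption: a .jsonl file holds one JSON record per line.
    print(json.dumps(record))
```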

Starting from the raw file raw.jsonl, the data sampling algorithm contains three steps: blank sampling, RAG simulation, and data merging.

Step 1: blank sampling

```
cd finetuning/data_creation/

# for creating chunk completion data
python 1_create_chunk.py --lang [python/java/csharp/javascript] --input_json raw.jsonl --poisson_lambda 3.0 --num_processes 20 --cluster_ratio 0.1 --shard_size 500 [--oracle_in_query]

# for creating function completion data
python 1_create_function.py --lang [python/java/csharp/javascript] --input_json raw.jsonl --poisson_lambda 3.0 --num_processes 20 --cluster_ratio 0.1 --shard_size 500 [--oracle_in_query]
```

Note that the `--oracle_in_query` flag uses the target line for retrieving the relevant contexts. In the paper, half of the data is created with `--oracle_in_query` and half is created without it. We output data in shards to make downstream processing easier. The `--shard_size` parameter controls the size of each shard.
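
To make the sharding idea concrete, here is a minimal sketch of splitting a list of samples into fixed-size shards and writing each shard to its own folder. The function names, the `chunk_shard<idx>` folder naming, and the one-JSON-object-per-line layout are assumptions for illustration; the actual scripts may organize their output differently.

```python
import json
import os
import tempfile


def split_into_shards(samples, shard_size):
    """Split samples into consecutive shards of at most shard_size items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [samples[i:i + shard_size] for i in range(0, len(samples), shard_size)]


def write_shards(samples, shard_size, out_dir, prefix="chunk_shard"):
    """Write each shard to <out_dir>/<prefix><idx>/sample_for_completion.jsonl."""
    paths = []
    for idx, shard in enumerate(split_into_shards(samples, shard_size)):
        shard_dir = os.path.join(out_dir, f"{prefix}{idx}")
        os.makedirs(shard_dir, exist_ok=True)
        path = os.path.join(shard_dir, "sample_for_completion.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for sample in shard:
                f.write(json.dumps(sample) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    data = [{"id": i} for i in range(1200)]
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_shards(data, 500, tmp):
            print(p)
```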

Step 2: RAG simulation

Suppose the previous step's outputs are named as chunk_shardx/sample_for_completion.jsonl and function_shardx/sample_for_completion.jsonl. To obtain the label for Repoformer, we run inference twice: once with the retrieved context and once without.

```
cd finetuning/data_creation/2_labeling

# for labeling the chunk completion data
bash run_chunk_lrcontext.sh starcoderbase-1b chunk_shardx/
bash run_chunk_rcfcl_rg1.sh starcoderbase-1b chunk_shardx/

# for labeling the function completion data
bash run_function_lrcontext.sh starcoderbase-1b function_shardx/
bash run_function_rcfcl_rg1.sh starcoderbase-1b function_shardx/
```

Step 3: data merging

After step 2, the model outputs and scores should be stored in chunk_shardx/logs and function_shardx/logs. You can run the following command to get the final data.

```
# For generating the final chunk completion data. Function completion data is similar.
python 3_generate_labelled_data.py \
    --raw_file chunk_shardx/sample_for_completion.jsonl \
    --baseline_scores_file chunk_shardx/logs/lrcontext/starcoderbase-1b/detailed_results.json \
    --rg1_scores_file chunk_shardx/logs/rcfcl_rg1/sparse/starcoderbase-1b/detailed_results.json \
    --output_file chunk_shardx/data_labelled.jsonl \
    --generation_model starcoderbase-1b
```
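
Conceptually, the merging step compares the score of each sample with and without retrieved context and turns the difference into a label. The sketch below shows one plausible form of that comparison. It is not the code of the merging script: the field name `task_id`, the output field `retrieval_helps`, the per-sample score dictionaries, and the `threshold` parameter are all assumptions for illustration.

```python
def label_samples(raw_samples, baseline_scores, rg1_scores, threshold=0.0):
    """Attach a retrieval label to each sample that was scored in both runs.

    baseline_scores / rg1_scores map a (hypothetical) task id to the score
    obtained without / with retrieved cross-file context.
    """
    labelled = []
    for sample in raw_samples:
        task_id = sample["task_id"]
        if task_id not in baseline_scores or task_id not in rg1_scores:
            continue  # skip samples missing from either inference run
        gain = rg1_scores[task_id] - baseline_scores[task_id]
        # Assumption: retrieval counts as helpful when the gain exceeds threshold.
        labelled.append({**sample, "retrieval_helps": gain > threshold})
    return labelled


if __name__ == "__main__":
    raw = [{"task_id": "a"}, {"task_id": "b"}]
    without_retrieval = {"a": 0.5, "b": 0.9}
    with_retrieval = {"a": 0.8, "b": 0.9}
    print(label_samples(raw, without_retrieval, with_retrieval))
```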

Training Repoformer

Our training code is based on ContraCLM.

Step 1: tokenization

We tokenize the data into Arrow-format datasets. To run the code, move the files from step 3 of the data creation into a separate folder and provide its path in finetuning/preprocess/run_preprocess_repoformer_cfcinrc.sh. Then, run the following commands:
- cd finetuning/preprocess
- bash run_preprocess_repoformer_cfcinrc.sh

Note that in this repo, <end_rc> corresponds to the <eof> token in the paper, and <cfc_info> corresponds to <cc>. These are the only two special tokens Repoformer needs to add.
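
When reading the paper next to this code, the token correspondence can be kept straight with a small lookup. The snippet below is only a reading aid: the names `PAPER_TO_REPO_TOKENS` and `to_repo_notation` are hypothetical and do not exist in the repository.

```python
# Paper notation -> token string used in this repo (taken from the note above).
PAPER_TO_REPO_TOKENS = {
    "<eof>": "<end_rc>",
    "<cc>": "<cfc_info>",
}


def to_repo_notation(text):
    """Rewrite a string that uses the paper's token names into the repo's names."""
    for paper_token, repo_token in PAPER_TO_REPO_TOKENS.items():
        text = text.replace(paper_token, repo_token)
    return text


if __name__ == "__main__":
    print(to_repo_notation("right context<eof><cc>"))
```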

Step 2: running training

Before running the script, make sure to update finetuning/runscripts/run_repoformer_final_setting.sh with your preprocessed data path. Then run:
- cd finetuning/
- bash runscripts/run_repoformer_final_setting.sh

Step 3: checkpoint postprocessing

After training, the DeepSpeed checkpoint will be stored in the last.ckpt folder. You can get the checkpoint in Hugging Face format with the following steps:
- cd /path/to/last.ckpt/
- python zero_to_fp32.py . pytorch_model.bin.original
- Update finetuning/evaluation/process_checkpoint_state_dict.py to point to the StarCoder model with the correct size.
- python finetuning/evaluation/process_checkpoint_state_dict.py /path/to/last.ckpt/

Evaluation

Datasets

We release the newly created CrossCodeLongEval benchmark under the folder crosscodelongeval. You can run process_data.sh to preprocess the data. In addition, we release the code to download and use RepoEval and CrossCodeEval in the folders repo_eval and cceval.

Baselines

To get the results of the baselines with or without repository-level retrieval, we recommend using run_fim_hf.sh in the repo_eval and cceval folders. Sample command: bash run_fim_hf.sh model exp retriever
- model: We support starcoderbase-1b/3b/7b and starcoder. You can evaluate other code LMs by changing the model name. Note that if the LM does not perform fill-in-the-middle generation, the --use_fim_prompt flag needs to be dropped.
- exp: the prompting strategy. There are four possible settings; lrcontext and rcfcl_rg1 are the two settings used in the Repoformer paper.
  - baseline: left context only.
  - lrcontext: left context + right context.
  - rg1: left context + retrieved cross-file context.
  - rcfcl_rg1: left context + right context + retrieved cross-file context.
- retriever:
  - For RepoEval, we support sparse (Jaccard similarity) and unixcoder.
  - For CCEval and CrossCodeLongEval, we support bm25, openai_cosine_sim, and unixcoder_cosine_sim.
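
The four prompting strategies differ only in which pieces of context reach the model. The sketch below assembles a prompt for each setting using StarCoder's fill-in-the-middle sentinel tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`). It is a simplified illustration, not the logic of run_fim_hf.sh: the function `build_prompt`, the choice to prepend the cross-file context to the left context, and the fallback of dropping the right context when FIM is disabled are all assumptions.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

SETTINGS = ("baseline", "lrcontext", "rg1", "rcfcl_rg1")


def build_prompt(exp, left, right="", cfc="", use_fim_prompt=True):
    """Assemble a completion prompt for one of the four settings.

    left / right: code before / after the blank; cfc: retrieved cross-file context.
    """
    if exp not in SETTINGS:
        raise ValueError(f"unknown setting: {exp}")
    use_right = exp in ("lrcontext", "rcfcl_rg1")
    use_cfc = exp in ("rg1", "rcfcl_rg1")

    # Assumption: retrieved context is simply placed before the left context.
    prefix = (cfc + "\n" if use_cfc and cfc else "") + left

    if use_right and use_fim_prompt:
        # StarCoder-style FIM: the model generates the middle after <fim_middle>.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{right}{FIM_MIDDLE}"
    # Assumption: without FIM, only left-to-right completion is possible.
    return prefix


if __name__ == "__main__":
    for setting in SETTINGS:
        print(setting, repr(build_prompt(setting, "a = ", "\nprint(a)", "# ctx")))
```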

We also support vllm for inference. Running vllm requires torch 2.x, whereas requirements.txt pins torch <2. The other requirements are the same as in requirements.txt.

Repoformer

After converting the checkpoint, you can run the evaluation directly using the following commands.
```
cd finetuning/evaluation

# evaluate on RepoEval
bash run_repoeval.sh

# evaluate on CCEval
bash run_cceval.sh
```

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

GitHub Events

Total

Issues event: 2
Watch event: 25
Member event: 1
Issue comment event: 4
Push event: 4
Pull request event: 4
Fork event: 5

Last Year

Issues event: 2
Watch event: 25
Member event: 1
Issue comment event: 4
Push event: 4
Pull request event: 4
Fork event: 5

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 6
Total pull requests: 6
Average time to close issues: about 4 hours
Average time to close pull requests: 3 days
Total issue authors: 6
Total pull request authors: 4
Average comments per issue: 0.83
Average comments per pull request: 0.33
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 3
Pull requests: 2
Average time to close issues: about 4 hours
Average time to close pull requests: 1 minute
Issue authors: 3
Pull request authors: 2
Average comments per issue: 0.67
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

gaotongxue12138 (1)
sxthunder (1)
jswjc555 (1)
errllxj (1)
keezen (1)
wyt2000 (1)

Pull Request Authors

xiaowu0162 (8)
dependabot[bot] (1)
kezhen1 (1)
niksukhorukov (1)

Top Labels

Issue Labels

Pull Request Labels

dependencies (1)

Dependencies

requirements.txt pypi

accelerate *
bitsandbytes *
codebleu *
datasets *
deepspeed ==0.6.7
editdistance *
fuzzywuzzy *
gputil *
jsonlines *
nltk *
numpy ==1.22.3
pytorch-lightning ==1.6.5
rank-bm25 *
sacrebleu *
scikit-learn *
sentencepiece *
tensorboard *
timeout-decorator *
torch <2
transformers ==4.28.0
tree-sitter *
