https://github.com/amazon-science/repoformer

Repoformer: Selective Retrieval for Repository-Level Code Completion (ICML 2024)

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage: https://repoformer.github.io
  • Size: 106 MB
Statistics
  • Stars: 55
  • Watchers: 3
  • Forks: 10
  • Open Issues: 7
  • Releases: 0
Created about 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme Contributing License Code of conduct

README.md

Repoformer

This repository contains the data and inference code of the ICML 2024 paper "Repoformer: Selective Retrieval for Repository-Level Code Completion."

Work done by Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, Xiaofei Ma.

Requirements

  • Install all dependencies: pip install -r requirements.txt
  • Build tree-sitter: bash scripts/build_treesitter.sh
  • Prepare RepoEval data:
    • cd repo_eval/data
    • bash download.sh
    • bash prepare.sh
    • cd ../cfc_retrieval
    • bash run.sh (Uses Jaccard similarity by default. Uncomment the other lines to use other retrievers.)
  • Prepare CrossCodeEval data:
    • cd cceval/data
    • bash prepare_data.sh
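
The retrieval step above defaults to Jaccard similarity. As a rough illustration only (this is not the code in cfc_retrieval), the sketch below ranks candidate code chunks by the Jaccard similarity between their token sets and the query's token set. The regex tokenizer and the names `tokenize`, `jaccard`, and `rank_chunks` are assumptions made for this example.

```python
import re


def tokenize(code):
    """Split code into a set of identifier-like tokens (a simplifying assumption)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", code))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_chunks(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, highest similarity first."""
    query_tokens = tokenize(query)
    scored = [(jaccard(query_tokens, tokenize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


candidates = ["def add(x, y): return x + y", "print('hello')"]
ranked = rank_chunks("def add(a, b): return a + b", candidates)
```

Here the first candidate shares `def`, `add`, and `return` with the query, so it ranks above the second, which shares no tokens.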

Training data creation

We start from preprocessed repositories from The Stack. To reproduce our data creation strategy, prepare the repositories in the following format:

    {
        "repo_name": "...",
        "stars_count": 100,
        "files": [
            {
                "filepath": "...",
                "content": "",
                "metadata": {
                    "size": 100,
                    "lang": "Python",
                    "ext": "py",
                    "hexsha": "...",
                    "avg_line_length": 22,
                    "max_line_length": 47,
                    "line_count": 19,
                    "non_empty_line_count": 16,
                    "imports": [
                        "..."  # all import names
                    ],
                    "local_imports": [
                        "..."  # local import names
                    ]
                }
            },
            ...  # one entry per file
        ],
        "repo_size": {
            "number_of_files": 30,
            "lines_of_code": 900
        }
    }
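
For readers preparing their own repositories, here is a minimal sketch of how one record in this format could be assembled for Python files. Several details are assumptions rather than facts from the repository: `size` is taken to be the UTF-8 byte length, `lines_of_code` the total line count, local imports are matched against a caller-supplied set of top-level package names, and `hexsha` is left as a placeholder. The helper names `python_imports`, `file_entry`, and `repo_record` are hypothetical.

```python
import ast
import json


def python_imports(content):
    """Return imported module names in Python source, or [] if it does not parse."""
    try:
        tree = ast.parse(content)
    except SyntaxError:
        return []
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names


def file_entry(filepath, content, local_modules=()):
    """Build one entry of the "files" list with its metadata."""
    lines = content.splitlines()
    imports = python_imports(content)
    # Assumption: an import is local if its top-level package is in local_modules.
    local = [name for name in imports if name.split(".")[0] in local_modules]
    return {
        "filepath": filepath,
        "content": content,
        "metadata": {
            "size": len(content.encode("utf-8")),  # assumption: size in bytes
            "lang": "Python",
            "ext": "py",
            "hexsha": "...",  # placeholder; the hash scheme is not specified here
            "avg_line_length": sum(len(line) for line in lines) / max(len(lines), 1),
            "max_line_length": max((len(line) for line in lines), default=0),
            "line_count": len(lines),
            "non_empty_line_count": sum(1 for line in lines if line.strip()),
            "imports": imports,
            "local_imports": local,
        },
    }


def repo_record(repo_name, stars_count, files, local_modules=()):
    """Build one repository record from a {filepath: content} mapping."""
    entries = [file_entry(path, text, local_modules) for path, text in files.items()]
    return {
        "repo_name": repo_name,
        "stars_count": stars_count,
        "files": entries,
        "repo_size": {
            "number_of_files": len(entries),
            # Assumption: lines_of_code is the total line count over all files.
            "lines_of_code": sum(e["metadata"]["line_count"] for e in entries),
        },
    }


source = "import os\nfrom pkg import b\n\nx = 1\n"
record = repo_record("demo/repo", 100, {"pkg/a.py": source}, local_modules={"pkg"})
jsonl_line = json.dumps(record)  # assumed layout: one record per line of raw.jsonl
```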

Starting from the raw file raw.jsonl, the data sampling algorithm contains three steps: blank sampling, RAG simulation, and data merging.

Step 1: blank sampling

```
cd finetuning/data_creation/

# for creating chunk completion data
python 1_create_chunk.py --lang [python/java/csharp/javascript] --input_json raw.jsonl --poisson_lambda 3.0 --num_processes 20 --cluster_ratio 0.1 --shard_size 500 [--oracle_in_query]

# for creating function completion data
python 1_create_function.py --lang [python/java/csharp/javascript] --input_json raw.jsonl --poisson_lambda 3.0 --num_processes 20 --cluster_ratio 0.1 --shard_size 500 [--oracle_in_query]
```

Note that the `--oracle_in_query` flag uses the target line for retrieving the relevant contexts. In the paper, half of the data is created with `--oracle_in_query` and half is created without it. We output data in shards to make downstream processing easier. The `--shard_size` parameter controls the size of each shard.
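
The sampling logic inside these scripts is not reproduced here. To make the idea of blank sampling concrete, the following sketch picks a contiguous span of lines as the completion target, with the span length drawn from a Poisson distribution. Treating the Poisson rate as the span length in lines is an assumption about what the lambda parameter controls, and `sample_poisson` and `sample_blank` are hypothetical names.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def sample_blank(lines, lam=3.0, rng=None):
    """Split lines into left context, a blanked-out target span, and right context."""
    if not lines:
        raise ValueError("cannot sample a blank from an empty file")
    rng = rng or random.Random()
    # Clamp the span so it covers at least one line and fits in the file.
    span = max(1, min(sample_poisson(lam, rng), len(lines)))
    start = rng.randrange(0, len(lines) - span + 1)
    return {
        "left_context": lines[:start],
        "target": lines[start:start + span],
        "right_context": lines[start + span:],
    }


example_lines = [f"line {i}" for i in range(10)]
sample = sample_blank(example_lines, lam=3.0, rng=random.Random(0))
```

Concatenating the three parts of a sample always reproduces the original file, which is a convenient sanity check when building such data.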

Step 2: RAG simulation

Suppose the previous step's outputs are named chunk_shardx/sample_for_completion.jsonl and function_shardx/sample_for_completion.jsonl. To obtain the label for Repoformer, we run inference twice: once with the retrieved context and once without.

```
cd finetuning/data_creation/2_labeling

# for labeling the chunk completion data
bash run_chunk_lrcontext.sh starcoderbase-1b chunk_shardx/
bash run_chunk_rcfcl_rg1.sh starcoderbase-1b chunk_shardx/

# for labeling the function completion data
bash run_function_lrcontext.sh starcoderbase-1b function_shardx/
bash run_function_rcfcl_rg1.sh starcoderbase-1b function_shardx/
```

Step 3: data merging

After step 2, the model outputs and scores should be stored in chunk_shardx/logs and function_shardx/logs. You can run the following command to get the final data.

```
# For generating the final chunk completion data. Function completion data is similar.
python 3_generate_labelled_data.py \
    --raw_file chunk_shardx/sample_for_completion.jsonl \
    --baseline_scores_file chunk_shardx/logs/lrcontext/starcoderbase-1b/detailed_results.json \
    --rg1_scores_file chunk_shardx/logs/rcfcl_rg1/sparse/starcoderbase-1b/detailed_results.json \
    --output_file chunk_shardx/data_labelled.jsonl \
    --generation_model starcoderbase-1b
```
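
Conceptually, the merge compares the two inference runs from step 2 for each sample. The sketch below shows one way such a comparison could produce a label: a sample is marked as benefiting from retrieval when its score with retrieved context exceeds the score without it by more than a threshold. The in-memory `{sample_id: score}` layout, the threshold default, and the name `label_samples` are assumptions for illustration; they do not describe the actual format of detailed_results.json.

```python
def label_samples(baseline_scores, rg1_scores, threshold=0.0):
    """Label each sample by whether retrieval improved its score.

    baseline_scores: {sample_id: score} from the run without retrieved context.
    rg1_scores:      {sample_id: score} from the run with retrieved context.
    """
    labelled = []
    for sample_id, baseline in baseline_scores.items():
        if sample_id not in rg1_scores:
            continue  # skip samples that are missing from one of the runs
        with_retrieval = rg1_scores[sample_id]
        labelled.append({
            "id": sample_id,
            "baseline_score": baseline,
            "rg1_score": with_retrieval,
            "retrieval_helps": (with_retrieval - baseline) > threshold,
        })
    return labelled


labels = label_samples(
    {"a": 0.5, "b": 0.9, "c": 0.2},
    {"a": 0.8, "b": 0.9},
)
```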

Training Repoformer

Our training code is based on ContraCLM.

Step 1: tokenization

We tokenize the data into arrow format datasets. To run the code, move the files from step 3 of training data creation into a separate folder and provide its path in finetuning/preprocess/run_preprocess_repoformer_cfcinrc.sh. Then, run the following commands:

    cd finetuning/preprocess
    bash run_preprocess_repoformer_cfcinrc.sh

Note that in this repo, <end_rc> corresponds to the <eof> token in the paper, and <cfc_info> corresponds to <cc>. Repoformer only needs to add these two special tokens.
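
To illustrate how two extra tokens can drive selective retrieval, the sketch below assumes that, once the right context has been closed with <end_rc>, the probability the model assigns to <cfc_info> as the next token indicates whether cross-file context is worth retrieving. This is one reading of the selective-retrieval idea, not the repository's inference code. The logits dictionary, the threshold value, and the names `softmax` and `should_retrieve` are hypothetical.

```python
import math

END_RC = "<end_rc>"      # corresponds to <eof> in the paper
CFC_INFO = "<cfc_info>"  # corresponds to <cc> in the paper


def softmax(logits):
    """Convert a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def should_retrieve(next_token_logits, threshold=0.5):
    """Decide whether to retrieve, given next-token logits after END_RC."""
    probs = softmax(next_token_logits)
    return probs.get(CFC_INFO, 0.0) > threshold


# Hypothetical logits for the position right after <end_rc>.
decision = should_retrieve({CFC_INFO: 2.0, "<fim_middle>": 0.0})
```

Raising the threshold makes retrieval rarer, trading some accuracy for lower latency.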

Step 2: running training

Before running the script, make sure to update finetuning/runscripts/run_repoformer_final_setting.sh with your preprocessed data path.

    cd finetuning/
    bash runscripts/run_repoformer_final_setting.sh

Step 3: checkpoint postprocessing

After training, the deepspeed checkpoint will be stored in the last.ckpt folder. You can get the checkpoint in huggingface format with the following steps:

  • cd /path/to/last.ckpt/
  • python zero_to_fp32.py . pytorch_model.bin.original
  • Update finetuning/evaluation/process_checkpoint_state_dict.py to point to the StarCoder model with the correct size.
  • python finetuning/evaluation/process_checkpoint_state_dict.py /path/to/last.ckpt/
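
The contents of process_checkpoint_state_dict.py are not shown here. A common reason a postprocessing step like this is needed is that checkpoints saved from a training wrapper prefix every parameter name with the wrapper's attribute name, so the keys no longer match the names the base model expects. The sketch below shows that kind of renaming on a plain dictionary. The `model.` prefix and the name `strip_prefix` are assumptions, and the real script may do more or something different.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in cleaned:
            raise ValueError(f"duplicate parameter name after renaming: {new_key}")
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for a state dict; real values would be tensors.
renamed = strip_prefix({"model.transformer.wte.weight": 1, "lm_head.weight": 2})
```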

Evaluation

Datasets

We release the newly created CrossCodeLongEval benchmark under the folder crosscodelongeval. You can run process_data.sh to preprocess the data. In addition, we release the code to download and use RepoEval and CrossCodeEval in the folders repo_eval and cceval.

Baselines

To get the results of the baselines with or without repository-level retrieval, we recommend using run_fim_hf.sh in the repo_eval and cceval folders. Sample command: bash run_fim_hf.sh model exp retriever

  • model: We support starcoderbase-1b/3b/7b and starcoder. You can evaluate other code LMs by changing the model name. Note that if the LM does not perform fill-in-the-middle generation, the --use_fim_prompt flag needs to be dropped.
  • exp: the prompting strategy. There are four possible settings; lrcontext and rcfcl_rg1 are the two settings used in the Repoformer paper.
    • baseline: left context only.
    • lrcontext: left context + right context.
    • rg1: left context + retrieved cross-file context.
    • rcfcl_rg1: left context + right context + retrieved cross-file context.
  • retriever:
    • For RepoEval, we support sparse (Jaccard similarity) and unixcoder.
    • For CCEval and CrossCodeLongEval, we support bm25, openai_cosine_sim, and unixcoder_cosine_sim.
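
The four exp settings differ only in which pieces of context reach the model. The sketch below assembles a prompt for each setting using StarCoder's fill-in-the-middle sentinel tokens. Placing the retrieved cross-file context before the left context, joined by a newline, is an assumption, as is the function name `build_prompt`. The actual layout used by run_fim_hf.sh may differ.

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

SETTINGS = {"baseline", "lrcontext", "rg1", "rcfcl_rg1"}


def build_prompt(exp, left, right="", cfc="", use_fim_prompt=True):
    """Assemble a completion prompt for one of the four prompting strategies."""
    if exp not in SETTINGS:
        raise ValueError(f"unknown exp setting: {exp}")
    use_right = exp in {"lrcontext", "rcfcl_rg1"}
    use_cfc = exp in {"rg1", "rcfcl_rg1"}
    # Assumption: retrieved cross-file context is prepended to the left context.
    prefix = (cfc + "\n" if use_cfc and cfc else "") + left
    if use_right and use_fim_prompt:
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{right}{FIM_MIDDLE}"
    # Without FIM support, a left-to-right model can only see the prefix.
    return prefix


prompts = {exp: build_prompt(exp, "a", "b", "c") for exp in sorted(SETTINGS)}
```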

We also support vllm for inference. For vllm, you need torch 2.x (note that requirements.txt pins torch <2). The other requirements are the same as in requirements.txt.

Repoformer

After converting the checkpoint, you can run the evaluation directly using the following commands.

```
cd finetuning/evaluation

# evaluate on RepoEval
bash run_repoeval.sh

# evaluate on CCEval
bash run_cceval.sh
```

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Issues event: 2
  • Watch event: 25
  • Member event: 1
  • Issue comment event: 4
  • Push event: 4
  • Pull request event: 4
  • Fork event: 5
Last Year
  • Issues event: 2
  • Watch event: 25
  • Member event: 1
  • Issue comment event: 4
  • Push event: 4
  • Pull request event: 4
  • Fork event: 5

Issues and Pull Requests

All Time
  • Total issues: 6
  • Total pull requests: 6
  • Average time to close issues: about 4 hours
  • Average time to close pull requests: 3 days
  • Total issue authors: 6
  • Total pull request authors: 4
  • Average comments per issue: 0.83
  • Average comments per pull request: 0.33
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: about 4 hours
  • Average time to close pull requests: 1 minute
  • Issue authors: 3
  • Pull request authors: 2
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • gaotongxue12138 (1)
  • sxthunder (1)
  • jswjc555 (1)
  • errllxj (1)
  • keezen (1)
  • wyt2000 (1)
Pull Request Authors
  • xiaowu0162 (8)
  • dependabot[bot] (1)
  • kezhen1 (1)
  • niksukhorukov (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1)

Dependencies

requirements.txt pypi
  • accelerate *
  • bitsandbytes *
  • codebleu *
  • datasets *
  • deepspeed ==0.6.7
  • editdistance *
  • fuzzywuzzy *
  • gputil *
  • jsonlines *
  • nltk *
  • numpy ==1.22.3
  • pytorch-lightning ==1.6.5
  • rank-bm25 *
  • sacrebleu *
  • scikit-learn *
  • sentencepiece *
  • tensorboard *
  • timeout-decorator *
  • torch <2
  • transformers ==4.28.0
  • tree-sitter *