magicoder

[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct

https://github.com/ise-uiuc/magicoder

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.2%) to scientific vocabulary

Keywords

ai4code large-language-models llm llm4code

Keywords from Contributors

agent
Last synced: 4 months ago

Repository


Statistics
  • Stars: 2,017
  • Watchers: 25
  • Forks: 164
  • Open Issues: 4
  • Releases: 0
Topics
ai4code large-language-models llm llm4code
Created about 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README-DEV.md

Implementation Details of 🎩Magicoder

[!WARNING] This documentation is still a work in progress. Please raise an issue if you find any errors.

Data collection and generation

Make sure the OPENAI_API_KEY environment variable is set (and, optionally, OPENAI_BASE_URL). Then run:

```bash
python src/magicoder/generate_data.py \
  --seed_code_start_index ${START_INDEX_OF_RAW_DATA} \
  --max_new_data ${MAX_DATA_TO_GENERATE} \
  --data_dir python \
  --tag python
```

To continue an interrupted run, use the --continue_from flag:

```bash
python src/magicoder/generate_data.py \
  --seed_code_start_index ${START_INDEX_OF_RAW_DATA} \
  --max_new_data ${MAX_DATA_TO_GENERATE} \
  --data_dir python \
  --continue_from ${PATH_TO_DATA_FILE}
```
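Before launching either command, it can help to confirm that the credentials are visible to the process. The following is a minimal sketch and not part of the repository: the helper name `check_openai_env` is hypothetical, and it assumes only that the key is supplied through the OPENAI_API_KEY environment variable, with OPENAI_BASE_URL as an optional override.

```python
import os


def check_openai_env(env=None):
    """Return (ok, message) describing whether the OpenAI settings look usable.

    Hypothetical helper: it only inspects environment variables and never
    contacts the API, so it cannot tell whether the key is actually valid.
    """
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        return False, "OPENAI_API_KEY is not set"
    base_url = env.get("OPENAI_BASE_URL")
    if base_url:
        return True, f"using custom endpoint {base_url}"
    return True, "using the default OpenAI endpoint"


if __name__ == "__main__":
    ok, message = check_openai_env()
    print(("OK: " if ok else "ERROR: ") + message)
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to test without touching the real environment.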

Data cleaning and decontamination

After data collection, clean and decontaminate the data with the following commands:

```bash
python src/magicoder/clean_data.py --data_files ${PATH_TO_DATA_FILE} --output_file ${CLEANING_OUTPUT_PATH}

python -m magicoder.decontamination.find_substrings \
  --dataset_name "json" \
  --output_file ${DECONTAM_OUTPUT_PATH} \
  --output_dir ${OUTPUT_DIR} \
  --columns problem solution \
  --data_files ${PATH_TO_DATA_FILE}
```

You probably need to run this multiple times with different data files.
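As an illustration of the general idea behind substring-based decontamination, the sketch below drops any record whose `problem` or `solution` field contains a benchmark string. It is not the repository's implementation: the function name `decontaminate`, the whitespace and case normalization, and the in-memory data layout are all assumptions made for this example. Only the two column names are taken from the command above.

```python
def normalize(text):
    """Collapse whitespace and lowercase so trivial formatting differences do not hide a match."""
    return " ".join(text.split()).lower()


def decontaminate(records, benchmark_strings, columns=("problem", "solution")):
    """Split records into (kept, removed) by substring match against benchmark strings.

    Hypothetical helper for illustration only. A record is removed when any
    normalized benchmark string occurs inside any of the selected columns.
    """
    # Ignore empty or whitespace-only needles, which would match everything.
    needles = [normalize(s) for s in benchmark_strings if s.strip()]
    kept, removed = [], []
    for record in records:
        haystacks = [normalize(record.get(col, "")) for col in columns]
        contaminated = any(n in h for n in needles for h in haystacks)
        (removed if contaminated else kept).append(record)
    return kept, removed


if __name__ == "__main__":
    data = [
        {"problem": "Reverse a string.", "solution": "return s[::-1]"},
        {"problem": "Sum a list.", "solution": "def  has_close_elements(xs): ..."},
    ]
    kept, removed = decontaminate(data, ["def has_close_elements(xs)"])
    print(len(kept), "kept;", len(removed), "removed")
```

Exact substring matching is deliberately strict. It catches copied benchmark text but not paraphrases, which is one reason to inspect the removed records as well as the kept ones.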

Data preprocessing

Before instruction tuning, let's reformat the data into instruction-response pairs:

```bash
python src/magicoder/preprocess_data.py \
  --dataset_path json \
  --data_files ${DECONTAM_OUTPUT_PATH} \
  --output_file ${PREPROCESS_OUTPUT_PATH} \
  --key src-instruct
```

After that, you can combine all the jsonl files into one.
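The merge step can be done with any tool that concatenates JSON Lines files. The snippet below is one possible sketch, not a script shipped with the repository; the function name `combine_jsonl` is hypothetical. It parses every line while copying, so a malformed record is reported immediately rather than surfacing later during training.

```python
import json


def combine_jsonl(input_paths, output_path):
    """Concatenate JSON Lines files into one file and return the number of records written.

    Each non-empty line is parsed and re-serialized, so a malformed line raises
    json.JSONDecodeError instead of being silently copied into the output.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines between records
                    record = json.loads(line)
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

A call such as `combine_jsonl(["part-0.jsonl", "part-1.jsonl"], "combined.jsonl")` would write all records to a single file; the file names here are placeholders.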

Instruction tuning

Set the environment variable CUDA_VISIBLE_DEVICES to the GPUs you want to use, then train the model with the following command to obtain Magicoder:

```bash
accelerate launch -m magicoder.train \
  --model_key $MODEL_KEY \
  --use_flash_attention True \
  --max_training_seq_length 1216 \
  --datafile_paths \
    ${PATH_TO_OSS_INSTRUCT} \
  --output_dir $MAGICODER_OUTPUT_DIR \
  --bf16 True \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 128 \
  --group_by_length False \
  --ddp_find_unused_parameters False \
  --logging_steps 1 \
  --log_level info \
  --optim adafactor \
  --max_grad_norm -1 \
  --warmup_steps 15 \
  --learning_rate 5e-5 \
  --lr_scheduler_type linear
```

To get Magicoder-S, continue the training with the following command:

```bash
accelerate launch -m magicoder.train \
  --model_key $MODEL_KEY \
  --model_name_or_path $MAGICODER_OUTPUT_DIR \
  --use_flash_attention True \
  --max_training_seq_length 1024 \
  --datafile_paths \
    ${PATH_TO_EVOL_INSTRUCT} \
  --output_dir $MAGICODER_S_OUTPUT_DIR \
  --bf16 True \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 128 \
  --group_by_length False \
  --ddp_find_unused_parameters False \
  --logging_steps 1 \
  --log_level info \
  --optim adafactor \
  --max_grad_norm -1 \
  --warmup_steps 15 \
  --learning_rate 5e-5 \
  --lr_scheduler_type linear
```
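Both training commands pair a small per-device batch with many gradient accumulation steps. Assuming the usual data-parallel setup, in which each launched process drives one GPU, the number of sequences contributing to each optimizer step is the product of the per-device batch size, the accumulation steps, and the number of processes. The helper below is a hypothetical illustration of that arithmetic, not code from the repository, and the GPU counts are example values only.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_processes):
    """Return the number of sequences consumed per optimizer step under data parallelism."""
    values = (per_device_batch_size, gradient_accumulation_steps, num_processes)
    if any(not isinstance(v, int) or v < 1 for v in values):
        raise ValueError("all arguments must be positive integers")
    return per_device_batch_size * gradient_accumulation_steps * num_processes


if __name__ == "__main__":
    # 2 and 128 are the values used in the commands above; GPU counts are examples.
    for num_gpus in (1, 2, 4, 8):
        print(f"{num_gpus} GPU(s): {effective_batch_size(2, 128, num_gpus)} sequences per step")
```

Under this assumption, changing the number of visible GPUs changes the effective batch size unless `--gradient_accumulation_steps` is adjusted to compensate.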

Owner

  • Name: iSE-UIUC
  • Login: ise-uiuc
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: Magicoder
authors:
  - family-names: Magicoder Team
url: https://github.com/ise-uiuc/magicoder
doi: https://doi.org/10.48550/arXiv.2312.02120
date-released: 2023-05-01
license: MIT
preferred-citation:
  type: article
  title: "Magicoder: Source Code Is All You Need"
  authors:
    - family-names: Wei
      given-names: Yuxiang
    - family-names: Wang
      given-names: Zhe
    - family-names: Liu
      given-names: Jiawei
    - family-names: Ding
      given-names: Yifeng
    - family-names: Zhang
      given-names: Lingming
  year: 2023
  journal: "arXiv preprint arXiv:2312.02120"
  doi: https://doi.org/10.48550/arXiv.2312.02120
  url: https://arxiv.org/abs/2312.02120

GitHub Events

Total
  • Watch event: 75
  • Issue comment event: 2
  • Push event: 1
  • Pull request event: 1
  • Fork event: 6
Last Year
  • Watch event: 75
  • Issue comment event: 2
  • Push event: 1
  • Pull request event: 1
  • Fork event: 6

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 104
  • Total Committers: 4
  • Avg Commits per committer: 26.0
  • Development Distribution Score (DDS): 0.49
Past Year
  • Commits: 5
  • Committers: 1
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name | Email | Commits
Yuxiang Wei | y****i@g****m | 53
natedingyifeng | y****6@i****u | 29
Zhe Wang | 4****1 | 15
Jiawei Liu | j****u@g****m | 7

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 71
  • Total pull requests: 4
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 4 months
  • Total issue authors: 34
  • Total pull request authors: 2
  • Average comments per issue: 1.83
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 11
  • Pull requests: 0
  • Average time to close issues: about 23 hours
  • Average time to close pull requests: N/A
  • Issue authors: 6
  • Pull request authors: 0
  • Average comments per issue: 0.82
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • yucc-leon (3)
  • shatealaboxiaowang (3)
  • VoiceBeer (2)
  • wyt2000 (2)
  • Sairampv98 (2)
  • PilgrimMay (1)
  • monaj07 (1)
  • swtheing (1)
  • younesselbrag (1)
  • BecomeAllan (1)
  • FoxxComz (1)
  • Truth-In-Lies (1)
  • mmmrvector (1)
  • riddlevv (1)
  • imoneoi (1)
Pull Request Authors
  • natedingyifeng (2)
  • chenxwh (1)
Top Labels
Issue Labels
question (10) discussion (9) documentation (3) bug (2) enhancement (1) help wanted (1)
Pull Request Labels
enhancement (1)

Dependencies

pyproject.toml pypi
  • GitPython >=3.1.40
  • datasets >=2.14.6
  • numpy >=1.26.1
  • openai >=1.2.2
  • sentence-transformers >=2.2.2
  • tiktoken >=0.5.1
  • torch >=2.1.0
  • transformers >=4.35.0