magicoder
[ICML'24] Magicoder: Empowering Code Generation with OSS-Instruct
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ise-uiuc
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://proceedings.mlr.press/v235/wei24h.html
- Size: 2.4 MB
Statistics
- Stars: 2,017
- Watchers: 25
- Forks: 164
- Open Issues: 4
- Releases: 0
Metadata Files
README-DEV.md
Implementation Details of 🎩Magicoder
[!WARNING] This documentation is still a work in progress. Please raise an issue if you find any errors.
Data collection and generation
Make sure you have set up your OPENAI_API_KEY and optionally OPENAI_BASE_URL. Then run with
```bash
python src/magicoder/generate_data.py \
  --seed_code_start_index ${START_INDEX_OF_RAW_DATA} \
  --max_new_data ${MAX_DATA_TO_GENERATE} \
  --data_dir python \
  --tag python
```
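Because the generation script depends on these environment variables, it can help to verify them before starting a long run. The following is a minimal sketch of such a preflight check; the function name `check_openai_env` is hypothetical and is not part of the Magicoder codebase.

```python
import os


def check_openai_env(env=None):
    """Report the OpenAI settings found in `env`, or raise if the key is missing.

    Hypothetical helper for illustration; not part of the Magicoder repository.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # OPENAI_BASE_URL is optional; None means no custom endpoint was configured.
    base_url = env.get("OPENAI_BASE_URL") or None
    return {"api_key_set": True, "base_url": base_url}
```

Calling `check_openai_env()` with no arguments inspects the current process environment and fails early if the key is absent.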
To continue an interrupted run, use the --continue_from flag:
```bash
python src/magicoder/generate_data.py \
  --seed_code_start_index ${START_INDEX_OF_RAW_DATA} \
  --max_new_data ${MAX_DATA_TO_GENERATE} \
  --data_dir python \
  --continue_from ${PATH_TO_DATA_FILE}
```
Data cleaning and decontamination
After the data collection, clean and decontaminate the data with the following command:
```bash
python src/magicoder/clean_data.py --data_files ${PATH_TO_DATA_FILE} --output_file ${CLEANING_OUTPUT_PATH}

python -m magicoder.decontamination.find_substrings \
  --dataset_name "json" \
  --output_file ${DECONTAM_OUTPUT_PATH} \
  --output_dir ${OUTPUT_DIR} \
  --columns problem solution \
  --data_files ${PATH_TO_DATA_FILE}
```
You probably need to run this multiple times with different data files.
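To illustrate the idea behind the decontamination step, here is a minimal sketch of substring-based filtering over the problem and solution columns. It is an assumption-laden illustration rather than the repository's implementation: the function names and the exact matching rule (plain, case-sensitive substring containment) are hypothetical.

```python
def find_contaminated(records, benchmark_texts, columns=("problem", "solution")):
    """Return indices of records whose chosen columns contain a benchmark text.

    Illustrative only: the real script may normalize or match text differently.
    """
    # Ignore empty benchmark strings, which would otherwise match every record.
    needles = [text for text in benchmark_texts if text.strip()]
    flagged = []
    for index, record in enumerate(records):
        # Check each column separately so matches never span two columns.
        if any(n in str(record.get(c, "")) for c in columns for n in needles):
            flagged.append(index)
    return flagged


def decontaminate(records, benchmark_texts, columns=("problem", "solution")):
    """Return the records that contain no benchmark text in the chosen columns."""
    bad = set(find_contaminated(records, benchmark_texts, columns))
    return [r for i, r in enumerate(records) if i not in bad]
```

A record is dropped as soon as any benchmark string appears verbatim in one of the selected columns; everything else passes through unchanged.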
Data preprocessing
Before instruction tuning, let's reformat the data into instruction-response pairs:
```bash
python src/magicoder/preprocess_data.py \
  --dataset_path json \
  --data_files ${DECONTAM_OUTPUT_PATH} \
  --output_file ${PREPROCESS_OUTPUT_PATH} \
  --key src-instruct
```
After that, you can combine all the jsonl files into one.
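One way to combine the files is a small script that concatenates them line by line. The sketch below assumes each input is a well-formed JSONL file; the helper name `merge_jsonl` is hypothetical and not part of the repository.

```python
import json


def merge_jsonl(input_paths, output_path):
    """Concatenate JSONL files into one file and return the number of rows written."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines between records
                    json.loads(line)  # raises ValueError if a line is malformed
                    out.write(line + "\n")
                    count += 1
    return count
```

Validating every line with `json.loads` costs a little time but surfaces a truncated file before training starts.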
Instruction tuning
Set the environment variable CUDA_VISIBLE_DEVICES to the GPUs you want to use, then train the model with the following command to obtain Magicoder:
```bash
accelerate launch -m magicoder.train \
  --model_key $MODEL_KEY \
  --use_flash_attention True \
  --max_training_seq_length 1216 \
  --datafile_paths \
    ${PATH_TO_OSS_INSTRUCT} \
  --output_dir $MAGICODER_OUTPUT_DIR \
  --bf16 True \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 128 \
  --group_by_length False \
  --ddp_find_unused_parameters False \
  --logging_steps 1 \
  --log_level info \
  --optim adafactor \
  --max_grad_norm -1 \
  --warmup_steps 15 \
  --learning_rate 5e-5 \
  --lr_scheduler_type linear
```
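The effective batch size of this command depends on how many GPUs CUDA_VISIBLE_DEVICES exposes. The sketch below shows the arithmetic under the usual data-parallel assumption (one process per GPU); the dataset size used in the usage note is a made-up number, and the step count is an approximation rather than the trainer's exact bookkeeping.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_gpus):
    """Examples consumed per optimizer step under data-parallel training."""
    return per_device_batch_size * grad_accum_steps * num_gpus


def approx_optimizer_steps(num_examples, num_epochs, per_device_batch_size,
                           grad_accum_steps, num_gpus):
    """Rough number of optimizer steps; the real trainer may round differently."""
    batch = effective_batch_size(per_device_batch_size, grad_accum_steps, num_gpus)
    return math.ceil(num_examples / batch) * num_epochs
```

With the flags above (batch size 2 per device, 128 accumulation steps), one GPU gives 256 examples per optimizer step and two GPUs give 512. For a hypothetical 75,000-example dataset on two GPUs, `approx_optimizer_steps(75000, 2, 2, 128, 2)` yields about 294 steps, which puts the 15 warmup steps in perspective.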
To get Magicoder-S, continue the training with the following command:
```bash
accelerate launch -m magicoder.train \
  --model_key $MODEL_KEY \
  --model_name_or_path $MAGICODER_OUTPUT_DIR \
  --use_flash_attention True \
  --max_training_seq_length 1024 \
  --datafile_paths \
    ${PATH_TO_EVOL_INSTRUCT} \
  --output_dir $MAGICODER_S_OUTPUT_DIR \
  --bf16 True \
  --num_train_epochs 2 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 128 \
  --group_by_length False \
  --ddp_find_unused_parameters False \
  --logging_steps 1 \
  --log_level info \
  --optim adafactor \
  --max_grad_norm -1 \
  --warmup_steps 15 \
  --learning_rate 5e-5 \
  --lr_scheduler_type linear
```
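Both training commands use a peak learning rate of 5e-5 with 15 warmup steps and a linear scheduler. As a rough sketch of what such a schedule looks like, the function below ramps linearly from zero to the peak during warmup and then decays linearly to zero. It assumes the common warmup-then-linear-decay shape; the function name and the total step count are hypothetical.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=5e-5, warmup_steps=15):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr over the first warmup_steps steps.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from peak_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

For example, with a hypothetical `total_steps=100`, the rate is 0 at step 0, reaches the peak at step 15, and returns to 0 at step 100.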
Owner
- Name: iSE-UIUC
- Login: ise-uiuc
- Kind: organization
- Repositories: 14
- Profile: https://github.com/ise-uiuc
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this work and love it, consider citing it as below \U0001F917"
title: Magicoder
authors:
  - family-names: Magicoder Team
url: https://github.com/ise-uiuc/magicoder
doi: https://doi.org/10.48550/arXiv.2312.02120
date-released: 2023-05-01
license: MIT
preferred-citation:
  type: article
  title: "Magicoder: Source Code Is All You Need"
  authors:
    - family-names: Wei
      given-names: Yuxiang
    - family-names: Wang
      given-names: Zhe
    - family-names: Liu
      given-names: Jiawei
    - family-names: Ding
      given-names: Yifeng
    - family-names: Zhang
      given-names: Lingming
  year: 2023
  journal: "arXiv preprint arXiv:2312.02120"
  doi: https://doi.org/10.48550/arXiv.2312.02120
  url: https://arxiv.org/abs/2312.02120
GitHub Events
Total
- Watch event: 75
- Issue comment event: 2
- Push event: 1
- Pull request event: 1
- Fork event: 6
Last Year
- Watch event: 75
- Issue comment event: 2
- Push event: 1
- Pull request event: 1
- Fork event: 6
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Yuxiang Wei | y****i@g****m | 53 |
| natedingyifeng | y****6@i****u | 29 |
| Zhe Wang | 4****1 | 15 |
| Jiawei Liu | j****u@g****m | 7 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 71
- Total pull requests: 4
- Average time to close issues: about 1 month
- Average time to close pull requests: 4 months
- Total issue authors: 34
- Total pull request authors: 2
- Average comments per issue: 1.83
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 11
- Pull requests: 0
- Average time to close issues: about 23 hours
- Average time to close pull requests: N/A
- Issue authors: 6
- Pull request authors: 0
- Average comments per issue: 0.82
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- yucc-leon (3)
- shatealaboxiaowang (3)
- VoiceBeer (2)
- wyt2000 (2)
- Sairampv98 (2)
- PilgrimMay (1)
- monaj07 (1)
- swtheing (1)
- younesselbrag (1)
- BecomeAllan (1)
- FoxxComz (1)
- Truth-In-Lies (1)
- mmmrvector (1)
- riddlevv (1)
- imoneoi (1)
Pull Request Authors
- natedingyifeng (2)
- chenxwh (1)
Dependencies
- GitPython >=3.1.40
- datasets >=2.14.6
- numpy >=1.26.1
- openai >=1.2.2
- sentence-transformers >=2.2.2
- tiktoken >=0.5.1
- torch >=2.1.0
- transformers >=4.35.0