https://github.com/cgcl-codes/naturalcc
NaturalCC: An Open-Source Toolkit for Code Intelligence
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.3%) to scientific vocabulary
Keywords
Repository
NaturalCC: An Open-Source Toolkit for Code Intelligence
Basic Info
- Host: GitHub
- Owner: CGCL-codes
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: http://xcodemind.github.io
- Size: 240 MB
Statistics
- Stars: 304
- Watchers: 11
- Forks: 56
- Open Issues: 23
- Releases: 0
Topics
Metadata Files
README.md
📖 Vision
NaturalCC is a sequence modeling toolkit designed to bridge the gap between programming and natural languages through advanced machine learning techniques. It allows researchers and developers to train custom models for a variety of software engineering tasks, e.g., code generation, code completion, code summarization, code retrieval, code clone detection, and type inference.
🌟 Key Features:
- Modular and Extensible Framework: Built on Fairseq's robust registry mechanism, allowing for easy adaptation and extension to diverse software engineering tasks.
- Datasets and Preprocessing Tools: Offers access to a variety of clean, preprocessed benchmarks such as Human-Eval, CodeSearchNet, Python-Doc, and Py150, and comes equipped with scripts for feature extraction using compiler tools like LLVM.
- Support for Large Code Models: Incorporates state-of-the-art large code models like Code Llama, CodeT5, CodeGen, and StarCoder.
- Benchmarking and Evaluation: Benchmarks multiple downstream tasks (including code generation and code completion), with evaluation capabilities on well-known benchmarks using popular metrics like pass@k.
- Optimized for Efficiency: Employs the `NCCL` library and `torch.distributed` for high-efficiency model training across multiple GPUs. Supports both full-precision (FP32) and half-precision (FP16) computations to accelerate training and inference.
- Enhanced Logging for Improved Debugging: Advanced logging features provide clear, detailed feedback during model training and operation, aiding debugging and performance optimization.
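The pass@k metric mentioned above is commonly computed with the unbiased estimator popularized by the HumanEval benchmark: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k draws is correct. A minimal sketch of that standard estimator (illustrative, not NaturalCC's actual evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, of which
    c are correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 0, 5))   # no correct samples -> 0.0
print(pass_at_k(10, 10, 1))  # all samples correct -> 1.0
```

Averaging this quantity over all problems in a benchmark yields the reported pass@k score.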
✨ Latest News
- [Nov 25, 2023] NaturalCC 2.0 Released! Now compatible with Transformers and supporting popular large code models like Code Llama, CodeT5, CodeGen, and StarCoder from Hugging Face. Access the previous version in the ncc1 branch.
- [Apr 19, 2023] Integrated the source code of "You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search" into NaturalCC.
- [May 10, 2022] Merged the source code of "What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code" into NaturalCC.
- [Jan 25, 2022] Our paper introducing the NaturalCC toolkit was accepted at the ICSE 2022 Demo Track.
🛠️ Installation Guide
To get started with NaturalCC, ensure your system meets the following requirements:
- GCC/G++ version 5.0 or higher
- NVIDIA GPU, NCCL, and the Cuda Toolkit for training new models (optional but recommended)
- NVIDIA's apex library for faster training (optional)
Follow these steps to set up the environment.
(Optional) Creating a conda environment:

```shell
conda create -n naturalcc python=3.6
conda activate naturalcc
```

Building NaturalCC from source:

```shell
git clone https://github.com/CGCL-codes/naturalcc && cd naturalcc
pip install -r requirements.txt
cd src
pip install --editable ./
```

Installing additional dependencies:

```shell
conda install conda-forge::libsndfile
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
```

HuggingFace token for certain models: for models like StarCoder, a HuggingFace token is required. Log in to HuggingFace using:

```shell
huggingface-cli login
```
🚀 Quick Start
Example 1: Code Generation
Download the model checkpoint
First, download the checkpoint of a specific large code model. For this example, we use Codellama-7B.
Prepare the testing dataset
Create a JSON file containing your test cases in the following format:

```json
[
  {"input": "this is a"},
  {"input": "from tqdm import"},
  {"input": "def calculate("},
  {"input": "a = b**2"},
  {"input": "torch.randint"},
  {"input": "x = [1,2"}
]
```

Running the code generation scripts

Initialize the task with the specific model and GPU device:

```python
print('Initializing GenerationTask')
task = GenerationTask(task_name="codellama_7b_code", device="cuda:0")
```

Load the downloaded checkpoint into the task. Replace `ckpt_path` with the path to your downloaded checkpoint:

```python
print('Loading model weights [{}]'.format(ckpt_path))
task.from_pretrained(ckpt_path)
```

Load your dataset. Replace `dataset_path` with the path to your dataset file:

```python
print('Processing dataset [{}]'.format(dataset_path))
task.load_dataset(dataset_path)
```

Run the model and output the results. Replace `output_path` with your desired output file path:

```python
task.run(output_path=output_path, batch_size=1, max_length=50)
print('Output file: {}'.format(output_path))
```
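Before running the task, it can help to sanity-check the prompt file programmatically. A small sketch that writes the expected list-of-objects format with the standard `json` module and reads it back (the file name is a placeholder; only the `{"input": ...}` shape comes from the example above):

```python
import json
import os
import tempfile

# Prompts in the format the generation example above expects:
# a JSON array of objects, each with an "input" key.
prompts = [
    {"input": "def calculate("},
    {"input": "from tqdm import"},
]

path = os.path.join(tempfile.mkdtemp(), "testcases.json")
with open(path, "w") as f:
    json.dump(prompts, f, indent=2)

# Round-trip check before passing `path` to task.load_dataset(path).
with open(path) as f:
    loaded = json.load(f)
print(len(loaded), loaded[0]["input"])
```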
Example 2: Code Summarization
Download and process a dataset from `datasets`, and follow the instructions in its README.md file.

```shell
# ref: dataset/python_wan/README.md
# download dataset
bash dataset/python_wan/download.sh
# clean data
python -m dataset.python_wan.clean
# cast data attributes into different files
python -m dataset.python_wan.attributes_cast

# ref: dataset/python_wan/summarization/README.md
# save code tokens and docstring tokens into MMAP format
python -m dataset.python_wan.summarization.preprocess
```
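The "cast data attributes" step splits each example's fields into parallel files, so that line i of every file describes the same example. A minimal sketch of that idea (toy data and file names are illustrative, not NaturalCC's actual implementation):

```python
import json
import os
import tempfile

# Toy corpus in the style of a code-summarization dataset.
examples = [
    {"code": "def add(a, b): return a + b", "docstring": "Add two numbers."},
    {"code": "def neg(x): return -x", "docstring": "Negate a value."},
]

out_dir = tempfile.mkdtemp()
# One file per attribute, one JSON-encoded value per line; downstream
# tokenizers can then stream each attribute independently.
for attr in ("code", "docstring"):
    with open(os.path.join(out_dir, f"train.{attr}"), "w") as f:
        for ex in examples:
            f.write(json.dumps(ex[attr]) + "\n")

with open(os.path.join(out_dir, "train.docstring")) as f:
    lines = f.read().splitlines()
print(len(lines))
```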
Register your self-defined models
- If you want to create a new model, please add it under `ncc/models` and `ncc/modules`.
- If your training policy is more complex than the defaults, update your criterions and training procedure under `ncc/criterions` and `ncc/trainers`, respectively. Do not forget to register your self-defined module in the corresponding `ncc/XX/__init__.py`.

Training and inference
- Select a task and a model from the task list and follow the instructions in its README.md to start your learning.

```shell
# ref: run/summarization/transformer/README.md
# train
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
# inference
CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt
```
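Fairseq-style registries, which NaturalCC's extension points are built on, boil down to a decorator that maps a string name to a class, so configs can refer to models by name. A minimal sketch of the pattern (the names here are illustrative, not NaturalCC's real API):

```python
# Registry sketch: a decorator records model classes under a name.
MODEL_REGISTRY = {}

def register_model(name):
    def wrapper(cls):
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} already registered")
        MODEL_REGISTRY[name] = cls
        return cls
    return wrapper

@register_model("toy_transformer")
class ToyTransformer:
    def __init__(self, layers: int = 2):
        self.layers = layers

# Instantiate a model from its registered name, as a trainer
# would when resolving a config entry like model: toy_transformer.
model = MODEL_REGISTRY["toy_transformer"](layers=4)
print(model.layers)
```

Registering a module in `ncc/XX/__init__.py` is what triggers the decorator at import time, which is why that step cannot be skipped.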
We also have more detailed READMEs to start your tutorial of NaturalCC.
📚 Dataset
NaturalCC supports a diverse range of datasets, catering to various aspects of code analysis and processing. These datasets include:
- Python (Wan et al.)
- CodeSearchNet (Husain et al.)
- CodeXGlue (Feng et al.)
- Py150 (official processed) (raw)
- OpenCL (Grewe et al.)
- Java (Hu et al.)
- Stack Overflow
- DeepCS (Gu et al.)
- AVATAR (Ahmad et al.)
- StackOverflow (Iyer et al.)
🤝 Contributor
We warmly welcome contributions to NaturalCC! Your involvement is essential for keeping NaturalCC innovative and accessible.
We're grateful to all our amazing contributors who have made this project what it is today!
💡 FAQ
If you have any questions or encounter issues, please feel free to reach out. For quick queries, you can also check our Issues page for common questions and solutions.
😘 License and Acknowledgement
License: NaturalCC is open-sourced under the MIT license. This permissive license applies not only to the toolkit itself but also to the pre-trained models provided with it.
Acknowledgements: We extend our heartfelt gratitude to the broader open-source community, particularly drawing inspiration from projects like Fairseq for their advanced sequence-to-sequence models, and AllenNLP for their robust NLP components. Their groundbreaking work has been instrumental in shaping the development of NaturalCC.
📄 Citation
We're thrilled that you're interested in using NaturalCC for your research or applications! Citing our work helps us to grow and continue improving this toolkit. You can find more in-depth details about NaturalCC in our paper.
If you use NaturalCC in your research, please consider citing our paper. Below is the BibTeX entry:
```bibtex
@inproceedings{wan2022naturalcc,
  title     = {NaturalCC: An Open-Source Toolkit for Code Intelligence},
  author    = {Yao Wan and Yang He and Zhangqian Bi and Jianguo Zhang and Yulei Sui and Hongyu Zhang and Kazuma Hashimoto and Hai Jin and Guandong Xu and Caiming Xiong and Philip S. Yu},
  booktitle = {Proceedings of the 44th International Conference on Software Engineering, Companion Volume},
  publisher = {ACM},
  year      = {2022}
}
```
Owner
- Name: CGCL-codes
- Login: CGCL-codes
- Kind: organization
- Website: http://grid.hust.edu.cn/
- Repositories: 35
- Profile: https://github.com/CGCL-codes
CGCL/SCTS/BDTS Lab
GitHub Events
Total
- Issues event: 9
- Watch event: 45
- Issue comment event: 15
- Member event: 4
- Push event: 61
- Pull request event: 4
- Fork event: 11
- Create event: 1
Last Year
- Issues event: 9
- Watch event: 45
- Issue comment event: 15
- Member event: 4
- Push event: 61
- Pull request event: 4
- Fork event: 11
- Create event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 6
- Total pull requests: 3
- Average time to close issues: 12 months
- Average time to close pull requests: 4 minutes
- Total issue authors: 6
- Total pull request authors: 3
- Average comments per issue: 1.5
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 3
- Average time to close issues: 25 days
- Average time to close pull requests: 4 minutes
- Issue authors: 4
- Pull request authors: 3
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- NielsRogge (1)
- ChesterDu (1)
- adityabasarkar (1)
- nazmul-md (1)
- Oseghale360 (1)
- ktrk115 (1)
- Chrissa2009 (1)
- imogenxingren (1)
- cyy12345649 (1)
Pull Request Authors
- hanqihong (1)
- qingfengyuhuoda (1)
- jiang1233363 (1)
- stbst1 (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- 317 dependencies
- @types/mocha ^10.0.3 development
- @types/node 18.x development
- @types/vscode ^1.84.0 development
- @typescript-eslint/eslint-plugin ^6.9.0 development
- @typescript-eslint/parser ^6.9.0 development
- @vscode/test-electron ^2.3.6 development
- eslint ^8.52.0 development
- glob ^10.3.10 development
- mocha ^10.2.0 development
- ts-loader ^9.5.0 development
- typescript ^5.2.2 development
- webpack ^5.89.0 development
- webpack-cli ^5.1.4 development
- axios ^1.6.2
- node-fetch ^3.3.2
- fairseq2n *
- jiwer *
- numpy *
- overrides *
- packaging *
- pyyaml *
- sacrebleu *
- torch >=1.12.1
- torcheval *
- tqdm *
- typing_extensions *
- Deprecated *
- boto3 *
- cython *
- dgl-cu102 *
- dpu_utils *
- filelock *
- gdown *
- gpustat *
- h5py *
- jsbeautifier *
- jsonlines *
- loguru *
- mkdocs *
- nltk *
- numba *
- pathos *
- requests *
- rouge *
- ruamel.yaml *
- sentencepiece *
- tables *
- tokenizers *
- tqdm *
- transformers *
- tree-sitter ==0.2.2
- wget *
- cffi *
- cython *
- numpy *
- regex *
- sacrebleu *
- torch *
- tqdm *
- Deprecated *
- boto3 *
- cython *
- dgl-cu102 *
- dpu_utils *
- filelock *
- gdown *
- gpustat *
- h5py *
- jsbeautifier *
- jsonlines *
- loguru *
- mkdocs *
- nltk *
- numba *
- pathos *
- requests *
- rouge *
- ruamel.yaml *
- sentencepiece *
- tables *
- tokenizers *
- tqdm *
- transformers *
- tree-sitter ==0.2.2
- wget *
- Deprecated *
- absl-py *
- boto3 *
- colorlog *
- cython *
- docker *
- dpu_utils *
- filelock *
- gdown *
- gpustat *
- h5py *
- jsbeautifier *
- jsonlines *
- loguru *
- mkdocs *
- nltk *
- numba *
- pandas *
- pathos *
- pynvml *
- requests *
- rouge *
- ruamel.yaml *
- sentencepiece *
- tables *
- tokenizers *
- tqdm *
- transformers *
- tree-sitter ==0.19.0
- ujson *
- wget *
- zenodo-client *
- zenodo-get ==1.0.0
- huggingface_hub ==0.17.3
- numpy ==1.26.4
- openai ==1.12.0
- pandas ==2.2.0
- psycopg2_binary ==2.9.9
- pylint ==3.0.3
- python_Levenshtein ==0.23.0
- python_Levenshtein ==0.25.0
- regex ==2023.10.3
- scipy ==1.12.0
- tiktoken ==0.5.2
- torch ==2.1.0
- tqdm ==4.66.1
- transformers ==4.35.1
- tree_sitter ==0.20.4
- numpy *
- sklearn *
- torch *
- torchvision *
- tqdm *
- 218 dependencies
- @types/mocha ^10.0.10 development
- @types/node 20.x development
- @types/vscode ^1.100.0 development
- @vscode/test-cli ^0.0.10 development
- @vscode/test-electron ^2.5.2 development
- eslint ^9.25.1 development
- aiofiles *
- aiohttp *
- bs4 *
- cssutils *
- datasets *
- pillow *
- playwright *
- pyquery *
- requests ==2.27
- scikit-image *
- scikit-learn *
- tinycss2 *
- torch *
- tqdm *
- transformers *
- urllib3 ==1.26.11
- wandb *
- appdirs ==1.4.4
- asttokens ==2.4.1
- certifi ==2023.11.17
- charset-normalizer ==3.3.2
- click ==8.1.7
- comm ==0.2.1
- debugpy ==1.8.0
- decorator ==5.1.1
- docker-pycreds ==0.4.0
- exceptiongroup ==1.2.0
- executing ==2.0.1
- filelock ==3.13.1
- fsspec ==2023.12.2
- gitdb ==4.0.11
- gitpython ==3.1.41
- huggingface-hub ==0.20.2
- idna ==3.6
- ipykernel ==6.28.0
- ipython ==8.20.0
- jedi ==0.19.1
- jinja2 ==3.1.3
- joblib ==1.3.2
- jupyter-client ==8.6.0
- jupyter-core ==5.7.1
- markupsafe ==2.1.3
- matplotlib-inline ==0.1.6
- mpmath ==1.3.0
- nest-asyncio ==1.5.8
- networkx ==3.2.1
- nltk ==3.8.1
- numpy ==1.26.3
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu12 ==8.9.2.26
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-nccl-cu12 ==2.18.1
- nvidia-nvjitlink-cu12 ==12.3.101
- nvidia-nvtx-cu12 ==12.1.105
- packaging ==23.2
- parso ==0.8.3
- pexpect ==4.9.0
- pillow ==10.2.0
- platformdirs ==4.1.0
- prompt-toolkit ==3.0.43
- protobuf ==4.25.2
- psutil ==5.9.7
- ptyprocess ==0.7.0
- pure-eval ==0.2.2
- pygments ==2.17.2
- python-dateutil ==2.8.2
- pyyaml ==6.0.1
- pyzmq ==25.1.2
- regex ==2023.12.25
- requests ==2.31.0
- safetensors ==0.4.1
- sentry-sdk ==1.39.2
- setproctitle ==1.3.3
- six ==1.16.0
- smmap ==5.0.1
- stack-data ==0.6.3
- sympy ==1.12
- tokenizers ==0.13.3
- torch ==2.1.2
- torchvision ==0.16.2
- tornado ==6.4
- tqdm ==4.66.1
- traitlets ==5.14.1
- transformers ==4.33.1
- triton ==2.1.0
- typing-extensions ==4.9.0
- urllib3 ==2.1.0
- wandb ==0.16.2
- wcwidth ==0.2.13
- appdirs ==1.4.4
- asttokens ==2.4.1
- certifi ==2023.11.17
- charset-normalizer ==3.3.2
- click ==8.1.7
- comm ==0.2.1
- debugpy ==1.8.0
- decorator ==5.1.1
- docker-pycreds ==0.4.0
- exceptiongroup ==1.2.0
- executing ==2.0.1
- filelock ==3.13.1
- fsspec ==2023.12.2
- gitdb ==4.0.11
- gitpython ==3.1.41
- huggingface-hub ==0.20.2
- idna ==3.6
- ipykernel ==6.28.0
- ipython ==8.20.0
- jedi ==0.19.1
- jinja2 ==3.1.3
- joblib ==1.3.2
- jupyter-client ==8.6.0
- jupyter-core ==5.7.1
- markupsafe ==2.1.3
- matplotlib-inline ==0.1.6
- mpmath ==1.3.0
- nest-asyncio ==1.5.8
- networkx ==3.2.1
- nltk ==3.8.1
- numpy ==1.26.3
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu12 ==8.9.2.26
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-nccl-cu12 ==2.18.1
- nvidia-nvjitlink-cu12 ==12.3.101
- nvidia-nvtx-cu12 ==12.1.105
- packaging ==23.2
- parso ==0.8.3
- pexpect ==4.9.0
- pillow ==10.2.0
- platformdirs ==4.1.0
- prompt-toolkit ==3.0.43
- protobuf ==4.25.2
- psutil ==5.9.7
- ptyprocess ==0.7.0
- pure-eval ==0.2.2
- pygments ==2.17.2
- python-dateutil ==2.8.2
- pyyaml ==6.0.1
- pyzmq ==25.1.2
- regex ==2023.12.25
- requests ==2.31.0
- safetensors ==0.4.1
- sentry-sdk ==1.39.2
- setproctitle ==1.3.3
- six ==1.16.0
- smmap ==5.0.1
- stack-data ==0.6.3
- sympy ==1.12
- tokenizers ==0.13.3
- torch ==2.1.2
- torchvision ==0.16.2
- tornado ==6.4
- tqdm ==4.66.1
- traitlets ==5.14.1
- transformers ==4.33.1
- triton ==2.1.0
- typing-extensions ==4.9.0
- urllib3 ==2.1.0
- wandb ==0.16.2
- wcwidth ==0.2.13