codeassist

CodeAssist is an advanced code completion tool that provides high-quality code completions for Python, Java, C++, and other languages.

https://github.com/shibing624/codeassist

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.5%) to scientific vocabulary

Keywords

auto-completion code-autocomplete code-generation gpt-4 gpt2 starcoder wizardcoder
Last synced: 4 months ago

Repository

CodeAssist is an advanced code completion tool that provides high-quality code completions for Python, Java, C++, and other languages.

Basic Info
  • Host: GitHub
  • Owner: shibing624
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.06 MB
Statistics
  • Stars: 58
  • Watchers: 3
  • Forks: 8
  • Open Issues: 2
  • Releases: 4
Topics
auto-completion code-autocomplete code-generation gpt-4 gpt2 starcoder wizardcoder
Created almost 4 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Citation

README.md

🇨🇳Chinese | 🌐English | 📖Docs | 🤖Models


CodeAssist: Advanced Code Completion Tool


Introduction

CodeAssist is an advanced code completion tool that intelligently provides high-quality code completions for Python, Java, C++, and other languages.

Features

  • GPT-based code completion
  • Code completion for Python, Java, C++, JavaScript, and more
  • Line and block code completion
  • Train (fine-tune) and predict with the model on your own data

Release Models

| Arch | BaseModel         | Model                                          | Model Size |
|:-----|:------------------|:-----------------------------------------------|:----------:|
| GPT  | gpt2              | shibing624/code-autocomplete-gpt2-base         | 487MB      |
| GPT  | distilgpt2        | shibing624/code-autocomplete-distilgpt2-python | 319MB      |
| GPT  | bigcode/starcoder | WizardLM/WizardCoder-15B-V1.0                  | 29GB       |

Demo

HuggingFace Demo: https://huggingface.co/spaces/shibing624/code-autocomplete

backend model: shibing624/code-autocomplete-gpt2-base

Install

```shell
pip install torch  # or: conda install pytorch
pip install -U codeassist
```

or

```shell
git clone https://github.com/shibing624/codeassist.git
cd codeassist
python setup.py install
```

Usage

WizardCoder model

WizardCoder-15B is bigcode/starcoder fine-tuned with Alpaca code data. You can use the following code to generate code:

example: examples/wizardcoder_demo.py

```python
import sys

sys.path.append('..')
from codeassist import WizardCoder

m = WizardCoder("WizardLM/WizardCoder-15B-V1.0")
print(m.generate('def load_csv_file(file_path):')[0])
```

output:

```python
import csv

def load_csv_file(file_path):
    """
    Load data from a CSV file and return a list of dictionaries.
    """
    # Open the file in read mode
    with open(file_path, 'r') as file:
        # Create a CSV reader object
        csv_reader = csv.DictReader(file)
        # Initialize an empty list to store the data
        data = []
        # Iterate over each row of data
        for row in csv_reader:
            # Append the row of data to the list
            data.append(row)
        # Return the list of data
        return data
```

The model output is impressively effective. It currently supports English and Chinese input; you can enter instructions or code prefixes as required.
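
For example, a minimal sketch of instruction-style input, reusing the `m` instance from the demo above (the instruction text itself is only illustrative):

```python
# Instruction-style prompt instead of a code prefix; generate() is the same
# call as in the demo above, and this particular instruction is illustrative.
print(m.generate('Write a Python function that checks whether a number is prime.')[0])
```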

distilgpt2 model

A code autocomplete model fine-tuned from distilgpt2; you can use it with the following code:

example: examples/distilgpt2_demo.py

```python
import sys

sys.path.append('..')
from codeassist import GPT2Coder

m = GPT2Coder("shibing624/code-autocomplete-distilgpt2-python")
print(m.generate('import torch.nn as')[0])
```

output:

```python
import torch.nn as nn
import torch.nn.functional as F
```

Use with huggingface/transformers:

example: examples/use_transformers_gpt2.py
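
For reference, a minimal sketch of loading the checkpoint directly with huggingface/transformers (the sampling parameters are illustrative assumptions, not the example script's exact settings):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("shibing624/code-autocomplete-distilgpt2-python")
model = GPT2LMHeadModel.from_pretrained("shibing624/code-autocomplete-distilgpt2-python")

inputs = tokenizer("import torch.nn as", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=64,          # illustrative generation budget
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```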

Train Model

Train WizardCoder model

example: examples/training_wizardcoder_mydata.py

```shell
cd examples
CUDA_VISIBLE_DEVICES=0,1 python training_wizardcoder_mydata.py --do_train --do_predict --num_epochs 1 --output_dir outputs-wizard --model_name WizardLM/WizardCoder-15B-V1.0
```

  • GPU memory: 31GB
  • Fine-tuning needs 2 × V100 (32GB)
  • Inference needs 1 × V100 (32GB)

Train distilgpt2 model

example: examples/training_gpt2_mydata.py

```shell
cd examples
python training_gpt2_mydata.py --do_train --do_predict --num_epochs 15 --output_dir outputs-gpt2 --model_name gpt2
```

PS: the resulting fine-tuned model is GPT2-python (shibing624/code-autocomplete-gpt2-base); fine-tuning took about 24 hours on a V100.

Server

Start the FastAPI server:

example: examples/server.py

```shell
cd examples
python server.py
```

Open http://0.0.0.0:8001/docs in your browser.
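
A hedged sketch of calling the service with requests; the route and payload below are hypothetical placeholders, so check the /docs page for the actual schema that examples/server.py exposes:

```python
import requests

# Hypothetical route and payload, for illustration only; the real endpoint and
# request schema are listed on the interactive docs page at /docs.
resp = requests.post(
    "http://0.0.0.0:8001/code_completion",
    json={"prompt": "import torch.nn as"},
)
print(resp.json())
```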


Dataset

This allows you to customize dataset building. Below is an example of the building process.

Let's use Python code from Awesome-pytorch-list:

  1. We want the model to help auto-complete code at a general level, and the code of The Algorithms suits that need.
  2. The code from this project is well written (high quality).

dataset tree:

```shell
examples/download/python
├── train.txt
├── valid.txt
└── test.txt
```

There are three ways to build the dataset (a file-writing sketch follows these options):

  1. Use the huggingface/datasets library to load the dataset: https://huggingface.co/datasets/shibing624/source_code

```python
from datasets import load_dataset

dataset = load_dataset("shibing624/source_code", "python")  # python, java, or cpp
print(dataset)
print(dataset['test'][0:10])
```

output:

```shell
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 5215412
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['text'],
        num_rows: 10000
    })
})
{'text': [" {'max_epochs': [1, 2]},\n", ' refit=False,\n', ' cv=3,\n', " scoring='roc_auc',\n", ' )\n', ' search.fit(*data)\n', '', ' def test_module_output_not_1d(self, net_cls, data):\n', ' from skorch.toy import make_classifier\n', ' module = make_classifier(\n']}
```

  2. Download the dataset from the cloud

| Name                        | Source                                    | Download               | Size |
| :-------------------------- | :---------------------------------------- | :--------------------: | :--: |
| Python+Java+CPP source code | Awesome-pytorch-list (5.22 million lines) | github_source_code.zip | 105M |

Download the dataset, unzip it, and put it under examples/.

  3. Get source code from scratch and build the dataset

prepare_code_data.py

```shell
cd examples
python prepare_code_data.py --num_repos 260
```
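
As mentioned above, a minimal sketch that dumps the huggingface dataset splits into the train/valid/test text files from the dataset tree (the directory layout comes from that tree; the split-to-filename mapping is an assumption):

```python
import os

from datasets import load_dataset

dataset = load_dataset("shibing624/source_code", "python")
os.makedirs("examples/download/python", exist_ok=True)

# Assumed mapping from huggingface split names to the filenames in the tree above.
split_to_file = {"train": "train.txt", "validation": "valid.txt", "test": "test.txt"}
for split, filename in split_to_file.items():
    with open(f"examples/download/python/{filename}", "w", encoding="utf-8") as f:
        for row in dataset[split]:
            f.write(row["text"])  # rows carry their own line endings
```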

Contact

  • Issues (suggestions): GitHub issues
  • Email: xuming624@qq.com
  • WeChat: add WeChat ID xuming624 with the note "name-company-NLP" to join the NLP discussion group

Citation

If you use codeassist in your research, please cite it in the following format:

APA:

```latex
Xu, M. codeassist: Code AutoComplete with GPT model (Version 1.0.0) [Computer software]. https://github.com/shibing624/codeassist
```

BibTeX:

```latex
@software{Xu_codeassist,
  author = {Ming Xu},
  title = {CodeAssist: Code AutoComplete with Generation model},
  url = {https://github.com/shibing624/codeassist},
  version = {1.0.0}
}
```

License

This repository is licensed under the Apache License 2.0.

Please follow the Attribution-NonCommercial 4.0 International license when using the WizardCoder model.

Contribute

The project code is still rough. If you have improvements, you are welcome to submit them back to this project. Before submitting, note the following two points:

  • Add corresponding unit tests in tests
  • Run python setup.py test to execute all unit tests and make sure they all pass

After that, you can submit a PR.


Owner

  • Name: xuming
  • Login: shibing624
  • Kind: user
  • Location: Beijing, China
  • Company: @tencent

Senior Researcher, Machine Learning Developer, Advertising Risk Control.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
  given-names: "Ming"
title: "code-autocomplete: Code AutoComplete with GPT2 model"
url: "https://github.com/shibing624/code-autocomplete"
date-released: 2022-03-01
version: 0.0.4

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 100
  • Total Committers: 2
  • Avg Commits per committer: 50.0
  • Development Distribution Score (DDS): 0.02
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|:-----|:------|:-------:|
| shibing624 | s****4@1****m | 98 |
| flemingxu | f****u@t****m | 2 |
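
The DDS figures above are consistent with the standard definition (an assumption about the convention used here: one minus the top committer's share of commits); for the all-time numbers:

```latex
\mathrm{DDS} = 1 - \frac{\text{commits by top committer}}{\text{total commits}} = 1 - \frac{98}{100} = 0.02
```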
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 12 days
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mrT23 (1)
  • donghaiwang (1)
  • ChiYeungLaw (1)
  • fade-color (1)
Pull Request Authors
Top Labels
Issue Labels
wontfix (2) question (2) enhancement (1)
Pull Request Labels

Dependencies

requirements.txt pypi
  • loguru *
  • pandas *
  • transformers >=4.6.0
setup.py pypi
  • loguru *
  • pandas *
  • transformers >=4.6.0