codocbench
CoDocBench is a dataset for code-documentation alignment in software maintenance. This repository contains the dataset, source code used to extract it, examples of using it, and some statistics about it.
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.0%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance
This repository contains CoDocBench, a dataset for code-documentation alignment in software maintenance, composed of 4,573 code-documentation pairs extracted from 200 open-source Python projects.
Dataset Description
The CoDocBench dataset described in the paper is in the `dataset` folder, which contains the following files:
- `codocbench.jsonl`: The main dataset file containing 4,573 code-documentation pairs.
- `test.jsonl`: The test dataset file containing 2,273 code-documentation pairs from a random selection of 50% of the projects.
- `train.jsonl`: The training dataset file containing 2,300 code-documentation pairs from the remaining 50% of the projects.
The dataset is in JSONL format; each line is a JSON object with the following fields:
```json
{
  "file": "string",                     // File name or path.
  "function": "string",                 // Fully qualified function/method name.
  "versiondata": [                      // List of version-specific data.
    {
      "version1": "string",             // Version identifier.
      "docstringlines": {               // Docstring line range.
        "startline": "integer",
        "endline": "integer"
      },
      "codelines": {                    // Code line range.
        "startline": "integer",
        "endline": "integer"
      },
      "commitdatetime": "string",       // Timestamp of the commit.
      "commitsha": "string",            // Commit hash.
      "commitmessage": "string",        // Commit message.
      "docstring": "string",            // Function docstring.
      "code": "string"                  // Function code.
    },
    {
      "version2": "string",             // Version identifier.
      "docstringlines": {               // Docstring line range.
        "startline": "integer",
        "endline": "integer"
      },
      "codelines": {                    // Code line range.
        "startline": "integer",
        "endline": "integer"
      },
      "commitdatetime": "string",       // Timestamp of the commit.
      "commitsha": "string",            // Commit hash.
      "commitmessage": "string",        // Commit message.
      "docstring": "string",            // Function docstring.
      "code": "string"                  // Function code.
    }
  ],
  "diffcode": "string",                 // Unified diff for the function code.
  "diffdocstring": "string",            // Unified diff for the docstring.
  "whitespaceonlycode": "boolean",      // Indicates if code diff is whitespace-only.
  "whitespaceonlydocstring": "boolean", // Indicates if docstring diff is whitespace-only.
  "filepath": "string",                 // Full file path.
  "filename": "string",                 // File name.
  "project": "string",                  // Project name.
  "owner": "string"                     // Owner of the repository.
}
```
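Given this schema, a minimal loading sketch in Python (field names are taken from the schema above; the tiny in-memory sample below is illustrative, not real dataset content):

```python
import json

def load_codocbench(path):
    """Yield one code-documentation pair per line of a CoDocBench JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def non_trivial(pairs):
    """Drop pairs whose code diff is whitespace-only."""
    return [p for p in pairs if not p["whitespaceonlycode"]]

# Tiny in-memory sample mirroring the schema (real entries also carry
# versiondata, diffs, and provenance fields).
sample = [
    {"function": "foo", "whitespaceonlycode": False},
    {"function": "bar", "whitespaceonlycode": True},
]
kept = non_trivial(sample)
print([p["function"] for p in kept])  # → ['foo']
```

The same filter works unchanged on the real file via `non_trivial(load_codocbench("dataset/codocbench.jsonl"))`.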
Extracting Your Own Dataset
To extract your own dataset, follow these steps:
1. Clone the repository:

```bash
git clone https://github.com/kunpai/codocbench.git
```

2. Install the required dependencies:

```bash
./setup.sh
```

NOTE: This script sets up a virtual environment and installs the required dependencies. It defaults to Python 3.13. If you have a different Python version, run:

```bash
./setup.sh <PYTHON_VERSION>
```

where `<PYTHON_VERSION>` is the version of Python you want to use. If you prefer to use your own environment, you can install the dependencies manually by running:

```bash
pip install -r requirements
```

(Be sure to make the script executable first by running `chmod +x setup.sh`.)

3. Activate the virtual environment:

```bash
source codocbench-env/bin/activate
```

To extract your own dataset, use the `parse.py` script. The script has a few variants that let you customize the extraction process.

Variant 1: Extracting from a single project

To extract code-documentation pairs from a single project, use the following command:

```bash
python parse.py owner repo
```

where `owner` is the owner of the repository and `repo` is the name of the repository.

Variant 2: Extracting from multiple projects

To extract code-documentation pairs from multiple projects, use the following command:

```bash
python parse.py
```

This command extracts code-documentation pairs from all the projects listed in the `projects.csv` file. Ensure that `projects.csv` lists the owner and repository name of each project you want to extract, separated by a comma. The `projects.csv` file in this repository contains the owner and repository name of the projects used in the CoDocBench dataset.

Variant 3: Extracting from a specific file

To extract code-documentation pairs from a specific file, use the following command:

```bash
python parse.py owner repo path
```

where `owner` is the owner of the repository, `repo` is the name of the repository, and `path` is the path to the file. NOTE: The path should be relative to the root of the repository and must exist in the latest commit of the repository.
The extracted code-documentation pairs will be saved in the `differ_files/` folder in JSONL format, in a file named `codocbench.jsonl`.
The `parse.py` script also records solitary docstring changes and solitary code changes in the `differ_files/` folder, in files named `combined_diff_mapping_docstring_.jsonl` and `combined_diff_mapping_code_.jsonl`, respectively. However, these are not post-processed and may contain false positives.
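For Variant 2, a minimal sketch of reading `projects.csv` (assuming one `owner,repo` pair per line, as described above; the `kunpai,codocbench` entry is real, while `pallets,flask` is purely illustrative):

```python
import csv
import io

# A projects.csv in the format the README describes: one "owner,repo"
# pair per line.
sample_csv = "kunpai,codocbench\npallets,flask\n"

def read_projects(fp):
    """Return (owner, repo) tuples from a projects.csv-style file object."""
    return [(row[0], row[1]) for row in csv.reader(fp) if row]

projects = read_projects(io.StringIO(sample_csv))
print(projects)  # → [('kunpai', 'codocbench'), ('pallets', 'flask')]
```

To read the real file, pass an open file handle instead of the `StringIO` sample: `read_projects(open("projects.csv"))`.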
Examples
Example scripts of using the dataset are provided in the examples folder. The scripts demonstrate how to load the dataset and use it for various tasks.
For most of the examples, you can run the script using the following command:
```bash
python examples/<FILENAME>.py <PATH_TO_DATASET>
```
where <FILENAME> is the name of the script and <PATH_TO_DATASET> is the path to the dataset file.
In case of the 3-shot learning examples, you can run the script using the following command:
```bash
python examples/<FILENAME>.py <PATH_TO_DATASET> <PATH_TO_TRAIN_DATASET>
```
where <FILENAME> is the name of the script, <PATH_TO_DATASET> is the path to the dataset file, and <PATH_TO_TRAIN_DATASET> is the path to the training dataset file.
All these scripts load `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo` as the default model. You can change the model by running the script with the `--model` flag:
```bash
python examples/<FILENAME>.py <PATH_TO_DATASET> --model=<MODEL_NAME>
```
where <MODEL_NAME> is the name of the model you want to use.
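The internals of the example scripts are not shown here, but a command-line interface that matches the invocations above (a positional dataset path plus an optional `--model` flag) might be sketched as follows; the default model name comes from this README, and everything else is an assumption:

```python
import argparse

DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"

def parse_args(argv=None):
    """Parse <PATH_TO_DATASET> plus an optional --model override."""
    parser = argparse.ArgumentParser(description="Run an example over CoDocBench.")
    parser.add_argument("dataset", help="Path to the JSONL dataset file")
    parser.add_argument("--model", default=DEFAULT_MODEL, help="Model name to load")
    return parser.parse_args(argv)

# Simulated command line: python example.py dataset/codocbench.jsonl --model=my-model
args = parse_args(["dataset/codocbench.jsonl", "--model=my-model"])
print(args.dataset, args.model)  # → dataset/codocbench.jsonl my-model
```

With no `--model` flag, `args.model` falls back to the default shown above, matching the README's description.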
Owner
- Name: Kunal Pai
- Login: kunpai
- Kind: user
- Company: @darchr
- Repositories: 4
- Profile: https://github.com/kunpai
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Pai"
    given-names: "Kunal"
    orcid: "https://orcid.org/0009-0003-0675-7135"
  - family-names: "Devanbu"
    given-names: "Premkumar"
    orcid: "https://orcid.org/0000-0002-4346-5276"
  - family-names: "Ahmed"
    given-names: "Toufique"
    orcid: "https://orcid.org/0000-0002-4427-1350"
title: "CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance"
version: 1.0
doi: 10.5281/zenodo.14251622
date-released: 2024-11-30
url: "https://github.com/kunpai/codocbench"
```
GitHub Events
Total
- Create event: 4
- Issues event: 2
- Release event: 4
- Watch event: 1
- Issue comment event: 2
- Public event: 1
- Push event: 5
- Fork event: 1
Last Year
- Create event: 4
- Issues event: 2
- Release event: 4
- Watch event: 1
- Issue comment event: 2
- Public event: 1
- Push event: 5
- Fork event: 1