codocbench
CoDocBench is a dataset for code-documentation alignment in software maintenance. This repository contains the dataset, source code used to extract it, examples of using it, and some statistics about it.
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.0%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance
This repository contains CoDocBench, a dataset for code-documentation alignment in software maintenance, composed of 4,573 code-documentation pairs extracted from 200 open-source Python projects.
Dataset Description
The CoDocBench dataset described in the paper is in the `dataset` folder, which contains the following files:
- `codocbench.jsonl`: The main dataset file containing 4,573 code-documentation pairs.
- `test.jsonl`: The test dataset file containing 2,273 code-documentation pairs from a random selection of 50% of the projects.
- `train.jsonl`: The training dataset file containing 2,300 code-documentation pairs from the remaining 50% of the projects.
The dataset is in JSONL format; each line is a JSON object with the following fields:
```json
{
  "file": "string",                     // File name or path.
  "function": "string",                 // Fully qualified function/method name.
  "versiondata": [                      // List of version-specific data.
    {
      "version1": "string",             // Version identifier.
      "docstringlines": {               // Docstring line range.
        "startline": "integer",
        "endline": "integer"
      },
      "codelines": {                    // Code line range.
        "startline": "integer",
        "endline": "integer"
      },
      "commitdatetime": "string",       // Timestamp of the commit.
      "commitsha": "string",            // Commit hash.
      "commitmessage": "string",        // Commit message.
      "docstring": "string",            // Function docstring.
      "code": "string"                  // Function code.
    },
    {
      "version2": "string",             // Version identifier.
      "docstringlines": {               // Docstring line range.
        "startline": "integer",
        "endline": "integer"
      },
      "codelines": {                    // Code line range.
        "startline": "integer",
        "endline": "integer"
      },
      "commitdatetime": "string",       // Timestamp of the commit.
      "commitsha": "string",            // Commit hash.
      "commitmessage": "string",        // Commit message.
      "docstring": "string",            // Function docstring.
      "code": "string"                  // Function code.
    }
  ],
  "diffcode": "string",                 // Unified diff for the function code.
  "diffdocstring": "string",            // Unified diff for the docstring.
  "whitespaceonlycode": "boolean",      // Indicates if code diff is whitespace-only.
  "whitespaceonlydocstring": "boolean", // Indicates if docstring diff is whitespace-only.
  "filepath": "string",                 // Full file path.
  "filename": "string",                 // File name.
  "project": "string",                  // Project name.
  "owner": "string"                     // Owner of the repository.
}
```
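Given this schema, a minimal loading sketch in Python (field names are taken from the schema above; the tiny in-memory sample below is illustrative, not real dataset content):

```python
import json

def load_codocbench(path):
    """Yield one code-documentation pair per line of a CoDocBench JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def non_trivial(pairs):
    """Drop pairs whose code diff is whitespace-only."""
    return [p for p in pairs if not p["whitespaceonlycode"]]

# Tiny in-memory sample mirroring the schema (real entries also carry
# versiondata, diffs, and provenance fields).
sample = [
    {"function": "foo", "whitespaceonlycode": False},
    {"function": "bar", "whitespaceonlycode": True},
]
kept = non_trivial(sample)
print([p["function"] for p in kept])  # → ['foo']
```

The same filter works unchanged on the real file via `non_trivial(load_codocbench("dataset/codocbench.jsonl"))`.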
Extracting Your Own Dataset
To extract your own dataset, follow these steps:
1. Clone the repository:

```bash
git clone https://github.com/kunpai/codocbench.git
```

2. Install the required dependencies:

```bash
./setup.sh
```

NOTE: This script sets up a virtual environment and installs the required dependencies. It defaults to Python 3.13. If you have a different Python version, run:

```bash
./setup.sh <PYTHON_VERSION>
```

where `<PYTHON_VERSION>` is the version of Python you want to use. If you prefer to use your own environment, you can install the dependencies manually by running:

```bash
pip install -r requirements
```

(Be sure to make the script executable first by running `chmod +x setup.sh`.)

3. Activate the virtual environment:

```bash
source codocbench-env/bin/activate
```

To extract your own dataset, use the `parse.py` script. The script has a few variants that let you customize the extraction process.

Variant 1: Extracting from a single project

To extract code-documentation pairs from a single project, use the following command:

```bash
python parse.py owner repo
```

where `owner` is the owner of the repository and `repo` is the name of the repository.

Variant 2: Extracting from multiple projects

To extract code-documentation pairs from multiple projects, use the following command:

```bash
python parse.py
```

This command extracts code-documentation pairs from all the projects listed in the `projects.csv` file. Ensure that `projects.csv` lists the owner and repository name of each project you want to extract, separated by a comma. The `projects.csv` file in this repository contains the owner and repository name of the projects used in the CoDocBench dataset.

Variant 3: Extracting from a specific file

To extract code-documentation pairs from a specific file, use the following command:

```bash
python parse.py owner repo path
```

where `owner` is the owner of the repository, `repo` is the name of the repository, and `path` is the path to the file. NOTE: The path should be relative to the root of the repository and must exist in the latest commit of the repository.
The extracted code-documentation pairs will be saved in the `differ_files/` folder in JSONL format, in a file named `codocbench.jsonl`.
The `parse.py` script also records solitary docstring changes and solitary code changes in the `differ_files/` folder, in files named `combined_diff_mapping_docstring_.jsonl` and `combined_diff_mapping_code_.jsonl`, respectively. However, these are not post-processed and may contain false positives.
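For Variant 2, a minimal sketch of reading `projects.csv` (assuming one `owner,repo` pair per line, as described above; the `kunpai,codocbench` entry is real, while `pallets,flask` is purely illustrative):

```python
import csv
import io

# A projects.csv in the format the README describes: one "owner,repo"
# pair per line.
sample_csv = "kunpai,codocbench\npallets,flask\n"

def read_projects(fp):
    """Return (owner, repo) tuples from a projects.csv-style file object."""
    return [(row[0], row[1]) for row in csv.reader(fp) if row]

projects = read_projects(io.StringIO(sample_csv))
print(projects)  # → [('kunpai', 'codocbench'), ('pallets', 'flask')]
```

To read the real file, pass an open file handle instead of the `StringIO` sample: `read_projects(open("projects.csv"))`.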
Examples
Example scripts of using the dataset are provided in the examples folder. The scripts demonstrate how to load the dataset and use it for various tasks.
For most of the examples, you can run the script using the following command:
```bash
python examples/<FILENAME>.py <PATH_TO_DATASET>
```
where <FILENAME> is the name of the script and <PATH_TO_DATASET> is the path to the dataset file.
In case of the 3-shot learning examples, you can run the script using the following command:
```bash
python examples/<FILENAME>.py <PATH_TO_DATASET> <PATH_TO_TRAIN_DATASET>
```
where <FILENAME> is the name of the script, <PATH_TO_DATASET> is the path to the dataset file, and <PATH_TO_TRAIN_DATASET> is the path to the training dataset file.
All these scripts load `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo` as the default model. You can change the model by running the script with the `--model` flag:
```bash
python examples/<FILENAME>.py <PATH_TO_DATASET> --model=<MODEL_NAME>
```
where <MODEL_NAME> is the name of the model you want to use.
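The internals of the example scripts are not shown here, but a command-line interface that matches the invocations above (a positional dataset path plus an optional `--model` flag) might be sketched as follows; the default model name comes from this README, and everything else is an assumption:

```python
import argparse

DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo"

def parse_args(argv=None):
    """Parse <PATH_TO_DATASET> plus an optional --model override."""
    parser = argparse.ArgumentParser(description="Run an example over CoDocBench.")
    parser.add_argument("dataset", help="Path to the JSONL dataset file")
    parser.add_argument("--model", default=DEFAULT_MODEL, help="Model name to load")
    return parser.parse_args(argv)

# Simulated command line: python example.py dataset/codocbench.jsonl --model=my-model
args = parse_args(["dataset/codocbench.jsonl", "--model=my-model"])
print(args.dataset, args.model)  # → dataset/codocbench.jsonl my-model
```

With no `--model` flag, `args.model` falls back to the default shown above, matching the README's description.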
Owner
- Name: Kunal Pai
- Login: kunpai
- Kind: user
- Company: @darchr
- Repositories: 4
- Profile: https://github.com/kunpai
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Pai"
    given-names: "Kunal"
    orcid: "https://orcid.org/0009-0003-0675-7135"
  - family-names: "Devanbu"
    given-names: "Premkumar"
    orcid: "https://orcid.org/0000-0002-4346-5276"
  - family-names: "Ahmed"
    given-names: "Toufique"
    orcid: "https://orcid.org/0000-0002-4427-1350"
title: "CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance"
version: 1.0
doi: 10.5281/zenodo.14251622
date-released: 2024-11-30
url: "https://github.com/kunpai/codocbench"
```
GitHub Events
Total
- Create event: 4
- Issues event: 2
- Release event: 4
- Watch event: 1
- Issue comment event: 2
- Public event: 1
- Push event: 5
- Fork event: 1
Last Year
- Create event: 4
- Issues event: 2
- Release event: 4
- Watch event: 1
- Issue comment event: 2
- Public event: 1
- Push event: 5
- Fork event: 1