many-types-4-py-dataset

ManyTypes4Py: A benchmark Python dataset for machine learning-based type inference

https://github.com/saltudelft/many-types-4-py-dataset

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.5%) to scientific vocabulary

Keywords

benchmark clean dataset machine-learning manytypes4py msr mt4py python type-annotations type-checked type-inference visible-type-hints

Repository

Basic Info

Host: GitHub
Owner: saltudelft
License: apache-2.0
Language: Jupyter Notebook
Default Branch: master
Size: 14.6 MB

Statistics

Stars: 18
Watchers: 7
Forks: 5
Open Issues: 1
Releases: 0

Created over 5 years ago · Last pushed almost 4 years ago

Metadata Files

Readme Changelog License Citation

ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference

Intro
Download
Preparation
Citing MT4Py
Roadmap

Intro

The dataset comes in clean and complete versions (from v0.7):
- The clean version has 5.1K type-checked Python repositories and 1.2M type annotations.
- The complete version has 5.2K Python repositories and 3.3M type annotations.
In the clean version, its source files are type-checked using mypy.
Its projects were processed into JSON-formatted files using the LibSA4Py pipeline.
Its source files are already split into training, validation, and test sets for training ML models.
It is de-duplicated using CD4Py.
It contains Visible Type Hints (VTHs), which come from a deep, recursive, and dynamic analysis of types from the import statements of source files and their dependencies.
It is published in the Data Showcase of the MSR'21 conference.
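
To make the annotation counts above concrete, the following minimal sketch counts type annotations in a single Python source string using the standard `ast` module. How ManyTypes4Py itself counts annotations is not described here, so the counting rule (annotated parameters, return types, and annotated assignments) and the function name `count_type_annotations` are assumptions made for illustration only.

```python
import ast


def count_type_annotations(source: str) -> int:
    """Count annotated parameters, return types, and annotated assignments.

    This counting rule is an illustrative assumption, not the dataset's own.
    """
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            # Concatenation builds a new list, so the AST is not modified.
            params = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                params.append(args.vararg)
            if args.kwarg is not None:
                params.append(args.kwarg)
            count += sum(1 for p in params if p.annotation is not None)
            if node.returns is not None:
                count += 1
        elif isinstance(node, ast.AnnAssign):
            count += 1
    return count


example = '''
def scale(x: float, factor: int = 2) -> float:
    total: float = x * factor
    return total
'''
print(count_type_annotations(example))  # prints 4
```

The example contains two annotated parameters, one return annotation, and one annotated local variable. Note that `posonlyargs` requires Python 3.8 or newer.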

Downloading dataset

The latest version of the dataset is publicly available on Zenodo.

Dataset preparation

We highly recommend downloading the latest version of the dataset as described above. If you want to prepare the dataset manually, follow the instructions below.

Requirements

Python 3.5 or newer
Python dependencies from scripts/requirements.txt installed (run pip install -r scripts/requirements.txt)
The libsa4py package installed (run git clone https://github.com/saltudelft/libsa4py.git && cd libsa4py && pip install .)

Steps

Clone the repositories listed in mypy-dependents-by-stars.json:

python -m repo_cloner -i ./mypy-dependents-by-stars.json -o repos

To reset the cloned repositories to the state used by ManyTypes4Py, run the following command with the ManyTypes4PyDataset.spec file:

./scripts/reset_commits.sh ./ManyTypes4PyDataset.spec repos

Generate duplicate tokens for the dataset using cd4py:

cd4py --p repos --ot tokens --od manytypes4py_dataset_duplicates.jsonl.gz --d 1024

Gather the duplicate files from the cd4py output tokens and write them to a single text file (using collect_dupes.py):

python3 scripts/collect_dupes.py manytypes4py_dataset_duplicates.jsonl.gz manytypes4py_dup_files.txt

Create a copy of the dataset with the previously collected duplicate files removed (using process_dataset.py):

python3 scripts/process_dataset.py repos manytypes4py_dup_files.txt [new dataset path]

Split the dataset into training, validation, and test sets (using split_dataset.py):

python3 scripts/split_dataset.py [new dataset path] manytypes4py_split.csv

To process the Python repositories and produce JSON output files, run the libsa4py pipeline as follows:

libsa4py process --p [new dataset path] --o [processed projects path] --s manytypes4py_split.csv --j [WORKERS COUNT]

Check out the libsa4py README for more info on its usage.

Create a tar archive of the folder containing the full dataset and its artifacts:

tar -czvf [output path] [dataset artifacts path]
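
The preparation steps above can also be collected in one place. The sketch below only assembles the command lines quoted in this README as argument lists; it does not add or change any flags. The function name `build_pipeline_commands`, its parameters, and the example paths in the dry run are hypothetical, and the commands would still need to be run from the repository root with the requirements installed.

```python
import shlex
from typing import List


def build_pipeline_commands(
    dataset_path: str,
    processed_path: str,
    workers: int,
    archive_path: str,
    artifacts_path: str,
) -> List[List[str]]:
    """Return the README's preparation commands, in order, as argv lists."""
    # Intermediate file names are the ones used in the README.
    dupes_tokens = "manytypes4py_dataset_duplicates.jsonl.gz"
    dupes_files = "manytypes4py_dup_files.txt"
    split_csv = "manytypes4py_split.csv"
    return [
        ["python", "-m", "repo_cloner",
         "-i", "./mypy-dependents-by-stars.json", "-o", "repos"],
        ["./scripts/reset_commits.sh", "./ManyTypes4PyDataset.spec", "repos"],
        ["cd4py", "--p", "repos", "--ot", "tokens",
         "--od", dupes_tokens, "--d", "1024"],
        ["python3", "scripts/collect_dupes.py", dupes_tokens, dupes_files],
        ["python3", "scripts/process_dataset.py", "repos", dupes_files, dataset_path],
        ["python3", "scripts/split_dataset.py", dataset_path, split_csv],
        ["libsa4py", "process", "--p", dataset_path, "--o", processed_path,
         "--s", split_csv, "--j", str(workers)],
        ["tar", "-czvf", archive_path, artifacts_path],
    ]


if __name__ == "__main__":
    # Dry run with hypothetical paths: print the commands, do not execute them.
    commands = build_pipeline_commands(
        "dataset", "processed", 4, "mt4py.tar.gz", "artifacts"
    )
    for argv in commands:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually run a step, each list can be passed to `subprocess.run(argv, check=True)`; printing the commands first makes it easy to review the paths before anything is cloned or deleted.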

Citing the dataset

If you have used the dataset in your research work, please consider citing it.

@inproceedings{mt4py2021,
author = {A. M. Mir and E. Latoskinas and G. Gousios},
booktitle = {IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
title = {ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference},
year = {2021},
pages = {585-589},
doi = {10.1109/MSR52588.2021.00079},
publisher = {IEEE Computer Society},
month = {May}
}

Roadmap

Gather Python projects that depend on type checkers other than mypy, namely pyre, pytype, and pyright.
Apply type annotations from typeshed to the dataset.

Owner

Name: Software Analytics Lab
Login: saltudelft
Kind: organization
Location: Delft, NL

Website: https://se.ewi.tudelft.nl/research-lines/software-analytics/
Repositories: 9
Profile: https://github.com/saltudelft

Software Analytics Lab @ TU Delft

GitHub Events

Total

Watch event: 5
Fork event: 1

Last Year

Watch event: 5
Fork event: 1

Dependencies

scripts/requirements.txt pypi

cd4py *
dpu-utils *
pandas *
sklearn *
