bisect

How are programming errors distributed

https://github.com/llewelld/bisect

Basic Info

Host: GitHub
Owner: llewelld
License: mit
Language: TeX
Default Branch: master
Size: 5.58 MB

Statistics

Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 6 years ago · Last pushed over 3 years ago

bisect

How are programming errors distributed?

This repository contains scripts used for analysing repositories to determine how regressions are distributed between the release directly prior to the commit and the commit that fixes them.

It also contains the write-up of the analysis and results.

Useful links

Scripts and LaTeX source: https://github.com/llewelld/bisect
Dataset: https://osf.io/wxyaj/

Process

1. Obtain the dataset

You can either download the same dataset used for the write-up from the Open Science Framework or generate your own dataset from GitHub. The former requires downloading a single 1.1 GB archive, the latter collects the data using the GitHub API. Downloading the archive will be a lot quicker.

Follow 1a or 1b below depending on which route you prefer to take.

1a. Download existing dataset

The dataset is publicly available from osf.io. You can browse it at https://osf.io/wxyaj/.

The following command can be used to download the full dataset. The archive is 1.1 GB in size (4.3 GB when decompressed), so it may take some time to download on a slower connection.

```
cd bisect
curl -X "GET" \
  'https://files.de-1.osf.io/v1/resources/wxyaj/providers/osfstorage/?zip=' \
  > results.zip
```

You should then unzip the result into the root of the project folder:

    unzip results.zip

1b. Collect data from GitHub

This will collect data about projects on GitHub.

```
Syntax: github-find <forks|updated> <language>

Searches github for repositories containing the requested language,
bare clones them and analyses the history and diffs to discover
regressions and their fixes.
  <forks|updated> : order based on number of forks or most recent update time.
  <language>      : the language to search for.
```

For info about the ordering, see: https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories

This command will output the following files:

    ./results/index.json
    ./results/{language}/data{xxxxx}.json

Example usage:

    ./github-find.py forks c

2. Analyse the data

This will analyse the data collected from projects on GitHub.

```
Syntax: analyse-bisect [a1 a2 a3]

Reads in details of commits for different projects and applies the
bisect algorithm to them.
  : a directory containing json commit files to perform the bisect algorithm on.
  : a file to output the results to in json format.
  : one of "commits", "lines" or "blocks".
  [a1 a2 a3] : optional exponential polynomial coefficients $e^{a3 x^2 + a2 x + a1}$.

Example usage:
mkdir -p stats/c
./analyse-bisect.py results/c/ stats/c/commits.json commits
./analyse-bisect.py results/c/ stats/c/lines.json lines
./analyse-bisect.py results/c/ stats/c/blocks.json blocks
```

This will analyse the commit data extracted from github and output a set of distances between a regression and its surrounding releases. It will also perform the bisect algorithm on the regressions and record how many steps were required to discover the regression.

The input directory should be the output directory from step 1:

    ./results/{language}

The output file might then be something like:

    ./stats/{language}/{measure}.json
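To make "how many steps were required" concrete, here is a minimal sketch of counting the probes a plain binary search needs to find the first bad commit. It is not the repository's implementation: the function name `bisect_steps`, the integer commit indices, and the assumption of a linear history are all hypothetical.

```python
def bisect_steps(num_commits: int, first_bad: int) -> int:
    """Count the probes a plain bisect needs to locate first_bad.

    Commits are indexed 0..num_commits-1 in history order. Commit 0 is
    assumed good (the prior release) and the last commit is assumed bad
    (the point where the regression had been noticed).
    """
    if not 0 < first_bad < num_commits:
        raise ValueError("first_bad must lie after commit 0 and inside the range")
    good, bad = 0, num_commits - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if mid >= first_bad:
            bad = mid   # the probed commit already shows the regression
        else:
            good = mid  # the probed commit is still good
    return steps
```

With nine commits, both `bisect_steps(9, 1)` and `bisect_steps(9, 8)` take three probes, because an unweighted search always halves the range whatever the position of the regression.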

3. Perform regression tests

The regression tests will fit three different curve types to the histogram of distances generated by the previous step.

```
Syntax: regression-nfold.py

Perform n-fold cross-validation regressions.
  : a preprocessed stats file in json format.

Example usage:
./regression-nfold.py stats/c/commits.json
./regression-nfold.py stats/c/lines.json
./regression-nfold.py stats/c/blocks.json
```

The input file should be an output file from step 2:

    ./stats/{language}/{measure}.json

The coefficients from the various curves will be printed to stdout.
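For readers unfamiliar with n-fold cross-validation, the sketch below shows one common way to split data into folds so that every item is held out exactly once. How regression-nfold.py actually partitions its data is not stated here; `nfold_splits` and its interleaved assignment of items to folds are illustrative assumptions.

```python
def nfold_splits(items, n):
    """Yield (train, test) pairs for n-fold cross-validation.

    Item i is assigned to fold i % n, so each item appears in exactly one
    test set and in the training set of every other fold.
    """
    if not 1 < n <= len(items):
        raise ValueError("n must be between 2 and the number of items")
    for k in range(n):
        test = [x for i, x in enumerate(items) if i % n == k]
        train = [x for i, x in enumerate(items) if i % n != k]
        yield train, test
```

A curve would then be fitted to each `train` set and scored against the matching `test` set, and the scores averaged across the folds.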

4. Apply the weighted bisect algorithm

This will apply the bisect algorithm to the same results, but this time using a distance metric weighted based on the curves generated in step 3.

```
Syntax: analyse-bisect [a1 a2 a3]

Reads in details of commits for different projects and applies the
bisect algorithm to them.
  : a directory containing json commit files to perform the bisect algorithm on.
  : a file to output the results to in json format.
  : one of "commits", "lines" or "blocks".
  [a1 a2 a3] : optional exponential polynomial coefficients $e^{a3 x^2 + a2 x + a1}$.
```

Note that this uses the same script as in step 2, but this time we supply the coefficients of the curves.

Example usage:

    ./analyse-bisect.py results/c/ stats/c/commits-weighted.json commits \
      10.62988956267997 -95.85165636436697 91.51799274623352
    ./analyse-bisect.py results/c/ stats/c/lines-weighted.json lines \
      11.080415518301999 -124.71439313216983 120.23301484877652
    ./analyse-bisect.py results/c/ stats/c/blocks-weighted.json blocks \
      10.872833015656356 -110.3757504329785 105.9173162006048

This step runs the bisect algorithm on the json data that github-find wrote to the results/{language} directory. By convention the output is then stored in a file such as ./stats/{language}/{filename}.json

The coefficients added as command line parameters are most likely to be those output by the regression tests performed in step 3.
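One way to picture a weighted bisect is to probe at the weighted median of the remaining candidates rather than the midpoint. The sketch below does this with the weight $e^{a3 x^2 + a2 x + a1}$. It rests on assumptions not confirmed by this document: that `x` is the commit's position normalised to the range 0 to 1, that history is linear, and that the split rule is the weighted median. The function names are hypothetical.

```python
import math

def weight(x, a1, a2, a3):
    """Exponential polynomial e^(a3*x^2 + a2*x + a1)."""
    return math.exp(a3 * x * x + a2 * x + a1)

def weighted_bisect_steps(num_commits, first_bad, coeffs):
    """Count probes when each probe splits the remaining weight in half.

    Commit 0 is assumed good and the last commit bad. coeffs is the tuple
    (a1, a2, a3); x is ASSUMED to be the normalised position i / (n - 1).
    """
    if not 0 < first_bad < num_commits:
        raise ValueError("first_bad must lie after commit 0 and inside the range")
    span = num_commits - 1
    # w[i] is the weight of commit i being the first bad one.
    w = [0.0] + [weight(i / span, *coeffs) for i in range(1, num_commits)]
    good, bad = 0, span
    steps = 0
    while bad - good > 1:
        half = sum(w[good + 1:bad + 1]) / 2
        acc = 0.0
        mid = bad - 1  # fallback when the weight sits almost entirely on `bad`
        for i in range(good + 1, bad):
            acc += w[i]
            if acc >= half:
                mid = i
                break
        steps += 1
        if mid >= first_bad:
            bad = mid
        else:
            good = mid
    return steps
```

With coefficients `(0, 0, 0)` every weight is 1 and the search behaves like a plain bisect. With weights concentrated near the start of the range, a regression at the start is found in a single probe, while one near the end takes more probes than a plain bisect would.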

5. Perform uniformity tests

The uniformity tests give an indication of whether the distribution of regressions is uniform.

```
Syntax: statistics [bucket-min]

Generate statistics for all json analysis files (generated by analysis.py) in a given directory.
  : the directory to read analysis json files from.
  [bucket-min] : minimum number of buckets to allow; defaults to 3.
```

This performs tests on the results from the bisect analysis and on the regression data. For example, it normalises the regression distance between the discovery and fix points and calculates the mean and standard deviation of the results.

It also performs a chi-squared test to determine whether the distribution of normalised fix distances is uniform.
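As an illustration of the chi-squared uniformity check, the following sketch buckets normalised distances in the range 0 to 1 and compares the counts with the equal counts a uniform distribution would give. It is a generic textbook computation, not the statistics script itself; the function name and the `buckets` parameter are hypothetical.

```python
def chi_squared_uniform(values, buckets=3):
    """Return (statistic, counts) for a chi-squared test against uniformity.

    values must lie in [0, 1]. Under a uniform distribution each bucket is
    expected to hold len(values) / buckets items.
    """
    if not values:
        raise ValueError("at least one value is required")
    counts = [0] * buckets
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError("values must be normalised to [0, 1]")
        # A value of exactly 1.0 belongs to the last bucket.
        counts[min(int(v * buckets), buckets - 1)] += 1
    expected = len(values) / buckets
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, counts
```

The statistic is compared against a chi-squared distribution with `buckets - 1` degrees of freedom; a large value suggests the distances are not uniformly distributed.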

./statistics stats/c 100

6. Generate graphs

This step generates a number of graphs, as used in the write-up.

```
Syntax: graphs [count-start]

Plots a histogram of reverts against distance.
  : a directory containing preprocessed stats files commits.json, lines.json and blocks.json.
  : directory to save the output PNGs to, in the form graph000.png, graph001.png, ...
  [count-start] : value to start the output filename enumeration from; defaults to 0.
```

The graphs generated will have filenames graph000.png, graph001.png with the filename number incrementing. The count-start parameter can be used to start the filename number at a number higher than zero.
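The numbering scheme can be sketched in a couple of lines. The helper name `graph_filenames` is hypothetical, and the three-digit zero padding is inferred from the example filenames above.

```python
def graph_filenames(count, count_start=0):
    """Return `count` filenames numbered upwards from count_start."""
    return [f"graph{n:03d}.png" for n in range(count_start, count_start + count)]
```

Starting a second language at a higher count, for instance `graph_filenames(2, 10)`, avoids overwriting graphs that an earlier run saved to the same output directory.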

Example usage:

    ./graphs.py stats/c/ writeup/figures/
    ./graphs.py stats/javascript/ writeup/figures/ 10

7. Perform significance tests

The significance test takes the results from applying two different bisect algorithm types and compares them.

```
Syntax: significance-test [-q]

Perform a Wilcoxon Signed Rank significance test on the two datasets.
  [-q] : only output the results
  : a json file containing regressions and bisect steps using the standard distance metric.
  : a json file containing regressions and bisect steps using a weighted distance metric.
```

This will perform a Wilcoxon Signed Rank significance test on the two sets of data: first a two-sided test, then a one-sided test.

The output provides more detail about what the results mean.

The two input files should be different sets of results output by the analyse-bisect.py script in step 2 and step 4. Since this is a paired test, the files must both be for the same language (i.e. files generated from the same raw data).

Example usage:

    ./significance-tests.py stats/c/commits.json stats/c/commits-weighted.json
    ./significance-tests.py stats/c/commits.json stats/c/lines-weighted.json
    ./significance-tests.py stats/c/commits.json stats/c/blocks-weighted.json

The results will be in the form of a Z statistic and a p-value, printed to stdout.
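To show where a Z statistic and p-value of this kind come from, here is a minimal sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. It is a textbook version written for illustration and is not taken from significance-tests.py: it drops zero differences, averages tied ranks, and applies no tie or continuity correction.

```python
import math

def wilcoxon_signed_rank(xs, ys):
    """Return (w_plus, z, p_two_sided) for paired samples xs and ys."""
    if len(xs) != len(ys):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all pairs are identical")
    # Rank the absolute differences, averaging the ranks of ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_two_sided
```

Here `xs` and `ys` would be the step counts for the same regressions under the standard and weighted metrics. The normal approximation is only reasonable for larger samples, so small inputs are useful for checking the arithmetic rather than for drawing conclusions.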

Process summary

The run-regression.sh script performs the full process, except for step 1. Running the script may take a lot of time and generate a lot of (mostly uninteresting) output.

The following summarises the process, using C as an example:

    ./github-find.py forks c
    ./analyse-bisect.py results/c/ stats/c/commits.json commits
    ./regression-nfold.py stats/c/commits.json
    ./analyse-bisect.py results/c/ stats/c/commits-weighted.json commits \
      10.62988956267997 -95.85165636436697 91.51799274623352
    ./graphs.py stats/c/ writeup/figures/
    ./significance-tests.py -q stats/c/commits.json stats/c/commits-weighted.json

Compile the LaTeX source

The Makefile in the writeup folder is set up to automatically build a pdf file from the LaTeX source.

    cd writeup
    make
    xdg-open bisect.pdf

Notes

The following notes may be helpful in understanding how the results were generated. In many cases the work the Python scripts do for a repository can be performed at the console instead.

These steps are for illustrative purposes. They're not used in this form in the actual analysis.

Find reverts

git log --format="%H" --grep "revert"

Show commit for tag

git show --format="%H" 1.8.2 | head -1

git rev-list -n 1 1.8.2

List latest tag before

git tag --no-contains dcda2c6 --sort=committerdate | tail -1

git describe --tag dcda2c6

List earliest tag after

git tag --contains dcda2c6 --sort=committerdate | head -1

git describe --tag --contains dcda2c6

Earliest commit

git rev-list --max-parents=0 HEAD

Latest commit

```

HEAD

```

Commit revert refers to

r=`git log --format="%b" -1 $i | grep -o -E "[0-9a-f]{7}|[0-9a-f]{40}"`
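The same extraction can be sketched in Python. Python's `re` tries alternatives in order instead of preferring the longest match as `grep -E` does, so the 40-character alternative has to come first. The word boundaries are an added assumption that makes this stricter than the grep pattern, and the function name is hypothetical.

```python
import re

# 40-character hashes are tried first so that a full hash is not cut to 7.
HASH_RE = re.compile(r"\b(?:[0-9a-f]{40}|[0-9a-f]{7})\b")

def reverted_hashes(body):
    """Return the commit hashes mentioned in a revert commit's message body."""
    return HASH_RE.findall(body)
```

For a body such as "This reverts commit dcda2c6." this returns `['dcda2c6']`. Any seven-letter word made only of hexadecimal characters would also match, so results may need checking against the repository.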

Format

    for i in `git log --format="%H" --grep "revert"`; do
      r=`git log --format="%b" -1 $i | grep -o -E "[0-9a-f]{7}|[0-9a-f]{40}"`
      if [ -n "$r" ]; then
        before=`git describe --tag --first-parent --abbrev=0 $r 2>/dev/null | cut -f1 -d"~"`
        after=`git describe --tag --first-parent --abbrev=0 --contains $r 2>/dev/null | cut -f1 -d"~"`
        below=`[ -z "$before" ] && echo \`git rev-list --max-parents=0 HEAD\` || echo $before`
        above=`[ -z "$after" ] && echo "HEAD" || echo $after`
        echo `git rev-list -n 1 $below` $r $i `git rev-list -n 1 $above`
      fi
    done

Owner

Name: David Llewellyn-Jones
Login: llewelld
Kind: user
Location: Cambridge, UK
Company: The Alan Turing Institute

Website: https://www.flypig.co.uk
Repositories: 163
Profile: https://github.com/llewelld

Research Data Scientist at the Alan Turing Institute. Occasionally craves adventure and a good thunderstorm.

Citation (CITATION.cff)

cff-version: 1.2.0
title: Bisect analysis tools
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: David
    family-names: Llewellyn-Jones
    email: david@flypig.co.uk
    affiliation: Jolla Oy
    orcid: 'https://orcid.org/0000-0002-3836-7903'
  - given-names: Franz-Josef
    family-names: Haider
    affiliation: Jolla Oy
    orcid: 'https://orcid.org/0000-0002-9274-3201'

Committers

Last synced: about 1 year ago

All Time

Total Commits: 61
Total Committers: 2
Avg Commits per committer: 30.5
Development Distribution Score (DDS): 0.016

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
David Llewellyn-Jones	d**d@f**k	60
Frajo Haider	f**r@g**t	1

Committer Domains (Top 20 + Academic)

gmx.at: 1 flypig.co.uk: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 2 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
