Repository
How are programming errors distributed
Basic Info
- Host: GitHub
- Owner: llewelld
- License: mit
- Language: TeX
- Default Branch: master
- Size: 5.58 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
bisect
How are programming errors distributed?
This repository contains scripts used for analysing repositories to determine how regressions are distributed between the release directly prior to the commit and the commit that fixes them.
It also contains the write-up of the analysis and results.
Useful links
- Scripts and LaTeX source: https://github.com/llewelld/bisect
- Dataset: https://osf.io/wxyaj/
Process
1. Obtain the dataset
You can either download the same dataset used for the write-up from the Open Science Framework or generate your own dataset from GitHub. The former requires downloading a single 1.1 GB archive; the latter collects the data using the GitHub API. Downloading the archive will be a lot quicker.
Follow 1a or 1b below depending on which route you prefer to take.
1a. Download existing dataset
The dataset is publicly available from osf.io. You can browse it at https://osf.io/wxyaj/.
The following command can be used to download the full dataset. The archive is 1.1 GB in size (4.3 GB when decompressed), so it may take some time to download on a slower connection.
```
cd bisect
curl -X "GET" \
  'https://files.de-1.osf.io/v1/resources/wxyaj/providers/osfstorage/?zip=' \
  --output results.zip
```
You should then unzip the result into the root of the project folder.
unzip results.zip
1b. Collect data from GitHub
This will collect data about projects on GitHub.
```
Syntax: github-find <forks|updated> <language>
Searches GitHub for repositories containing the requested language,
bare clones them and analyses the history and diffs to discover
regressions and their fixes.
<forks|updated> : order based on number of forks or most recent update time.
<language> : the language to search for.
```
For info about the ordering, see: https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories
This command will output the following files:
./results/index.json
./results/{language}/data{xxxxx}.json
Example usage:
./github-find.py forks c
2. Analyse the data
This will analyse the data collected from projects on GitHub.
```
Syntax: analyse-bisect
Reads in details of commits for different projects and applies the
bisect algorithm to them.
Example usage:
mkdir -p stats/c
./analyse-bisect.py results/c/ stats/c/commits.json commits
./analyse-bisect.py results/c/ stats/c/lines.json lines
./analyse-bisect.py results/c/ stats/c/blocks.json blocks
```
This will analyse the commit data extracted from GitHub and output a set of distances between each regression and its surrounding releases. It will also run the bisect algorithm on the regressions and record how many steps were required to discover each one.
The input directory should be the output directory from step 1:
./results/{language}
The output file might then be something like:
./stats/{language}/{measure}.json
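As a rough illustration of the step-counting idea, the sketch below runs a plain bisect over a linear list of commits and counts how many tests it takes to locate the first bad commit. The function name and the list-index representation of commits are assumptions made for this example; analyse-bisect.py may measure distances and count steps differently.

```python
def bisect_steps(num_commits: int, first_bad: int) -> int:
    """Count the tests a plain bisect needs to find the first bad commit.

    Commits are indexed 0..num_commits-1 in history order. Index -1 (the
    release before the range) is assumed good; the last commit is assumed bad.
    """
    if not 0 <= first_bad < num_commits:
        raise ValueError("first_bad must index a commit in the range")
    good, bad = -1, num_commits - 1   # known-good and known-bad bounds
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2       # test the midpoint commit
        steps += 1
        if mid >= first_bad:          # midpoint already contains the regression
            bad = mid
        else:
            good = mid
    return steps
```

For example, `bisect_steps(8, 5)` tests commits 3, 5 and 4 in turn before the bounds close around commit 5.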
3. Perform regression tests
The regression tests will fit three different curve types to the histogram of distances generated by the previous step.
```
Syntax: regression-nfold.py
Perform n-fold cross-validation regressions.
Example usage:
./regression-nfold.py stats/c/commits.json
./regression-nfold.py stats/c/lines.json
./regression-nfold.py stats/c/blocks.json
```
The input file should be the output file from step 2:
./stats/{language}/{measure}.json
The coefficients from the various curves will be printed to stdout.
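The README does not say which three curve types are fitted, so the following is only a minimal sketch of n-fold cross-validation in general. It uses a straight line as a stand-in curve, and the names `fit_line` and `nfold_mse` are hypothetical rather than taken from regression-nfold.py.

```python
import random

def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def nfold_mse(points, folds=5, seed=0):
    """Mean squared error on held-out folds, averaged over all folds."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    errors = []
    for k in range(folds):
        # Hold out every folds-th point, train on the rest.
        test = shuffled[k::folds]
        train = [p for i, p in enumerate(shuffled) if i % folds != k]
        a, b = fit_line(train)
        errors.append(sum((y - (a * x + b)) ** 2 for x, y in test) / len(test))
    return sum(errors) / folds
```

Comparing the averaged held-out error across candidate curve types gives a way to choose between them that is less prone to overfitting than comparing the fit on all of the data.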
4. Apply the weighted bisect algorithm
This will apply the bisect algorithm to the same results, but this time using a distance metric weighted based on the curves generated in step 3.
Note that this uses the same script as in step 2, but this time we supply the coefficients of the curves.
```
Syntax: analyse-bisect
Reads in details of commits for different projects and applies the
bisect algorithm to them.
Example usage:
./analyse-bisect.py results/c/ stats/c/commits-weighted.json commits \
  10.62988956267997 -95.85165636436697 91.51799274623352
./analyse-bisect.py results/c/ stats/c/lines-weighted.json lines \
  11.080415518301999 -124.71439313216983 120.23301484877652
./analyse-bisect.py results/c/ stats/c/blocks-weighted.json blocks \
  10.872833015656356 -110.3757504329785 105.9173162006048
```
This process performs the bisect algorithm on the data output in JSON format by github-find to the results/{language} directory. By convention, the output is then stored in a file ./stats/{language}/{filename}.json.
The coefficients supplied as command-line parameters will usually be those output by the regression tests performed in step 3.
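One way to picture a weighted bisect is to split the remaining range where the accumulated weight is balanced, rather than at the midpoint by count. The sketch below assumes a hypothetical list `weights`, where `weights[i]` is the relative likelihood that commit `i` introduced the regression. How analyse-bisect.py actually derives its weighting from the curve coefficients is not shown in this README.

```python
def weighted_bisect_steps(weights, first_bad):
    """Count tests when each split balances remaining weight rather than count.

    weights[i] is the (hypothetical) relative likelihood that commit i
    introduced the regression; commit -1 is assumed good, the last one bad.
    """
    good, bad = -1, len(weights) - 1
    steps = 0
    while bad - good > 1:
        # Total weight of the commits that could still be the first bad one.
        half = sum(weights[i] for i in range(good + 1, bad + 1)) / 2
        running = 0.0
        mid = bad - 1                          # fallback: test just below bad
        for i in range(good + 1, bad):         # commits that can still be tested
            running += weights[i]
            if running >= half:
                mid = i
                break
        steps += 1
        if mid >= first_bad:
            bad = mid
        else:
            good = mid
    return steps
```

With uniform weights this behaves like a plain bisect. When most of the weight sits near one end of the range, the first test lands close to that end, so regressions located there are found in fewer steps.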
5. Perform uniformity tests
The uniformity tests give an indication of whether the distribution of regressions is uniform.
```
Syntax: statistics
Generate statistics for all json analysis files (generated by
analyse-bisect.py) in a given directory.
Example usage:
./statistics stats/c 100
```
This performs tests on the results from the bisect analysis and the regression data. For example, it will normalise the regression distance between the discovery and fix points and calculate the mean and standard deviation of the results.
It also performs a chi-squared test to determine whether the distribution of normalised fix distances is uniform.
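The normalisation and uniformity check can be sketched as follows, assuming distances are normalised onto the range 0 to 1 and then counted into equal-width bins. The function names and the default bin count are assumptions for illustration, not the interface of the statistics script.

```python
def normalise(start, position, end):
    """Map a position between two bounds onto the range 0..1."""
    return (position - start) / (end - start)

def chi_squared_uniform(values, bins=10):
    """Chi-squared statistic of values in [0, 1] against a uniform histogram."""
    counts = [0] * bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    expected = len(values) / bins
    return sum((c - expected) ** 2 / expected for c in counts)
```

A statistic near zero is consistent with a uniform distribution. A large value, compared against the chi-squared distribution with `bins - 1` degrees of freedom, suggests the regressions are not evenly spread.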
6. Generate graphs
This step generates a number of graphs, as used in the write-up.
```
Syntax: graphs
Plots a histogram of reverts against distance.
The graphs generated will have filenames graph000.png, graph001.png with
the filename number incrementing. The count-start parameter can be used to
start the filename number at a number higher than zero.
Example usage:
./graphs.py stats/c/ writeup/figures/
./graphs.py stats/javascript/ writeup/figures/ 10
```
7. Perform significance tests
The significance test takes the results from applying two different bisect algorithm types and compares them.
```
Syntax: significance-test [-q]
Perform a Wilcoxon Signed Rank significance test on the two datasets.
[-q] : only output the results
The two input files should be different sets of results output by the
analyse-bisect.py script in steps 2 and 4. Since this is a paired test
the files must both be for the same language (i.e. files generated from the
same raw data).
Example usage:
./significance-tests.py stats/c/commits.json stats/c/commits-weighted.json
./significance-tests.py stats/c/commits.json stats/c/lines-weighted.json
./significance-tests.py stats/c/commits.json stats/c/blocks-weighted.json
```
This will perform a Wilcoxon Signed Rank significance test on the two sets of data: first a two-sided test, then a one-sided test.
The output provides more detail about what the results mean.
The results will be in the form of a Z statistic and a p-value, printed to stdout.
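For readers unfamiliar with the test, here is a minimal standard-library sketch of how a Z statistic and a two-sided p-value can be obtained from paired samples using the normal approximation. It is not the code used by significance-tests.py, and it omits the tie and continuity corrections that statistics libraries usually apply.

```python
import math

def wilcoxon_z(before, after):
    """Z statistic and two-sided p-value for paired samples (normal approximation).

    Zero differences are dropped and tied magnitudes share their average rank;
    no tie or continuity correction is applied, so treat the result as rough.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all pairs are identical")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of differences with the same magnitude.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1      # average 1-based rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))           # two-sided tail probability
    return z, p
```

Because the test is paired, `before` and `after` must list results for the same regressions in the same order, which is why both input files need to come from the same raw data.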
Process summary
The run-regression.sh script performs the full process, except for step 1.
Running the script may take a lot of time and generate a lot of (mostly
uninteresting) output.
The following summarises the process, using C as an example:
./github-find.py forks c
./analyse-bisect.py results/c/ stats/c/commits.json commits
./regression-nfold.py stats/c/commits.json
./analyse-bisect.py results/c/ stats/c/commits-weighted.json commits \
10.62988956267997 -95.85165636436697 91.51799274623352
./graphs.py stats/c/ writeup/figures/
./significance-tests.py -q stats/c/commits.json stats/c/commits-weighted.json
Compile the LaTeX source
The Makefile in the writeup folder is set up to automatically build a pdf
file from the LaTeX source.
cd writeup
make
xdg-open bisect.pdf
Notes
The following notes may be helpful in understanding how the results were generated. In many cases, the operations the Python scripts perform on a repository can be carried out at the console instead.
These steps are for illustrative purposes. They're not used in this form in the actual analysis.
Find reverts
git log --format="%H" --grep "revert"
Show commit for tag
git show --format="%H" 1.8.2 | head -1
git rev-list -n 1 1.8.2
List latest tag before
git tag --no-contains dcda2c6 --sort=committerdate | tail -1
git describe --tag dcda2c6
List earliest tag after
git tag --contains dcda2c6 --sort=committerdate | head -1
git describe --tag --contains dcda2c6
Earliest commit
git rev-list --max-parents=0 HEAD
Latest commit
```
HEAD
```
Commit revert refers to
r=`git log --format="%b" -1 $i | grep -o -E "[0-9a-f]{7}|[0-9a-f]{40}"`
Format
for i in `git log --format="%H" --grep "revert"`; do
  r=`git log --format="%b" -1 $i | grep -o -E "[0-9a-f]{7}|[0-9a-f]{40}"`
  if [ -n "$r" ]; then
    before=`git describe --tag --first-parent --abbrev=0 $r 2>/dev/null | cut -f1 -d"~"`
    after=`git describe --tag --first-parent --abbrev=0 --contains $r 2>/dev/null | cut -f1 -d"~"`
    below=`[ -z "$before" ] && echo \`git rev-list --max-parents=0 HEAD\` || echo $before`
    above=`[ -z "$after" ] && echo "HEAD" || echo $after`
    echo `git rev-list -n 1 $below` $r $i `git rev-list -n 1 $above`
  fi
done
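The hash-extraction step in the loop above can also be sketched in Python. This version is an illustration only: unlike the grep pattern, it adds word boundaries so that a 7-character match is not taken from the middle of a longer hex string, and the names `HASH_RE` and `reverted_hashes` are made up for the example.

```python
import re

# A full 40-character hash, or a 7-character abbreviation, as a whole word.
HASH_RE = re.compile(r"\b(?:[0-9a-f]{40}|[0-9a-f]{7})\b")

def reverted_hashes(body: str) -> list[str]:
    """Return the commit hashes mentioned in a revert commit's message body."""
    return HASH_RE.findall(body)
```

Ordinary words made up of seven hex letters would still match, so any hash found this way would need checking against the repository before being trusted.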
Owner
- Name: David Llewellyn-Jones
- Login: llewelld
- Kind: user
- Location: Cambridge, UK
- Company: The Alan Turing Institute
- Website: https://www.flypig.co.uk
- Repositories: 163
- Profile: https://github.com/llewelld
Research Data Scientist at the Alan Turing Institute. Occasionally craves adventure and a good thunderstorm.
Citation (CITATION.cff)
cff-version: 1.2.0
title: Bisect analysis tools
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: David
    family-names: Llewellyn-Jones
    email: david@flypig.co.uk
    affiliation: Jolla Oy
    orcid: 'https://orcid.org/0000-0002-3836-7903'
  - given-names: Franz-Josef
    family-names: Haider
    affiliation: Jolla Oy
    orcid: 'https://orcid.org/0000-0002-9274-3201'
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| David Llewellyn-Jones | d****d@f****k | 60 |
| Frajo Haider | f****r@g****t | 1 |
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 2 days
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- krnlyng (1)
Dependencies
- Pillow ==9.2.0
- contourpy ==1.0.5
- cycler ==0.11.0
- fonttools ==4.38.0
- kiwisolver ==1.4.4
- matplotlib ==3.6.1
- numpy ==1.23.4
- packaging ==21.3
- pyparsing ==3.0.9
- python-dateutil ==2.8.2
- six ==1.16.0