regressors-regressions-dataset
Dataset of bug-introducing and bug-fixing commit sets from Mozilla's Bugzilla
Basic Info
Statistics
- Stars: 7
- Watchers: 3
- Forks: 3
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
The dataset
This dataset was published in "SZZ in the time of Pull Requests". It is a collection of 11991 links between bug-introducing and bug-fixing commit sets, extracted from Mozilla's Bugzilla (https://bugzilla.mozilla.org).
In addition, the dataset contains 8906 links for which no fix has been found yet (the bug-introducing commit set is known, but the bug-fixing commit set does not exist yet).
The dataset was generated with bugbug, Mozilla's platform for Machine Learning projects on Software Engineering.
Bug IDs refer to Bugzilla bug reports, e.g. 1856572 is https://bugzilla.mozilla.org/show_bug.cgi?id=1856572.
Mercurial hashes refer to commits in the mozilla-central repository, e.g. 3c1db459589a845238abc0359c581fb436a9458f is https://hg.mozilla.org/mozilla-central/rev/3c1db459589a845238abc0359c581fb436a9458f.
Git hashes refer to commits in a clone of the mozilla-central repository made with git-cinnabar. To create such a clone, follow the git-cinnabar guide for Mozilla and run the following command: git clone hg::https://hg.mozilla.org/mozilla-central.
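The URL patterns above can be turned into small helpers for looking up a bug or a Mercurial commit from the identifiers in the dataset. This is a minimal sketch: the function names `bug_url` and `hg_commit_url` are hypothetical and not part of the repository, and the URL templates are copied from the examples given above.

```python
# URL templates taken from the examples in this README.
BUGZILLA_URL = "https://bugzilla.mozilla.org/show_bug.cgi?id={bug_id}"
HG_REV_URL = "https://hg.mozilla.org/mozilla-central/rev/{rev}"


def bug_url(bug_id: int) -> str:
    """Return the Bugzilla page for a bug ID (hypothetical helper)."""
    return BUGZILLA_URL.format(bug_id=bug_id)


def hg_commit_url(rev: str) -> str:
    """Return the mozilla-central page for a Mercurial hash (hypothetical helper)."""
    return HG_REV_URL.format(rev=rev)


print(bug_url(1856572))
print(hg_commit_url("3c1db459589a845238abc0359c581fb436a9458f"))
```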
Terminology
- Bug-introducing commits: a change or set of changes in one or multiple commits, which introduced a bug in the software.
- Bug-fixing commits: a change or set of changes in one or multiple commits, which fixed a bug in the software.
- Regressor: a bug whose bug-fixing commits have introduced other bugs.
- Regression: a bug that has been introduced by other bugs (i.e. by their bug-fixing commits).
- Several bug fields in Bugzilla correspond to the above terms, such as Regressed by or Regressions. Please refer to the description of the fields in the Mozilla Wiki.
Note: a bug-fixing change can also be a bug-introducing change, and vice versa. Sometimes developers fix bugs and introduce new ones in the process. Also, a bug may have only one of the two, either the fix commits or the bug-introducing commits, because establishing a link between them is not always easy.
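To make the cases in the note concrete, a link can be modeled as two commit lists, either of which may be empty. The following is an illustrative sketch only: the `Link` class, the `classify` function, and the category labels are hypothetical and do not come from the repository or from bugbug, and the hashes in the usage lines are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Hypothetical model of one link between two commit sets."""
    fix_commits: list[str] = field(default_factory=list)
    bug_commits: list[str] = field(default_factory=list)


def classify(link: Link) -> str:
    """Say which side(s) of the link are known."""
    if link.fix_commits and link.bug_commits:
        return "both known"
    if link.bug_commits:
        return "not fixed yet"
    if link.fix_commits:
        return "no bug-introducing commit"
    return "no commits"


# Placeholder hashes, for illustration only.
print(classify(Link(fix_commits=["aaa"], bug_commits=["bbb"])))
print(classify(Link(bug_commits=["bbb"])))
```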
Example usage of the dataset
Run the example.py script with Python to see some high-level statistics about the dataset:
python example.py
The output is:
Total number of pairs: 20897
Total number of pairs where both bug-introducing and bug-fix are known: 11991
Number of pairs with no shared files: 3126
Number of pairs where the bug-fix only contains new lines: 1869
Number of pairs where the bug-introducing only contains removed lines: 998
Number of pairs where the bug-introducing is not linked to any commit: 880
Number of bugs which are not fixed yet and where the cause has been identified: 8906
Deciles for the number of commits associated to bug fixes:
[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0]
Deciles for the number of commits associated to bug introducing:
[1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 12.0]
Note: The pairs mentioned in the output are pairs of bug-introducing and bug-fixing commit sets. The number of pairs with no shared files is reported because identifying shared files (files modified by both commit sets) is one way to link a bug-introducing commit set to a bug-fixing commit set.
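Each decile line in the output lists nine values because splitting a distribution into ten equal parts needs nine cut points. The sketch below shows one way to compute such cut points with the standard library. It is not necessarily how example.py computes them, and the `commit_counts` values are invented for illustration.

```python
import statistics

# Invented per-pair commit counts, for illustration only.
commit_counts = [1, 1, 1, 2, 2, 3, 4, 6, 12, 20]

# n=10 yields the 9 cut points that separate the data into 10 groups.
deciles = statistics.quantiles(commit_counts, n=10, method="inclusive")

print(len(deciles))
print(deciles)
```

Different interpolation methods give slightly different cut points, so results from another library may not match exactly.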
Format
In this repository, you will find the dataset in two alternative formats:
- A CSV file containing all the information related to each issue
- A JSON file in a format compatible with PySZZ
In the CSV file, each record represents a link between a bug-introducing and a bug-fixing commit set, and it contains the following columns:
- FIX_ID: the ID of the regression bug (bug linked to the bug-fixing commits) on Bugzilla
- FIX_COMMITS_MERCURIAL: the list of bug-fixing commit hashes from the original Mozilla mozilla-central repository hosted in Mercurial
- FIX_COMMITS_GIT: the list of bug-fixing commit hashes on the mozilla-central mirror git repository (see https://github.com/glandium/git-cinnabar)
- BUG_IDS: the IDs of the regressor bugs (bugs linked to the bug-introducing commits) on Bugzilla
- BUG_COMMITS_MERCURIAL: the list of bug-introducing commit hashes from the original mozilla-central repository hosted in Mercurial
- BUG_COMMITS_GIT: the list of fix-inducing commit hashes on the mozilla-central mirror git repository
- NO_FILE_SHARED: a boolean value. TRUE if no file is shared between the bug-fixing and bug-introducing commit sets (Extrinsic Bug/Ghost Commits), FALSE otherwise.
- NEW_LINES_ONLY_FIX: a boolean value. TRUE if there are only added lines in the bug-fixing commit set (Ghost Commits), FALSE otherwise.
- REMOVE_LINES_ONLY_BUG: a boolean value. TRUE if there are only removed lines in the bug-introducing commit set (Ghost Commits), FALSE otherwise.
- NO_BUG: a boolean value. TRUE if there is no commit linked to the regressor bug (Extrinsic Bug), FALSE otherwise.
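The boolean columns can be tallied with the standard csv module. The sketch below is only an illustration: it reads a tiny in-memory sample with invented rows instead of the real CSV file (whose file name is not stated here), it assumes the flags are stored as the literal strings TRUE and FALSE, and `count_flags` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows using the documented column names (not real data).
SAMPLE = """FIX_ID,NO_FILE_SHARED,NEW_LINES_ONLY_FIX,REMOVE_LINES_ONLY_BUG,NO_BUG
1,TRUE,FALSE,FALSE,FALSE
2,FALSE,TRUE,FALSE,FALSE
3,TRUE,FALSE,FALSE,TRUE
"""

FLAG_COLUMNS = [
    "NO_FILE_SHARED",
    "NEW_LINES_ONLY_FIX",
    "REMOVE_LINES_ONLY_BUG",
    "NO_BUG",
]


def count_flags(rows):
    """Count, per flag column, the rows whose value is TRUE."""
    counts = dict.fromkeys(FLAG_COLUMNS, 0)
    for row in rows:
        for col in FLAG_COLUMNS:
            # Assumption: flags are encoded as the strings TRUE / FALSE.
            if row[col].strip().upper() == "TRUE":
                counts[col] += 1
    return counts


counts = count_flags(csv.DictReader(io.StringIO(SAMPLE)))
print(counts)
```

To run this against the real dataset, replace the `io.StringIO(SAMPLE)` argument with an open file handle for the CSV file in the repository.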
Note: For more details about ghost commits, extrinsic bugs, identifying bug-introducing and bug-fixing commits, and establishing links between these commit sets, please refer to "SZZ in the time of Pull Requests".
References
Please cite these works if you use the data in this repo.
@misc{petrulio2022szz,
title={SZZ in the time of Pull Requests},
author={Fernando Petrulio and David Ackermann and Enrico Fregnan and Gül Calikli and Marco Castelluccio and Sylvestre Ledru and Calixte Denizet and Emma Humphries and Alberto Bacchelli},
year={2022},
eprint={2209.03311},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
@software{castelluccio_bugbug,
author = {Castelluccio, Marco},
title = {mozilla/bugbug},
month = jan,
year = 2024,
publisher = {Zenodo},
version = {v0.0.533},
doi = {10.5281/zenodo.4911345},
url = {https://github.com/mozilla/bugbug},
license = {MPL-2.0}
}
Owner
- Name: Mozilla
- Login: mozilla
- Kind: organization
- Location: Mountain View, California
- Website: https://wiki.mozilla.org/Github
- Repositories: 2,423
- Profile: https://github.com/mozilla
This technology could fall into the right hands.
Citation (CITATION.cff)
cff-version: 1.2.0
title: SZZ in the time of Pull Requests
message: "If you use this dataset, please cite it using the metadata from this file."
type: dataset
doi: 10.48550/arXiv.2209.03311
date-released: 2022-09-07
repository-code: https://github.com/mozilla/regressors-regressions-dataset
license: MPL-2.0
authors:
- given-names: Fernando
family-names: Petrulio
email: fpetrulio@ifi.uzh.ch
affiliation: University of Zurich
- given-names: David
family-names: Ackermann
email: david.ackermann@uzh.ch
affiliation: University of Zurich
- given-names: Enrico
family-names: Fregnan
email: fregnan@ifi.uzh.ch
affiliation: University of Zurich
- given-names: Gül
family-names: Çalikli
affiliation: University of Glasgow
email: handangul.calikli@glasgow.ac.uk
- given-names: Marco
family-names: Castelluccio
email: mcastelluccio@mozilla.com
affiliation: Mozilla
orcid: https://orcid.org/0000-0002-3285-5121
- given-names: Sylvestre
family-names: Ledru
email: sledru@mozilla.com
affiliation: Mozilla
- given-names: Calixte
family-names: Denizet
email: cdenizet@mozilla.com
affiliation: Mozilla
- given-names: Emma
family-names: Humphries
email: me@emmah.net
affiliation: Mozilla
- given-names: Alberto
family-names: Bacchelli
email: bacchelli@ifi.uzh.ch
affiliation: University of Zurich
GitHub Events
Total
- Issues event: 1
Last Year
- Issues event: 1
Committers
Last synced: 11 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Marco Castelluccio | m****o@m****m | 18 |
| Pooja Ruhal | p****5@g****m | 1 |
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 9 days
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- marco-c (1)
Pull Request Authors
- marco-c (3)
Dependencies
- PyDriller 1.12
- bugbug 0.0.482
- libmozdata 0.1.82
- python >=3.9,<3.12
- tqdm ^4.64.1