regressors-regressions-dataset

Dataset of bug-introducing and bug-fixing commit sets from Mozilla's Bugzilla

https://github.com/mozilla/regressors-regressions-dataset

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: mozilla
  • License: mpl-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 8.35 MB
Statistics
  • Stars: 7
  • Watchers: 3
  • Forks: 3
  • Open Issues: 1
  • Releases: 0
Created over 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

The dataset

This dataset was published in "SZZ in the time of Pull Requests". It is a collection of 11991 links between bug-introducing and bug-fixing commit sets, extracted from Mozilla's Bugzilla (https://bugzilla.mozilla.org) using bugbug.

The dataset also contains 8906 links for which no fix has been found yet: the bug-introducing commit set is known, but the bug-fixing commit set does not exist yet.

The dataset was generated by using bugbug, Mozilla's platform for Machine Learning projects on Software Engineering.

Bug IDs refer to Bugzilla bug reports, e.g. 1856572 is https://bugzilla.mozilla.org/show_bug.cgi?id=1856572.

Mercurial hashes refer to commits in the mozilla-central repository, e.g. 3c1db459589a845238abc0359c581fb436a9458f is https://hg.mozilla.org/mozilla-central/rev/3c1db459589a845238abc0359c581fb436a9458f.

Git hashes refer to commits in a clone of the mozilla-central repository made with git-cinnabar. To create such a clone, follow the git-cinnabar guide for Mozilla and run the following command:

    git clone hg::https://hg.mozilla.org/mozilla-central
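The URL patterns given above for bug IDs and Mercurial hashes can be captured in two small helpers. This is a minimal sketch: the function names `bug_url` and `hg_rev_url` are hypothetical and not part of the repository; only the URL patterns come from the text above.

```python
# URL patterns taken from the examples in this README.
BUGZILLA_BUG_URL = "https://bugzilla.mozilla.org/show_bug.cgi?id={bug_id}"
MOZILLA_CENTRAL_REV_URL = "https://hg.mozilla.org/mozilla-central/rev/{rev}"


def bug_url(bug_id: int) -> str:
    """Return the Bugzilla page for a bug ID (hypothetical helper)."""
    return BUGZILLA_BUG_URL.format(bug_id=bug_id)


def hg_rev_url(rev: str) -> str:
    """Return the mozilla-central page for a Mercurial hash (hypothetical helper)."""
    return MOZILLA_CENTRAL_REV_URL.format(rev=rev)


print(bug_url(1856572))
print(hg_rev_url("3c1db459589a845238abc0359c581fb436a9458f"))
```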

Terminology

  • Bug-introducing commits: a change or set of changes in one or multiple commits, which introduced a bug in the software.
  • Bug-fixing commits: a change or set of changes in one or multiple commits, which fixed a bug in the software.
  • Regressor: a bug whose bug-fixing commits introduced other bugs.
  • Regression: a bug that was introduced by the bug-fixing commits of other bugs.
  • There are various bug fields in Bugzilla associated with the above terms, such as Regressed by or Regressions. Please refer to the fields in the Mozilla Wiki.

Note: a bug-fixing change can also be a bug-introducing change, and vice versa. Sometimes developers fix bugs and introduce new ones in the process. It is also possible for a bug to have only one of the two, either the fix commit or the bug-introducing commit, as establishing a link between them is not always easy.
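The regressor/regression relationship is the same set of links read in two directions. The sketch below, using made-up bug IDs, takes links listed per regression (a bug plus the bugs that caused it) and inverts them to list regressions per regressor. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical links: (regression bug ID, IDs of the regressor bugs that caused it).
links = [
    (101, [11, 12]),
    (102, [11]),
    (103, []),  # regression whose regressor is not known
]


def regressions_by_regressor(links):
    """Invert regression -> regressors into regressor -> regressions."""
    index = defaultdict(list)
    for regression_id, regressor_ids in links:
        for regressor_id in regressor_ids:
            index[regressor_id].append(regression_id)
    return dict(index)


print(regressions_by_regressor(links))
# {11: [101, 102], 12: [101]}
```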

Example usage of the dataset

Run the example.py script with Python to see some high-level statistics about the dataset:

    python example.py

The output is:

    Total number of pairs: 20897
    Total number of pairs where both bug-introducing and bug-fix are known: 11991
    Number of pairs with no shared files: 3126
    Number of pairs where the bug-fix only contains new lines: 1869
    Number of pairs where the bug-introducing only contains removed lines: 998
    Number of pairs where the bug-introducing is not linked to any commit: 880
    Number of bugs which are not fixed yet and where the cause has been identified: 8906
    Deciles for the number of commits associated to bug fixes: [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0]
    Deciles for the number of commits associated to bug introducing: [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 12.0]

Note: the pairs mentioned in the output are pairs of bug-introducing and bug-fixing commit sets. The number of pairs with no shared files is reported because identifying shared files (files modified by both commit sets) is one way to link bug-introducing and bug-fixing commit sets.
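Each decile list in the output has nine values, the cut points that split the data into ten equal-sized groups. The sketch below shows one way to compute such cut points with the standard library, on made-up commit counts. It is an illustration only: example.py may use a different library or interpolation method, so its exact values could differ.

```python
import statistics

# Consistency check on the totals reported above: pairs with a known fix
# plus pairs without a fix add up to the total number of pairs.
assert 11991 + 8906 == 20897

# Hypothetical number of commits in each bug-introducing commit set.
commit_counts = [1, 1, 1, 2, 2, 3, 4, 6, 12, 20]

# n=10 splits the data into ten groups and returns the nine cut points.
deciles = statistics.quantiles(commit_counts, n=10, method="inclusive")
print(deciles)
```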

Format

In this repository, you will find the dataset in two alternative formats:

  1. A CSV file containing all the information related to each issue
  2. A JSON file in a format compatible with PySZZ

In the CSV file, each record represents a link between a bug-introducing and a bug-fixing commit set, and it contains the following columns:

  • FIX_ID: the ID of the regression bug (bug linked to the bug-fixing commits) on Bugzilla
  • FIX_COMMITS_MERCURIAL: the list of bug-fixing commit hashes from the original Mozilla mozilla-central repository hosted in Mercurial
  • FIX_COMMITS_GIT: the list of bug-fixing commit hashes on the mozilla-central mirror git repository (see https://github.com/glandium/git-cinnabar)
  • BUG_IDS: the IDs of the regressor bugs (bugs linked to the bug-introducing commits) on Bugzilla
  • BUG_COMMITS_MERCURIAL: the list of bug-introducing commit hashes from the original mozilla-central repository hosted in Mercurial
  • BUG_COMMITS_GIT: the list of bug-introducing commit hashes on the mozilla-central mirror git repository
  • NO_FILE_SHARED: a boolean value. TRUE if no file is shared between the bug-fixing and bug-introducing commit sets (Extrinsic Bug/Ghost Commits), FALSE otherwise.
  • NEW_LINES_ONLY_FIX: a boolean value. TRUE if there are only added lines in the bug-fixing commit-set (Ghost Commits), FALSE otherwise.
  • REMOVE_LINES_ONLY_BUG: a boolean value. TRUE if there are only removed lines in the bug-introducing commit-set (Ghost Commits), FALSE otherwise.
  • NO_BUG: a boolean value. TRUE if there is no commit linked to the regressor bug (Extrinsic Bug), FALSE otherwise.

Note: please refer to "SZZ in the time of Pull Requests" for more details about ghost commits, extrinsic bugs, identifying bug-introducing and bug-fixing commits, and establishing links between these commit sets.
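The four boolean columns can be tallied with the standard `csv` module. The sketch below is self-contained and runs on a two-row in-memory sample rather than the real file: the sample rows are made up, the list-valued commit columns are left out because their serialization is not described here, and the comparison assumes the flags are stored as the text TRUE or FALSE (case-insensitive).

```python
import csv
import io

# Hypothetical two-row sample using the documented column names; the real
# file has more columns and its values may be formatted differently.
SAMPLE_CSV = """FIX_ID,BUG_IDS,NO_FILE_SHARED,NEW_LINES_ONLY_FIX,REMOVE_LINES_ONLY_BUG,NO_BUG
1000001,1000000,TRUE,FALSE,FALSE,FALSE
1000003,1000002,FALSE,TRUE,FALSE,FALSE
"""

FLAG_COLUMNS = [
    "NO_FILE_SHARED",
    "NEW_LINES_ONLY_FIX",
    "REMOVE_LINES_ONLY_BUG",
    "NO_BUG",
]


def count_flags(csv_file):
    """Count, per flag column, the records whose value is TRUE."""
    counts = dict.fromkeys(FLAG_COLUMNS, 0)
    for row in csv.DictReader(csv_file):
        for column in FLAG_COLUMNS:
            if row[column].strip().upper() == "TRUE":
                counts[column] += 1
    return counts


print(count_flags(io.StringIO(SAMPLE_CSV)))
# {'NO_FILE_SHARED': 1, 'NEW_LINES_ONLY_FIX': 1, 'REMOVE_LINES_ONLY_BUG': 0, 'NO_BUG': 0}
```

To run this against the real dataset, pass an open file handle for the CSV file in the repository instead of the in-memory sample.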

References

Please cite these works if you use the data in this repo.

@misc{petrulio2022szz,
  title={SZZ in the time of Pull Requests},
  author={Fernando Petrulio and David Ackermann and Enrico Fregnan and Gül Calikli and Marco Castelluccio and Sylvestre Ledru and Calixte Denizet and Emma Humphries and Alberto Bacchelli},
  year={2022},
  eprint={2209.03311},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}

@software{castelluccio_bugbug,
  author = {Castelluccio, Marco},
  title = {mozilla/bugbug},
  month = jan,
  year = 2024,
  publisher = {Zenodo},
  version = {v0.0.533},
  doi = {10.5281/zenodo.4911345},
  url = {https://github.com/mozilla/bugbug},
  license = {MPL-2.0}
}

Owner

  • Name: Mozilla
  • Login: mozilla
  • Kind: organization
  • Location: Mountain View, California

This technology could fall into the right hands.

Citation (CITATION.cff)

cff-version: 1.2.0
title: SZZ in the time of Pull Requests
message: "If you use this dataset, please cite it using the metadata from this file."
type: dataset
doi: 10.48550/arXiv.2209.03311
date-released: 2022-09-07
repository-code: https://github.com/mozilla/regressors-regressions-dataset
license: MPL-2.0
authors:
  - given-names: Fernando
    family-names: Petrulio
    email: fpetrulio@ifi.uzh.ch
    affiliation: University of Zurich
  - given-names: David
    family-names: Ackermann
    email: david.ackermann@uzh.ch
    affiliation: University of Zurich
  - given-names: Enrico
    family-names: Fregnan
    email: fregnan@ifi.uzh.ch
    affiliation: University of Zurich
  - given-names: Gül
    family-names: Çalikli
    affiliation: University of Glasgow
    email: handangul.calikli@glasgow.ac.uk
  - given-names: Marco
    family-names: Castelluccio
    email: mcastelluccio@mozilla.com
    affiliation: Mozilla
    orcid: https://orcid.org/0000-0002-3285-5121
  - given-names: Sylvestre
    family-names: Ledru
    email: sledru@mozilla.com
    affiliation: Mozilla
  - given-names: Calixte
    family-names: Denizet
    email: cdenizet@mozilla.com
    affiliation: Mozilla
  - given-names: Emma
    family-names: Humphries
    email: me@emmah.net
    affiliation: Mozilla
  - given-names: Alberto
    family-names: Bacchelli
    email: bacchelli@ifi.uzh.ch
    affiliation: University of Zurich

GitHub Events

Total
  • Issues event: 1
Last Year
  • Issues event: 1

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 19
  • Total Committers: 2
  • Avg Commits per committer: 9.5
  • Development Distribution Score (DDS): 0.053
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Marco Castelluccio m****o@m****m 18
Pooja Ruhal p****5@g****m 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 9 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • marco-c (1)
Pull Request Authors
  • marco-c (3)

Dependencies

pyproject.toml pypi
  • PyDriller 1.12
  • bugbug 0.0.482
  • libmozdata 0.1.82
  • python >=3.9,<3.12
  • tqdm ^4.64.1