dataverse-r-study

Data and code for a large-scale study on research code quality and execution at Harvard Dataverse.

https://github.com/atrisovic/dataverse-r-study

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.1%) to scientific vocabulary

Keywords

code-quality documentation r-language r-programming reproducibility
Last synced: 6 months ago

Repository

Data and code for a large-scale study on research code quality and execution at Harvard Dataverse.

Basic Info
  • Host: GitHub
  • Owner: atrisovic
  • License: MIT
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 8.27 MB
Statistics
  • Stars: 17
  • Watchers: 4
  • Forks: 2
  • Open Issues: 4
  • Releases: 0
Topics
code-quality documentation r-language r-programming reproducibility
Created over 5 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

A large-scale study on research code quality and execution


This work is published in Scientific Data: https://www.nature.com/articles/s41597-022-01143-6

Step 1. get-dois

Code in get-dois communicates with the Harvard Dataverse repository and collects the DOIs of datasets that contain R code.
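
The DOI-collection step can be sketched as a query against the Dataverse Search API. This is a minimal illustration, not the repository's actual get-dois code: the exact query syntax (including the `fileContentType` field and the `type/x-r-syntax` MIME type) is an assumption and may differ from what the study used.

```python
from urllib.parse import urlencode

BASE = "https://dataverse.harvard.edu/api/search"

def build_r_dataset_query(start=0, per_page=100):
    """Build a Dataverse Search API URL for datasets containing R files.

    The field name and MIME type below are illustrative assumptions;
    the actual get-dois scripts may filter differently.
    """
    params = {
        "q": "fileContentType:type/x-r-syntax",  # assumed MIME type for .R files
        "type": "dataset",
        "start": start,          # paging offset
        "per_page": per_page,    # results per page
    }
    return BASE + "?" + urlencode(params)

# Paged retrieval would then walk `start` forward until the API
# reports fewer than `per_page` results.
print(build_r_dataset_query())
```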

Step 2. aws-cli

The list of DOIs is used to define jobs for AWS Batch. Code in aws-cli submits these jobs to the Batch queue, where they wait until compute resources become available for execution.
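
Submitting one Batch job per DOI might look like the sketch below, using boto3's `submit_job`. The queue name, job-definition name, and environment-variable convention are placeholders, not the study's actual configuration.

```python
def make_batch_job(doi):
    """Build the kwargs for batch_client.submit_job() for a single dataset DOI.

    All names here are hypothetical placeholders.
    """
    return {
        # Batch job names cannot contain '/' or '.', so sanitize the DOI.
        "jobName": "r-study-" + doi.replace("/", "-").replace(".", "-"),
        "jobQueue": "r-study-queue",          # placeholder queue name
        "jobDefinition": "r-study-job-def",   # placeholder job definition
        "containerOverrides": {
            # Pass the DOI to the container as an environment variable.
            "environment": [{"name": "DOI", "value": doi}],
        },
    }

# With AWS credentials configured, the jobs would be submitted like:
#   import boto3
#   batch = boto3.client("batch")
#   for doi in dois:
#       batch.submit_job(**make_batch_job(doi))
```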

Step 3. docker

When a job leaves the queue, it launches a container from a pre-built Docker image that retrieves the replication package, executes its R code, and collects the results. Code in docker prepares the image.
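
Locally, running the same image for a single dataset could be sketched as below; the image name and the `DOI` environment variable are hypothetical, chosen only to mirror the per-DOI job description above.

```python
def docker_run_command(doi, image="r-study-worker:latest"):
    """Build a `docker run` argument list for re-executing one dataset.

    The image name and env-var convention are illustrative assumptions.
    """
    return [
        "docker", "run", "--rm",      # remove the container when it exits
        "-e", f"DOI={doi}",           # tell the container which package to fetch
        image,
    ]

# The command could then be executed with subprocess.run(docker_run_command(doi)).
```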

Step 4. analysis

All collected data is retrieved and analyzed in analysis.

Q&A

  1. Before any cleaning, 850 scripts produce a library error. Some of these fail because they reference a package that was never loaded, and those can be fixed by installing the missing packages, reducing the errors to 'just' 496. All of the remaining errors are instances where a package failed to load even though the script includes an install.packages() command. Is that right?

Yes, that's correct. More precisely, the code cleaning step adds if (!require(lib)) install.packages(lib) for every library detected in the code. I also tested variants of the cleaning step that added just install.packages(), or install.packages() followed by library(), but the require() variant performed best.
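
The cleaning step described above can be sketched as a small Python pass over an R script. This is a simplified illustration, not the repository's actual cleaning code, which handles more cases:

```python
import re

def add_install_guards(r_source: str) -> str:
    """Prepend `if (!require(lib)) install.packages(lib)` for each package
    referenced via library() or require() in an R script."""
    libs = re.findall(r"(?:library|require)\(['\"]?([\w.]+)['\"]?\)", r_source)
    header = "".join(
        f'if (!require("{lib}")) install.packages("{lib}")\n'
        for lib in dict.fromkeys(libs)  # de-duplicate, preserve order
    )
    return header + r_source

script = "library(ggplot2)\nrequire(dplyr)\nplot(1)\n"
print(add_install_guards(script))
```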

  2. I wanted to look into specific cases that are coded as library errors here, but could not find a file in the dataverse that would let me do that. Does that file exist?

Yes! You can see how all the errors were classified here under the heading "Error type".

  3. Do you have a sense of the extent to which reliance on non-CRAN repositories may account for some of the errors you obtain?

This is a good question and a limitation of our approach. While previewing a large amount of the research code to design the code cleaning step, I did not encounter Bioconductor or GitHub-hosted packages, so my intuition is that they account for a small subset of the errors, but I cannot be sure.

  4. Is there a way for me to see which R scripts were and were not fixed by looking at your posted .csv files?

Allocating a fixed time period for each re-execution on the cloud created the following data-collection problem: out of 10 scripts in the initial re-execution, we might initially have a result for 9, but after code cleaning only for 6 (since "fixed" code may take more time to re-execute). We therefore needed to match the 6 scripts re-executed in the second run to their results in the first run to see how the outcome changed (Fig. 8 in the paper). That matching was done in this notebook. In the section "Constructing Sankey", you can see, for each file, how the error changed before and after code cleaning (i.e., those are result_x and result_y after the merge).
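
The matching described above can be sketched as a pandas merge keyed on the script's filename. The column names and values here are hypothetical, not the notebook's actual schema; pandas' default `_x`/`_y` suffixes on the overlapping `result` column produce the before/after pair.

```python
import pandas as pd

# Hypothetical results from the two re-execution runs.
run1 = pd.DataFrame({
    "filename": ["a.R", "b.R", "c.R"],
    "result": ["library error", "success", "time limit exceeded"],
})
run2 = pd.DataFrame({
    "filename": ["a.R", "c.R"],  # one script timed out after cleaning
    "result": ["success", "time limit exceeded"],
})

# Inner merge keeps only scripts with a result in both runs, yielding
# result_x (before cleaning) and result_y (after cleaning) per file.
matched = run1.merge(run2, on="filename", how="inner")
print(matched)
```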

Owner

  • Name: Ana Trisovic
  • Login: atrisovic
  • Kind: user
  • Location: Cambridge, USA
  • Company: Harvard University

Computer Scientist and #Reproducibility Researcher at @HarvardBiostats & @IQSS | previously with @UChicago, @Cambridge_Uni and @LHCbExperiment @CERN

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Trisovic"
  given-names: "Ana"
  orcid: "https://orcid.org/0000-0003-1991-0533"
title: "Replication data and code for a large-scale study on research code quality and execution"
doi: 10.7910/DVN/UZLXSZ
date-released: 2021-03-23
url: "https://github.com/atrisovic/dataverse-r-study"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 7
  • Total pull requests: 9
  • Average time to close issues: over 1 year
  • Average time to close pull requests: 1 day
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.71
  • Average comments per pull request: 0.0
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 9
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • atrisovic (4)
  • gaborcsardi (2)
  • nuest (1)
Pull Request Authors
  • dependabot[bot] (9)
Top Labels
Issue Labels
Pull Request Labels
dependencies (9)

Dependencies

requirements.txt pypi
  • Jinja2 ==2.11.3
  • MarkupSafe ==1.1.1
  • PyYAML ==5.4
  • Pygments ==2.7.4
  • Send2Trash ==1.5.0
  • appnope ==0.1.0
  • attrs ==19.1.0
  • awscli ==1.16.253
  • backports-abc ==0.5
  • backports.functools-lru-cache ==1.5
  • backports.shutil-get-terminal-size ==1.0.0
  • bleach ==3.3.0
  • boto3 ==1.9.243
  • botocore ==1.12.243
  • certifi ==2019.9.11
  • chardet ==3.0.4
  • colorama ==0.4.1
  • configparser ==4.0.2
  • cycler ==0.10.0
  • decorator ==4.4.0
  • defusedxml ==0.6.0
  • docutils ==0.15.2
  • entrypoints ==0.3
  • enum34 ==1.1.6
  • functools32 ==3.2.3.post2
  • futures ==3.3.0
  • idna ==2.8
  • ipaddress ==1.0.22
  • ipykernel ==4.10.1
  • ipython ==7.16.3
  • ipython-genutils ==0.2.0
  • ipywidgets ==7.5.1
  • jmespath ==0.9.4
  • jsonschema ==3.0.2
  • jupyter ==1.0.0
  • jupyter-client ==5.3.2
  • jupyter-console ==5.2.0
  • jupyter-core ==4.5.0
  • kiwisolver ==1.1.0
  • matplotlib ==2.2.4
  • mistune ==0.8.4
  • nbconvert ==5.6.0
  • nbformat ==4.4.0
  • notebook ==6.4.1
  • numpy ==1.21.0
  • pandas ==0.24.2
  • pandocfilters ==1.4.2
  • pathlib2 ==2.3.4
  • pexpect ==4.7.0
  • pickleshare ==0.7.5
  • prometheus-client ==0.7.1
  • prompt-toolkit ==1.0.16
  • ptyprocess ==0.6.0
  • pyasn1 ==0.4.7
  • pyparsing ==2.4.2
  • pyrsistent ==0.15.4
  • python-dateutil ==2.8.0
  • pytz ==2019.2
  • pyzmq ==18.1.0
  • qtconsole ==4.5.5
  • requests ==2.22.0
  • rsa ==4.7
  • s3transfer ==0.2.1
  • scandir ==1.10.0
  • scipy ==1.2.2
  • seaborn ==0.9.0
  • simplegeneric ==0.8.1
  • singledispatch ==3.4.0.3
  • six ==1.12.0
  • subprocess32 ==3.5.4
  • terminado ==0.8.2
  • testpath ==0.4.2
  • tornado ==5.1.1
  • traitlets ==4.3.2
  • urllib3 ==1.26.5
  • wcwidth ==0.1.7
  • webencodings ==0.5.1
  • widgetsnbextension ==3.5.1
docker/Dockerfile docker
  • continuumio/miniconda latest build