dfmirror

A repo for the dfmirroR package

https://github.com/jacobpstein/dfmirror

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.2%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: jacobpstein
  • License: other
  • Language: R
  • Default Branch: main
  • Size: 494 KB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Open Issues: 4
  • Releases: 3
Created over 2 years ago · Last pushed 11 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
  output: github_document
---
  

  
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# dfmirroR 


[![CRAN status](https://www.r-pkg.org/badges/version/dfmirroR)](https://CRAN.R-project.org/package=dfmirroR)

  
The goal of dfmirroR is to create mirrored versions of data sets *and* output a string with the code to reproduce that copy. Data scientists often have questions about analyzing a specific data set, but in many cases cannot share their data.

*dfmirroR* creates a copy of the data based on the distribution of specified columns. Because we often want to post questions publicly and need reproducible examples to do so, the package can also output a simplified, pasteable version of the code that creates the mirrored data frame object.

One neat thing about dfmirroR is that it tests whether or not columns are normally distributed and mirrors the specified columns accordingly so that your "fake" data resembles your original data.

## Installation 

You can install the development version of dfmirroR from [GitHub](https://github.com/) with:
  
```{r message = FALSE, warning = FALSE}
# install.packages("devtools")
devtools::install_github("jacobpstein/dfmirroR")
```


You can also install the CRAN version of the package, although it lags behind the development version and some features have not yet been submitted to CRAN:

```{r message = FALSE, warning = FALSE }
install.packages("dfmirroR", repos = "http://cran.us.r-project.org")
```

## Example

This is a basic example which shows you how to solve a common problem. Let's say you are working with the `airquality` dataset. This contains a `Wind` column that is approximately normal based on a Shapiro-Wilk test and another column `Ozone`, which is non-normally distributed. You want to simulate a data set to test a model and need to mirror `airquality` but with more observations and then create a reproducible example.
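To see the normality claim for yourself, here is a minimal sketch using only base R's `shapiro.test()`. The 0.05 cutoff and the `looks_normal` name are assumptions made for illustration; they are not necessarily what dfmirroR uses internally.

```r
# Base R only: shapiro.test() comes from the stats package.
data(airquality)

cols <- c("Ozone", "Wind")

# Run a Shapiro-Wilk test on each column and keep the p-value.
# shapiro.test() tolerates the missing values in Ozone.
p_values <- sapply(cols, function(col) shapiro.test(airquality[[col]])$p.value)

# Assumed cutoff: treat p > 0.05 as "approximately normal".
looks_normal <- p_values > 0.05

print(data.frame(p_value = p_values, looks_normal = looks_normal))
```

With this cutoff, `Wind` is flagged as approximately normal and `Ozone` is not, which matches the description above.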

Here's what the `Ozone` column looks like in the original data:
```{r example1, warning = FALSE}
library(dfmirroR)
library(ggplot2)

data(airquality)

# take a look at the Ozone variable

ggplot(airquality) +
  geom_histogram(aes(Ozone), col = "white", fill = "#AFDFEF", bins = 30) +
  theme_minimal() +
  labs(title = "Distribution of Ozone in the airquality dataset (153 rows, 37 missing)")


```

Now, let's run `dfmirroR` to create a similar column.

```{r example2}

# set a seed
set.seed(3326)

air_mirror <- simulate_dataframe(airquality, num_obs = 1000, columns_to_simulate = c("Ozone", "Wind"))
```

This creates a `list()` object that contains a new data frame with 1,000 observations based on the distributions of the `Ozone` and `Wind` columns in the `input_df` (here, `airquality`).

Take a look at the mirrored column for Ozone:
```{r example3, warning = FALSE}

ggplot(air_mirror$simulated_df) +
  geom_histogram(aes(Ozone), col = "white", fill = "#AFDFEF", bins = 30) +
  theme_minimal() +
  labs(title = "Distribution of 1,000 Ozone observations from a mirrored dataset")

```
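Histograms give a visual impression; a rough numeric check can sit alongside them. The sketch below assumes dfmirroR is installed and that the returned list exposes `simulated_df` as used above. The `ozone_quartiles` name is only illustrative, and how closely the rows agree will depend on the fitted distribution and the seed.

```r
library(dfmirroR)

set.seed(3326)
air_mirror <- simulate_dataframe(airquality, num_obs = 1000,
                                 columns_to_simulate = c("Ozone", "Wind"))

# Compare quartiles of the original and mirrored Ozone columns.
# na.rm = TRUE drops the missing values in the original data.
probs <- c(0.25, 0.5, 0.75)
ozone_quartiles <- rbind(
  original = quantile(airquality$Ozone, probs, na.rm = TRUE),
  mirrored = quantile(air_mirror$simulated_df$Ozone, probs, na.rm = TRUE)
)

print(ozone_quartiles)
```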

## Print code to share your simulated data

There are other packages that can mirror a data frame. The excellent [`faux`](https://debruine.github.io/faux/) comes to mind. However, one addition of the `dfmirroR` package is that it prints code to add to a reproducible example if you need to ask a question on [Stack Overflow](https://stackoverflow.com) or elsewhere.

For example, from our `air_mirror` list object above, we can extract the `code` object, which is just a string containing the relevant code. Combining this object with the `cat()` function provides clean, easily shareable output.

```{r example4}

cat(air_mirror$code)

```
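If you would rather attach the code as a file than paste it, the same string can be written to disk with base R. This is a sketch under the assumption that `air_mirror$code` is a single character string, as described above. The temporary file path is a placeholder, so swap in whatever location suits you.

```r
library(dfmirroR)

set.seed(3326)
air_mirror <- simulate_dataframe(airquality, num_obs = 1000,
                                 columns_to_simulate = c("Ozone", "Wind"))

# Placeholder destination; a temporary file keeps the sketch free of side effects.
code_file <- tempfile(fileext = ".R")

# writeLines() writes the string as-is, much like cat() prints it.
writeLines(air_mirror$code, code_file)

# Read the file back to confirm that it holds the code text.
saved_code <- paste(readLines(code_file), collapse = "\n")
```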

### Citations
This package is indebted to the great [`fitdistrplus`](https://CRAN.R-project.org/package=fitdistrplus) package, which allows `dfmirroR` to dynamically mimic the distribution of input data. For more, see: 

Marie Laure Delignette-Muller, Christophe Dutang (2015). fitdistrplus: An R Package for Fitting Distributions. *Journal of Statistical Software*. https://www.jstatsoft.org/article/view/v064i04 DOI 10.18637/jss.v064.i04.

This package relies on the `skewness` function from:
David Meyer, et al. [e1071](https://CRAN.R-project.org/package=e1071).

The `MASS` package also provides some functionality. Learn more here:
Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York. ISBN 0-387-95457-0, https://www.stats.ox.ac.uk/pub/MASS4/

This package also pulls from the core R [`stats`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html) package. Special thanks to the R Core Team, without whom I would almost definitely be unemployed. 

Owner

  • Login: jacobpstein
  • Kind: user

GitHub Events

Total
  • Release event: 1
  • Watch event: 2
  • Push event: 3
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 2
  • Push event: 3
  • Create event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 3
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jacobpstein (3)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 267 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
cran.r-project.org: dfmirroR

Simulate a Data Frame Mirroring an Input and Produce Shareable Simulation Code

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 267 Last month
Rankings
Dependent packages count: 28.4%
Dependent repos count: 36.4%
Average: 49.9%
Downloads: 84.9%
Maintainers (1)
Last synced: 10 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/pr-fetch v2 composite
  • r-lib/actions/pr-push v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/r.yml actions
  • actions/checkout v3 composite
  • r-lib/actions/setup-r f57f1301a053485946083d7a45022b278929a78a composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 2.10 depends
  • fitdistrplus * imports
  • stats * imports
  • testthat >= 3.0.0 suggests