naniar

Tidy data structures, summaries, and visualisations for missing data

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to researchgate.net, nature.com
✓ Committers with academic emails: 3 of 23 committers (13.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (17.1%) to scientific vocabulary

Keywords

data-visualisation ggplot2 missing-data missingness r-package tidy-data

Keywords from Contributors

ropensci visualisation rmarkdown grammar latex data-manipulation exploratory-data-analysis codecov coverage coverage-report

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: njtierney
License: other
Language: R
Default Branch: main
Homepage: http://naniar.njtierney.com/
Size: 94.7 MB

Statistics

Stars: 661
Watchers: 18
Forks: 55
Open Issues: 68
Releases: 13

Topics

data-visualisation ggplot2 missing-data missingness r-package tidy-data

Created about 10 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog Contributing License

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# naniar 


[![R-CMD-check](https://github.com/njtierney/naniar/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/njtierney/naniar/actions/workflows/R-CMD-check.yaml)
[![Coverage Status](https://img.shields.io/codecov/c/github/njtierney/naniar/master.svg)](https://app.codecov.io/github/njtierney/naniar?branch=master)
[![CRAN Status Badge](https://www.r-pkg.org/badges/version/naniar)](https://cran.r-project.org/package=naniar)
[![CRAN Downloads Each Month](https://cranlogs.r-pkg.org/badges/naniar)](https://CRAN.R-project.org/package=naniar)
[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)


`naniar` provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data. It does this by providing:

- Shadow matrices, a tidy data structure for missing data:
    - `bind_shadow()` and `nabular()`
- Shorthand summaries for missing data: 
    - `n_miss()` and `n_complete()`
    - `pct_miss()` and `pct_complete()`
- Numerical summaries of missing data in variables and cases:
    - `miss_var_summary()` and `miss_var_table()`
    - `miss_case_summary()` and `miss_case_table()`
- Statistical tests of missingness:
    - `mcar_test()` for [Little's (1988)](https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722) missing completely at random (MCAR) test
- Visualisation for missing data:
    - `geom_miss_point()`
    - `gg_miss_var()`
    - `gg_miss_case()`
    - `gg_miss_fct()`

For more details on the workflow and theory underpinning naniar, read the vignette [Getting started with naniar](https://naniar.njtierney.com/articles/naniar.html). 

For a short primer on the data visualisation available in naniar, read the vignette [Gallery of Missing Data Visualisations](https://naniar.njtierney.com/articles/naniar-visualisation.html).

For full details of the package, including 

# Installation

You can install naniar from CRAN:

```{r install-cran, eval = FALSE}
install.packages("naniar")
```

Or you can install the development version from GitHub using `remotes`:

```{r install-github, eval = FALSE}
# install.packages("remotes")
remotes::install_github("njtierney/naniar")
```

# A short overview of naniar

Visualising missing data might sound a little strange: how do you visualise something that is not there? One approach comes from [ggobi](http://ggobi.org/) and [manet](https://zbmath.org/software/3067), which replace `NA` values with values 10% lower than the minimum value in that variable. This visualisation is provided with the `geom_miss_point()` ggplot2 geom, which we illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.

```{r regular-geom-point}

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()

```

ggplot2 does not display these missing values: it removes the rows that contain them and gives a warning message about the missing values.

We can instead use `geom_miss_point()` to display the missing data:

```{r geom-miss-point}

library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

```

`geom_miss_point()` has shifted the missing values so that they now sit 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive. As it is a ggplot2 geom, it supports faceting and other ggplot2 features.

```{r facet-by-month}

p1 <-
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point() + 
  facet_wrap(~Month, ncol = 2) + 
  theme(legend.position = "bottom")

p1

```
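The shifting described above can be sketched in base R. This is only an illustration of the idea, not naniar's implementation: `shift_missing_below()` is a hypothetical helper, and it assumes that "10% below the minimum" means the minimum minus 10% of the observed range, with no jitter added.

```r
# Hypothetical helper (not part of naniar): replace each NA with a value
# placed below the observed minimum so that it can be drawn on a plot.
# Assumption: the offset is a proportion of the observed range.
shift_missing_below <- function(x, prop_below = 0.1) {
  observed_range <- range(x, na.rm = TRUE)
  offset <- prop_below * diff(observed_range)
  x[is.na(x)] <- observed_range[1] - offset
  x
}

ozone_shifted <- shift_missing_below(airquality$Ozone)

# Every originally missing value now sits below the smallest observed value
summary(ozone_shifted[is.na(airquality$Ozone)])
min(airquality$Ozone, na.rm = TRUE)
```

Because the shifted values fall outside the observed range, they cannot be confused with real measurements when plotted.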

# Data Structures

naniar provides a data structure for working with missing data, the shadow matrix [(Swayne and Buja, 1998)](https://www.researchgate.net/publication/2758672_Missing_Data_in_Interactive_High-Dimensional_Data_Visualization). The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as "NA" and not missing is represented as "!NA". Variable names are kept the same, with the suffix "_NA" added.

```{r show-shadow}

head(airquality)

as_shadow(airquality)

```
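To make the structure concrete, a shadow matrix can be approximated in base R. This is a minimal sketch under stated assumptions, not naniar's implementation: `make_shadow()` is a hypothetical helper, and storing each indicator as a factor with levels "!NA" and "NA" is an assumption made for illustration.

```r
# Hypothetical helper (not part of naniar): build a data frame of
# missingness indicators with the same dimensions as the input.
make_shadow <- function(data) {
  shadow <- lapply(data, function(column) {
    factor(ifelse(is.na(column), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  # Keep the variable names, adding the "_NA" suffix
  names(shadow) <- paste0(names(data), "_NA")
  as.data.frame(shadow)
}

airquality_shadow <- make_shadow(airquality)

head(airquality_shadow)
dim(airquality_shadow)
```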

Binding the shadow data to the data helps you keep better track of the missing values. This format is called "nabular", a portmanteau of `NA` and `tabular`. You can bind the shadow to the data using `bind_shadow()` or `nabular()`:

```{r show-nabular}
bind_shadow(airquality)
nabular(airquality)
```

Using the nabular format helps you manage where missing values are in your dataset and makes it easy to do visualisations where you split by missingness:

```{r shadow-w-ggplot}

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) + 
  geom_density(alpha = 0.5)

```
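The same split by missingness can be summarised numerically. The following base R sketch compares temperature between rows where Ozone is missing and rows where it is observed; the name `ozone_status` is introduced here for illustration only.

```r
# Label each row by whether Ozone is missing, mirroring the "NA" / "!NA"
# coding used in the shadow matrix.
ozone_status <- ifelse(is.na(airquality$Ozone), "NA", "!NA")

# Mean and median temperature within each missingness group
temp_mean_by_status <- tapply(airquality$Temp, ozone_status, mean)
temp_median_by_status <- tapply(airquality$Temp, ozone_status, median)

temp_mean_by_status
temp_median_by_status
```

A large difference between the two groups would suggest that missingness in Ozone is related to temperature, which is what the density plot lets you judge visually.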

And even visualise imputations:

```{r shadow-impute}

airquality %>%
  bind_shadow() %>%
  as.data.frame() %>% 
  simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()


```

Or create an [upset](https://www.nature.com/articles/nmeth.3033) plot, which shows the combinations of missingness across cases, using the `gg_miss_upset()` function:

```{r gg-miss-upset}

gg_miss_upset(airquality)

```
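The combinations shown in an upset plot can also be tabulated directly. This base R sketch counts how many cases share each missingness pattern; the pattern labels (such as "complete") are chosen here for illustration and are not produced by naniar.

```r
# Logical matrix: TRUE where a value is missing
missing_flags <- is.na(airquality)

# Describe each row by the set of variables that are missing in it
pattern <- apply(missing_flags, 1, function(row_flags) {
  missing_vars <- colnames(missing_flags)[row_flags]
  if (length(missing_vars) == 0) {
    "complete"
  } else {
    paste(missing_vars, collapse = " & ")
  }
})

# Number of cases in each missingness pattern, largest first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```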

naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.

naniar also provides handy visualisations for each variable:

```{r gg-miss-var}

gg_miss_var(airquality)

```

Or the number of missings in a given variable at a repeating span:

```{r gg-miss-span}
gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)
```
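The idea of counting missings over a repeating span can be sketched in base R. This is an illustration under stated assumptions rather than naniar's implementation: `count_missing_by_span()` is a hypothetical helper, and it is applied to `airquality$Ozone` with an arbitrary span of 30 rows so that the example does not depend on the `pedestrian` data.

```r
# Hypothetical helper (not part of naniar): split a vector into
# consecutive spans of `span_every` values and count the missings in each.
count_missing_by_span <- function(x, span_every) {
  span_id <- ceiling(seq_along(x) / span_every)
  n_miss <- tapply(is.na(x), span_id, sum)
  n_total <- tapply(x, span_id, length)
  data.frame(
    span = as.integer(names(n_miss)),
    n_miss = as.vector(n_miss),
    prop_miss = as.vector(n_miss / n_total)
  )
}

ozone_spans <- count_missing_by_span(airquality$Ozone, span_every = 30)
ozone_spans
```

The last span may be shorter than the others when the length of the vector is not a multiple of `span_every`, which is why the proportion is computed per span.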

You can read about all of the visualisations in naniar in the vignette [Gallery of missing data visualisations using naniar](https://naniar.njtierney.com/articles/naniar-visualisation.html).

naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:

```{r show-miss-helpers}
n_miss(airquality)
n_complete(airquality)
prop_miss(airquality)
prop_complete(airquality)
pct_miss(airquality)
pct_complete(airquality)
```
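For intuition, the quantities these helpers report can be written out with base R's `is.na()`. This sketch shows what is being counted and is not a description of how naniar computes it; the variable names are illustrative.

```r
# Logical matrix: TRUE where a cell of the data frame is missing
is_missing <- is.na(airquality)

n_missing <- sum(is_missing)      # number of missing cells
n_present <- sum(!is_missing)     # number of complete cells
prop_missing <- mean(is_missing)  # proportion of cells that are missing
pct_missing <- 100 * prop_missing # the same quantity as a percentage

c(n_missing = n_missing, n_present = n_present)
c(prop_missing = prop_missing, pct_missing = pct_missing)
```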

# Numerical summaries for missing data

naniar provides numerical summaries of missing data that follow a consistent naming rule: every function name begins with `miss_`. Summaries that focus on variables, or on a single selected variable, start with `miss_var_`, and summaries for cases (rows, in the order the data was originally collected) start with `miss_case_`. All of these functions that return data frames also work with dplyr's `group_by()`.

For example, we can look at the number and percent of missings in each variable and case with `miss_var_summary()` and `miss_case_summary()`, which both return output ordered by the number of missing values.

```{r miss-var-case-summary}

miss_var_summary(airquality)
miss_case_summary(airquality)

```
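A rough base R equivalent of the per-variable and per-case summaries is sketched below. The column names `n_miss` and `pct_miss` are chosen here for illustration and should not be read as a specification of naniar's output.

```r
# Per-variable summary: count and percentage of missing values per column
n_missing_by_var <- colSums(is.na(airquality))
var_summary <- data.frame(
  variable = names(n_missing_by_var),
  n_miss = as.vector(n_missing_by_var),
  pct_miss = 100 * as.vector(n_missing_by_var) / nrow(airquality)
)
# Order by the number of missing values, most missing first
var_summary <- var_summary[order(var_summary$n_miss, decreasing = TRUE), ]
rownames(var_summary) <- NULL

# Per-case summary: count and percentage of missing values per row
n_missing_by_case <- rowSums(is.na(airquality))
case_summary <- data.frame(
  case = seq_len(nrow(airquality)),
  n_miss = n_missing_by_case,
  pct_miss = 100 * n_missing_by_case / ncol(airquality)
)
case_summary <- case_summary[order(case_summary$n_miss, decreasing = TRUE), ]

var_summary
head(case_summary)
```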

You could also use `group_by()` to work out the number of missings in each variable within each level of a grouping variable.

```{r group-by-var}

library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_summary()

```
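The grouped summary follows a split-apply-combine pattern, which can be sketched in base R as follows. This is an illustration only; unlike the grouped naniar output, this sketch also reports the grouping column `Month` itself, which has no missing values.

```r
# Split the data by month, then count missing values per variable
by_month <- split(airquality, airquality$Month)
missing_by_month <- t(sapply(by_month, function(month_data) {
  colSums(is.na(month_data))
}))

# Rows are months, columns are variables
missing_by_month
```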

You can read more about all of these functions in the vignette ["Getting Started with naniar"](https://naniar.njtierney.com/articles/naniar.html).

# Statistical tests of missingness

naniar provides `mcar_test()` for [Little's (1988)](https://www.tandfonline.com/doi/abs/10.1080/01621459.1988.10478722) statistical test for missing completely at random (MCAR) data. The null hypothesis in this test is that the data is MCAR, and the test statistic is a chi-squared value. Given the high statistic value and low p-value, we can conclude that the `airquality` data is not missing completely at random:

```{r mcar-test}
mcar_test(airquality)
```
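To act on the result programmatically, the p-value can be compared with a significance level. The sketch below assumes that `mcar_test()` returns a one-row data frame with columns named `statistic`, `df`, and `p.value`; the 0.05 threshold is an arbitrary choice made for illustration.

```r
library(naniar)

mcar_result <- mcar_test(airquality)

# Assumed column names: statistic, df, p.value
alpha <- 0.05

# The p-value is the upper tail probability of a chi-squared distribution
p_from_chisq <- pchisq(mcar_result$statistic,
                       df = mcar_result$df,
                       lower.tail = FALSE)

# TRUE means the null hypothesis of MCAR is rejected at level alpha
reject_mcar <- mcar_result$p.value < alpha
reject_mcar
```

Rejecting the null hypothesis only indicates that the data are unlikely to be MCAR; it does not distinguish between other missingness mechanisms.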

# Contributions

Please note that this project is released with a [Contributor Code of Conduct](https://naniar.njtierney.com/CONDUCT.html). By participating in this project you agree to abide by its terms.

# Future Work

- Extend the `geom_miss_*` family to include categorical variables and bivariate plots: scatterplots, density overlays
- SQL translation for databases
- Big Data tools (sparklyr, sparklingwater)
- Work well with other imputation engines / processes
- Provide tools for assessing goodness of fit for classical approaches of MCAR, MAR, and MNAR (graphical inference from `nullabor` package)

## Acknowledgements

Firstly, thanks to [Di Cook](https://github.com/dicook) for giving the initial inspiration for the package and laying down the rich theory and literature that the work in naniar is built upon.
Naming credit (once again!) goes to [Miles McBain](https://github.com/milesmcbain). Among various other things, Miles also worked out how to overload the missing data and make it work as a geom. Thanks also to [Colin Fay](https://github.com/ColinFay) for helping me understand tidy evaluation and for features such as `replace_to_na`, `miss_*_cumsum`, and more.

## A note on the name

naniar was previously named `ggmissing` and initially provided a ggplot geom and some other visualisations. `ggmissing` was changed to `naniar` to reflect the fact that this package is going to be bigger in scope, and is not just related to `ggplot2`. Specifically, the package is designed to provide a suite of tools for visualising missing values and imputations, and for manipulating and summarising missing data.

> ...But _why_ naniar?

Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like [C.S. Lewis's Narnia](https://en.wikipedia.org/wiki/The_Chronicles_of_Narnia) - a different world, hidden away. You go inside, and sometimes it seems like you've spent no time in there but time has passed very quickly, or the opposite. Also, `NA`niar = na in r, and if you so desire, naniar may sound like "noneoya" in an nz/aussie accent. Full credit to @MilesMcbain for the name, and @Hadley for the rearranged spelling.

Owner

Name: Nicholas Tierney
Login: njtierney
Kind: user
Location: lutruwita (Tasmania)
Company: Freelancer

Website: https://www.njtierney.com
Repositories: 351
Profile: https://github.com/njtierney

|| Freelance Statistician and Research Software Engineer | PhD Statistics | Hiker | Runner | Coffee Geek ||

GitHub Events

Total

Issues event: 4
Watch event: 13
Delete event: 2
Issue comment event: 2
Push event: 6
Pull request event: 3
Fork event: 1
Create event: 2

Last Year

Issues event: 4
Watch event: 13
Delete event: 2
Issue comment event: 2
Push event: 6
Pull request event: 3
Fork event: 1
Create event: 2

Committers

Last synced: 12 months ago

All Time

Total Commits: 861
Total Committers: 23
Avg Commits per committer: 37.435
Development Distribution Score (DDS): 0.098

Past Year

Commits: 9
Committers: 1
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Nicholas Tierney	n**y@g**m	777
ColinFay	c**t@c**e	24
olivroy	o**1@h**m	9
Miles McBain	m**n@g**m	7
Hadley Wickham	h**m@g**m	6
Luke Smith	l**i@s**t	6
Carson Sievert	c**1@g**m	4
Dianne Cook	v**t@g**m	4
maksymiuks	3****s	4
Lionel Henry	l**y@g**m	3
Nicholas Tierney	n**y@N**l	2
Christophe Regouby	c**y@a**m	2
Mitchell	m**d@m**u	2
Romain Francois	r**n@p**t	2
Andrew Heiss	a**w@a**m	1
Eric Nantz	t**t@g**m	1
Gordon McDonald	g**7@g**m	1
Jim Hester	j**r@g**m	1
Katrin Leinweber	9****r	1
Kirill Müller	k**r@m**g	1
unknown	m**m@S**u	1
Nicholas Tierney	n**y@i**u	1
Stuart Lacy	s**y@g**m	1

Committer Domains (Top 20 + Academic)

ict1601.ichr.uwa.edu.au: 1 sef-hdr-120549.qut.edu.au: 1 mailbox.org: 1 andrewheiss.com: 1 purrple.cat: 1 monash.edu: 1 airbus.com: 1 sbcglobal.net: 1 colinfay.me: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 104
Total pull requests: 17
Average time to close issues: about 3 years
Average time to close pull requests: about 1 month
Total issue authors: 26
Total pull request authors: 4
Average comments per issue: 2.03
Average comments per pull request: 0.88
Merged pull requests: 15
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 4
Pull requests: 3
Average time to close issues: less than a minute
Average time to close pull requests: 35 minutes
Issue authors: 2
Pull request authors: 2
Average comments per issue: 0.5
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

njtierney (72)
SchmidtPaul (2)
AndreaPi (2)
siavash-babaei (2)
jzadra (2)
ColinFay (2)
benifex (2)
dicook (2)
shlid007 (1)
jamesysilber (1)
oloverm (1)
VectorPosse (1)
mkcaulfield (1)
LukasWallrich (1)
romainfrancois (1)

Pull Request Authors

njtierney (13)
olivroy (2)
maksymiuks (1)
tranlevantra (1)

Top Labels

Issue Labels

maintenance (23) Priority 2 (21) visualisation (13) Priority 1 (12) feature (11) Idea (11) planning (9) internals (5) help wanted (5) time-series (5) documentation (4) enhancement (3)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- cran 22,272 last-month
Total docker downloads: 43,137

Total dependent packages: 7
(may contain duplicates)
Total dependent repositories: 45
(may contain duplicates)
Total versions: 23
Total maintainers: 1

cran.r-project.org: naniar

Data Structures, Summaries, and Visualisations for Missing Data

Homepage: https://github.com/njtierney/naniar
Documentation: http://cran.r-project.org/web/packages/naniar/naniar.pdf
License: MIT + file LICENSE
Latest release: 1.1.0
published almost 2 years ago

Versions: 14
Dependent Packages: 7
Dependent Repositories: 43
Downloads: 22,272 Last month
Docker Downloads: 43,137

Rankings

Stargazers count: 0.5%

Forks count: 1.4%

Downloads: 3.2%

Dependent repos count: 3.9%

Average: 4.0%

Docker downloads count: 6.0%

Dependent packages count: 9.1%

Maintainers (1)

nicholas.tierney@gmail.com

Last synced: 6 months ago

proxy.golang.org: github.com/njtierney/naniar

Documentation: https://pkg.go.dev/github.com/njtierney/naniar#section-documentation
License: other
Latest release: v1.1.0
published almost 2 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

conda-forge.org: r-naniar

Homepage: https://github.com/njtierney/naniar
License: MIT
Latest release: 0.6.1
published almost 5 years ago

Versions: 6
Dependent Packages: 0
Dependent Repositories: 2

Rankings

Stargazers count: 15.9%

Dependent repos count: 20.1%

Forks count: 24.9%

Average: 28.1%

Dependent packages count: 51.5%

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 3.1.2 depends
UpSetR * imports
dplyr * imports
forcats * imports
ggplot2 * imports
glue * imports
magrittr * imports
norm * imports
purrr * imports
rlang * imports
stats * imports
tibble >= 2.0.0 imports
tidyr * imports
viridis * imports
visdat * imports
Hmisc * suggests
covr * suggests
gdtools * suggests
gridExtra * suggests
here * suggests
imputeTS * suggests
knitr * suggests
rmarkdown * suggests
rpart * suggests
rpart.plot * suggests
simputation * suggests
spelling * suggests
testthat >= 2.1.0 suggests
vdiffr * suggests
wakefield * suggests

.github/workflows/R-CMD-check.yaml actions

actions/checkout v2 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

actions/checkout master composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r master composite