janitor

simple tools for data cleaning in R

https://github.com/sfirke/janitor

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 2 of 36 committers (5.6%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (19.1%) to scientific vocabulary

Keywords

data-analysis data-cleaning data-science dirty-data excel pivot-tables r spss tabulations tidyverse

Keywords from Contributors

visualisation tidy-data package-creation shiny data-manipulation setup devtools unit-testing parsing strings

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: sfirke
License: other
Language: R
Default Branch: main
Homepage: http://sfirke.github.io/janitor/
Size: 8.2 MB

Statistics

Stars: 1,419
Watchers: 35
Forks: 134
Open Issues: 39
Releases: 13

Created about 10 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License

README.Rmd

---
output:
  github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
options(width = 110)
```

# janitor 


> Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
> 
> -- ["For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight"](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) *(New York Times, 2014)*


***********************


[![R-CMD-check](https://github.com/sfirke/janitor/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/sfirke/janitor/actions/workflows/R-CMD-check.yaml)
[![Coverage Status](https://img.shields.io/codecov/c/github/sfirke/janitor/main.svg)](https://app.codecov.io/github/sfirke/janitor?branch=main)
[![lifecycle](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version-ago/janitor)](https://cran.r-project.org/package=janitor)
![!Monthly Downloads](https://cranlogs.r-pkg.org/badges/janitor)
![!Downloads](https://cranlogs.r-pkg.org/badges/grand-total/janitor)


**janitor** has simple functions for examining and cleaning dirty data.  It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can perform many of these tasks already, but with janitor they can do them faster and save their thinking for the fun stuff.

The main janitor functions:

* perfectly format data.frame column names;
* create and format frequency tables of one, two, or three variables - think an improved `table()`; and
* provide other tools for cleaning and examining data.frames.

The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.

janitor is a [#tidyverse](https://cran.r-project.org/package=tidyverse/vignettes/manifesto.html)-oriented package.  Specifically, it plays nicely with the `%>%` pipe and is optimized for cleaning data brought in with the [readr](https://github.com/tidyverse/readr) and [readxl](https://github.com/tidyverse/readxl) packages.


##  Installation

You can install:

* the most recent officially-released version from CRAN with

```r
install.packages("janitor")
```

* the latest development version from GitHub with

```R
# install.packages("remotes")
remotes::install_github("sfirke/janitor")
# or from r-universe
install.packages("janitor", repos = c("https://sfirke.r-universe.dev", "https://cloud.r-project.org"))
```

## Using janitor

A full description of each function, organized by topic, can be found in janitor's [catalog of functions vignette](https://sfirke.github.io/janitor/articles/janitor.html).  There you will find functions not mentioned in this README, such as `compare_df_cols()`, which summarizes differences in column names and types across a set of data.frames.
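
As a rough illustration of `compare_df_cols()`, the sketch below compares two small made-up data.frames. The names `df1` and `df2` and their contents are hypothetical and chosen only so that one column differs in type and one column is missing from the first input.

```r
library(janitor)

# Two hypothetical data.frames: `id` differs in type, `grade` exists only in df2
df1 <- data.frame(id = 1:2, score = c(90, 85))
df2 <- data.frame(id = c("3", "4"), score = c(70, 65), grade = c("A", "B"))

# One row per column name, plus one column per input data.frame
# holding that column's class (NA where the column is absent)
comparison <- compare_df_cols(df1, df2)
comparison

# Single TRUE/FALSE answer: would these bind together without type conflicts?
bindable <- compare_df_cols_same(df1, df2)
bindable
```

Running a check like this before `dplyr::bind_rows()` can surface type mismatches early, rather than as an error partway through a pipeline.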

Below are quick examples of how janitor tools are commonly used.    

### Cleaning dirty data

Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file [dirty_data.xlsx](https://github.com/sfirke/janitor/blob/main/dirty_data.xlsx):
![All kinds of dirty.](man/figures/dirty_data.PNG)

Dirtiness includes:

* A header at the top
* Dreadful column names
* Rows and columns containing Excel formatting but no data
* Dates in two different formats in a single column (MM/DD/YYYY and numbers)
* Values spread inconsistently over the "Certification" columns

Here's that data after being read into R:
```{r, warning = FALSE, message = FALSE}
library(readxl)
library(janitor)
library(dplyr)
library(here)

roster_raw <- read_excel(here("dirty_data.xlsx")) # available at https://github.com/sfirke/janitor
glimpse(roster_raw)
```

Now, to clean it up, starting with the column names.

Name cleaning comes in two flavors. `make_clean_names()` operates on character vectors and can be used during data import:
```{r, warning = FALSE, message = FALSE}
roster_raw_cleaner <- read_excel(here("dirty_data.xlsx"),
  skip = 1,
  .name_repair = make_clean_names
)
glimpse(roster_raw_cleaner)
```
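
To see what `make_clean_names()` does on its own, here is a minimal sketch on a character vector. The header strings are invented for illustration and are not the actual headers in `dirty_data.xlsx`.

```r
library(janitor)

# Hypothetical messy headers: spaces, a symbol, and a repeated name
messy <- c("First Name", "Last Name", "% Allocated", "ID", "ID")

clean <- make_clean_names(messy)
clean
# With the default settings this should give:
# "first_name" "last_name" "percent_allocated" "id" "id_2"
```

Because the function takes and returns a plain character vector, it can be passed anywhere a name-repair function is accepted, as in the `.name_repair` argument above.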

`clean_names()` is a convenience version of `make_clean_names()` that can be used for piped data.frame workflows.  The equivalent steps with `clean_names()` would be:

```{r, warning = FALSE}
roster_raw <- roster_raw %>%
  row_to_names(row_number = 1) %>%
  clean_names()
```
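
The same two steps can be tried without the Excel file. The sketch below assumes a tiny hypothetical data.frame (`raw`) whose real header sits in its first row, which is the situation `row_to_names()` is meant for.

```r
library(janitor)

# Hypothetical import result: the true header ended up in row 1
raw <- data.frame(
  X1 = c("First Name", "Ann", "Bob"),
  X2 = c("Hire Date", "5/1/2001", "8/15/2010"),
  stringsAsFactors = FALSE
)

cleaned <- raw %>%
  row_to_names(row_number = 1) %>% # promote row 1 to column names and drop it
  clean_names()                    # then standardize those names

names(cleaned) # "first_name" "hire_date"
nrow(cleaned)  # 2
```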

The data.frame now has clean names.  Let's tidy it up further:

```{r}
roster <- roster_raw %>%
  remove_empty(c("rows", "cols")) %>%
  remove_constant(na.rm = TRUE, quiet = FALSE) %>% # remove the column of all "Yes" values
  mutate(
    hire_date = convert_to_date(
      hire_date, # handle the mixed-format dates
      character_fun = lubridate::mdy
    ),
    cert = dplyr::coalesce(certification, certification_2)
  ) %>%
  select(-certification, -certification_2) # drop unwanted columns

roster
```
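
For readers without the spreadsheet at hand, the following sketch applies the same cleaning functions to a small invented data.frame. The `staff` object and its values are assumptions made up for this example; only the janitor and lubridate functions are real.

```r
library(janitor)

# Hypothetical data with an empty row, an empty column, a constant column,
# and dates stored both as text and as an Excel serial number
staff <- data.frame(
  name = c("Ann", NA, "Bob"),
  active = c("Yes", NA, "Yes"),
  blank = c(NA, NA, NA),
  hire_date = c("5/1/2001", NA, "40000"),
  stringsAsFactors = FALSE
)

tidy <- staff %>%
  remove_empty(c("rows", "cols")) %>% # drops row 2 and the `blank` column
  remove_constant(na.rm = TRUE)       # drops `active`, which is always "Yes"

# Text dates go through lubridate::mdy; Excel serial numbers are detected
# and converted separately (serial 40000 corresponds to 2009-07-06)
tidy$hire_date <- convert_to_date(tidy$hire_date, character_fun = lubridate::mdy)
tidy
```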


### Examining dirty data

#### Finding duplicates
Use `get_dupes()` to identify and examine duplicate records during data cleaning.  Let's see if any teachers are listed more than once:
```{r}
roster %>% get_dupes(contains("name"))
```
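
A self-contained sketch of the same check, using a hypothetical `teachers` data.frame in place of `roster`:

```r
library(janitor)

# Hypothetical roster in which Ann Lee is listed twice
teachers <- data.frame(
  first_name = c("Ann", "Bob", "Ann"),
  last_name = c("Lee", "Ray", "Lee"),
  subject = c("Math", "Art", "PE"),
  stringsAsFactors = FALSE
)

# Rows that share the same first_name + last_name combination,
# with a dupe_count column saying how many times each combination occurs
dupes <- get_dupes(teachers, first_name, last_name)
dupes
```

Only the columns passed to `get_dupes()` are used to define a duplicate, so the two Ann Lee rows are returned even though their `subject` values differ.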

Yes, some teachers appear twice.  We ought to address this before counting employees.

#### Tabulating tools
A variable (or a combination of two or three variables) can be tabulated with `tabyl()`.  The resulting data.frame can be tweaked and formatted with the suite of `adorn_` functions for quick analysis and printing of pretty results in a report.  `adorn_` functions can be helpful with non-tabyls, too.

#### `tabyl()`

Like `table()`, but pipe-able, data.frame-based, and fully featured.

`tabyl()` can be called two ways:

* On a vector, when tabulating a single variable: `tabyl(roster$subject)`
* On a data.frame, specifying 1, 2, or 3 variable names to tabulate: `roster %>% tabyl(subject, employee_status)`.
    * Here the data.frame is passed in with the `%>%` pipe; this allows `tabyl` to be used in an analysis pipeline
    
One variable:
```{r}
roster %>%
  tabyl(subject)
```

Two variables:
```{r}
roster %>%
  filter(hire_date > as.Date("1950-01-01")) %>%
  tabyl(employee_status, full_time)
```

Three variables:
```{r}
roster %>%
  tabyl(full_time, subject, employee_status, show_missing_levels = FALSE)
```
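
The calls above depend on the `roster` data. As a sketch that runs anywhere, the same three forms can be tried on R's built-in `mtcars` dataset, which is used here purely as a stand-in.

```r
library(janitor)

# One variable: a data.frame with the value, its count (n), and its share (percent)
one_way <- tabyl(mtcars, cyl)
one_way

# Two variables: rows are cyl values, columns are gear values
two_way <- tabyl(mtcars, cyl, gear)
two_way

# Three variables: a list of two-way tabyls, one per value of the third variable
three_way <- tabyl(mtcars, cyl, gear, am)
names(three_way)
```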

#### Adorning tabyls
The `adorn_` functions dress up the results of these tabulation calls for fast, basic reporting.  Here are some of the functions that augment a summary table for reporting:

```{r}
roster %>%
  tabyl(employee_status, full_time) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined")
```
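
The same chain of adornments can be sketched on the built-in `mtcars` dataset, again only as a stand-in for `roster`:

```r
library(janitor)

report <- mtcars %>%
  tabyl(cyl, gear) %>%
  adorn_totals("row") %>%              # append a "Total" row of counts
  adorn_percentages("row") %>%         # convert counts to row-wise proportions
  adorn_pct_formatting(digits = 1) %>% # format proportions as percentage strings
  adorn_ns() %>%                       # add the underlying counts back in parentheses
  adorn_title("combined")              # label the first column "cyl/gear"

report
```

The order matters here: totals are added while the cells are still counts, percentages are computed before they are formatted as text, and `adorn_ns()` comes after formatting so the counts can be appended to the formatted strings.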

Pipe that right into `knitr::kable()` in your R Markdown report.
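
A minimal sketch of that last step, assuming the knitr package is installed and using `mtcars` in place of `roster`:

```r
library(janitor)

# Build a small adorned table and hand it to kable() for rendering
tbl <- mtcars %>%
  tabyl(cyl, gear) %>%
  adorn_totals("row") %>%
  knitr::kable()

tbl
```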

These modular adornments can be layered to reduce R's deficit against Excel and SPSS when it comes to quick, informative counts.  Learn more about `tabyl()` and the `adorn_` functions from the [tabyls vignette](https://sfirke.github.io/janitor/articles/tabyls.html).

##  Contact me

You are welcome to:

* submit suggestions and report bugs: https://github.com/sfirke/janitor/issues
* let me know what you think on Mastodon: [@samfirke@a2mi.social](https://a2mi.social/@samfirke)
* compose a friendly e-mail to:

Owner

Name: Sam Firke
Login: sfirke
Kind: user
Location: Ann Arbor, MI
Company: City of Ann Arbor

Website: samfirke.com
Repositories: 3
Profile: https://github.com/sfirke

Data scientist, caring human. Current: municipal data analysis and BI in SQL, Apache Superset, and Python. Previously: #rstats all day.

GitHub Events

Total

Create event: 6
Release event: 1
Issues event: 14
Watch event: 36
Delete event: 2
Issue comment event: 40
Push event: 30
Pull request review event: 6
Pull request review comment event: 6
Pull request event: 9
Fork event: 5

Last Year

Create event: 6
Release event: 1
Issues event: 14
Watch event: 36
Delete event: 2
Issue comment event: 40
Push event: 30
Pull request review event: 6
Pull request review comment event: 6
Pull request event: 9
Fork event: 5

Committers

Last synced: 11 months ago

All Time

Total Commits: 909
Total Committers: 36
Avg Commits per committer: 25.25
Development Distribution Score (DDS): 0.194

Past Year

Commits: 6
Committers: 4
Avg Commits per committer: 1.5
Development Distribution Score (DDS): 0.667

Top Committers

Name	Email	Commits
Sam Firke	s**e@g**m	733
Bill Denney	b****y	64
Matan	m**m@g**m	22
JosiahParry	j**y@g**m	11
Chris Haid	c**d@k**g	10
olivroy	5****y	10
Ryan Knight	r**t@g**m	9
Julien	j**n@n**g	7
Tazinho	m**r@g**m	6
khueyama	k**a@g**m	3
Jonathan Zadra	j**a@s**m	2
Matt	m**a@g**m	2
Henry Naish	5****7	2
fernando	f**o@g**m	2
Romain François	r**n@r**m	2
jsta	s**2@m**u	2
khueyama	k**a@u**g	2
Jason Aizkalns	j**s@g**m	2
Bernie Gray	b****3	1
=	j**y@n**l	1
Kyle Haynes	k**s@t**u	1
Dan Chaltiel	d**l@g**m	1
Daniel Barnett	1****t	1
Francis Barton	f**n@g**m	1
Garth Tarr	g**r@g**m	1
JJSteph	J****h	1
Jonathan Leslie	3****e	1
Josep Pueyo-Ros	5****o	1
Kevin Gilds	k**s@g**m	1
Kirill Müller	k****r	1
and 6 more...

Committer Domains (Top 20 + Academic)

uni-hamburg.de: 1
treasury.qld.gov.au: 1
urban.org: 1
msu.edu: 1
rstudio.com: 1
sorensonimpact.com: 1
nozav.org: 1
kippchicago.org: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 95
Total pull requests: 62
Average time to close issues: 10 months
Average time to close pull requests: about 1 month
Total issue authors: 54
Total pull request authors: 9
Average comments per issue: 4.27
Average comments per pull request: 3.35
Merged pull requests: 54
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 11
Pull requests: 6
Average time to close issues: 3 days
Average time to close pull requests: about 12 hours
Issue authors: 8
Pull request authors: 3
Average comments per issue: 1.64
Average comments per pull request: 2.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

sfirke (24)
billdenney (12)
olivroy (3)
larry77 (3)
mgacc0 (2)
matanhakim (2)
jzadra (2)
CITESmike2018 (1)
AltfunsMA (1)
cstepper (1)
daranzolin (1)
panporter (1)
eauleaf (1)
statzhero (1)
francisbarton (1)

Pull Request Authors

billdenney (30)
sfirke (17)
olivroy (13)
matanhakim (4)
jospueyo (2)
JasonAizkalns (2)
DanChaltiel (1)
mgacc0 (1)
lionel- (1)

Top Labels

Issue Labels

seeking comments (8), bug (3), next-release (2), pull-request-welcome (2), in progress (1)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- cran 239,638 last-month
Total docker downloads: 175,600

Total dependent packages: 138
(may contain duplicates)
Total dependent repositories: 581
(may contain duplicates)
Total versions: 31
Total maintainers: 1

cran.r-project.org: janitor

Simple Tools for Examining and Cleaning Dirty Data

Homepage: https://github.com/sfirke/janitor
Documentation: http://cran.r-project.org/web/packages/janitor/janitor.pdf
License: MIT + file LICENSE
Latest release: 2.2.1
published over 1 year ago

Versions: 14
Dependent Packages: 130
Dependent Repositories: 573
Downloads: 239,638 Last month
Docker Downloads: 175,600

Rankings

Stargazers count: 0.2%

Forks count: 0.5%

Dependent repos count: 0.6%

Dependent packages count: 0.8%

Downloads: 1.2%

Average: 3.8%

Docker downloads count: 19.8%

Maintainers (1)

samuel.firke@gmail.com

Last synced: 10 months ago

proxy.golang.org: github.com/sfirke/janitor

Documentation: https://pkg.go.dev/github.com/sfirke/janitor#section-documentation
License: other
Latest release: v2.2.1+incompatible
published over 1 year ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.5%

Average: 5.7%

Dependent repos count: 5.9%

Last synced: 10 months ago

conda-forge.org: r-janitor

Homepage: https://github.com/sfirke/janitor
License: MIT
Latest release: 2.1.0
published over 5 years ago

Versions: 8
Dependent Packages: 8
Dependent Repositories: 8

Rankings

Dependent packages count: 7.1%

Stargazers count: 11.4%

Average: 11.7%

Dependent repos count: 12.1%

Forks count: 16.3%

Last synced: 10 months ago

Dependencies

DESCRIPTION cran

R >= 3.1.2 depends
dplyr >= 1.0.0 imports
hms * imports
lifecycle * imports
lubridate * imports
magrittr * imports
purrr * imports
rlang * imports
snakecase >= 0.9.2 imports
stringi * imports
stringr * imports
tidyr >= 0.7.0 imports
tidyselect >= 1.0.0 imports
knitr * suggests
rmarkdown * suggests
sf * suggests
testthat * suggests
tibble * suggests
tidygraph * suggests

.github/workflows/R-CMD-check.yaml actions

actions/checkout v3 composite
r-lib/actions/check-r-package v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/pkgdown.yaml actions

JamesIves/github-pages-deploy-action v4.4.1 composite
actions/checkout v3 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite

.github/workflows/test-coverage.yaml actions

actions/checkout v3 composite
actions/upload-artifact v3 composite
r-lib/actions/setup-r v2 composite
r-lib/actions/setup-r-dependencies v2 composite