SmartEDA

SmartEDA: An R Package for Automated Exploratory Data Analysis - Published in JOSS (2019)

https://github.com/daya6489/smarteda

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

analysis exploratory-data-analysis
Last synced: 6 months ago

Repository

An R package for exploratory data analysis

Basic Info
Statistics
  • Stars: 45
  • Watchers: 4
  • Forks: 14
  • Open Issues: 1
  • Releases: 0
Created almost 7 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog License Code of conduct

README.md

SmartEDA


Background

In any sound statistical data analysis, the initial step has to be exploratory. Exploratory data analysis begins with univariate analysis, which examines one variable at a time. Next comes bivariate analysis, followed by multivariate analysis. The SmartEDA package produces a complete exploratory data analysis from a few function calls, instead of lengthy hand-written R code.

Functionalities of SmartEDA

The SmartEDA R package provides four main functionalities:

  • Descriptive statistics
  • Data visualization
  • Custom table
  • HTML EDA report


Comparison with other packages

The SmartEDA package is compared with other similar packages available on CRAN for exploratory data analysis, namely dlookr, DataExplorer, Hmisc, exploreR, RtutoR and summarytools. The evaluation criterion is the availability of the features needed to perform an exploratory data analysis.


Journal of Open Source Software Article

An article describing the SmartEDA package and its approach to exploratory data analysis has been published on arXiv and in the Journal of Open Source Software (JOSS). Please cite the paper if you use SmartEDA in your work!

Installation

The package can be installed directly from CRAN.

```R
install.packages("SmartEDA")
```

You can install the latest development version of SmartEDA from GitHub with:

```R
install.packages("devtools")
devtools::install_github("daya6489/SmartEDA", ref = "develop")
```

Example

Data

In this vignette, we will be using a simulated data set containing sales of child car seats at 400 different stores.

Data source: ISLR package.

Install the package "ISLR" to get the example data set.

```R
install.packages("ISLR")
library("ISLR")
install.packages("SmartEDA")
library("SmartEDA")

## Load sample dataset from ISLR package
Carseats <- ISLR::Carseats
```

Overview of the data

Understand the dimensions of the data set, the variable names, the overall missing-value summary and the data type of each variable.

```R
# Overview of the data
ExpData(data=Carseats,type=1)

# Structure of the data
ExpData(data=Carseats,type=2)
```
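To make the idea of a data overview concrete, here is a minimal base R sketch that collects the type, missing-value count and distinct-value count of each variable. It is not SmartEDA's implementation: the helper name `overview_sketch` is hypothetical, and the built-in `mtcars` data stands in for `Carseats` so that no extra package is needed.

```R
# Hypothetical helper: a rough per-variable overview using base R only.
overview_sketch <- function(data) {
  data.frame(
    variable   = names(data),
    type       = vapply(data, function(x) class(x)[1], character(1)),
    n_missing  = vapply(data, function(x) sum(is.na(x)), integer(1)),
    n_distinct = vapply(data, function(x) length(unique(x)), integer(1)),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
}

ov <- overview_sketch(mtcars)
dim(mtcars)  # number of rows and columns
print(ov)
```

`ExpData` reports comparable information in a single call; the sketch only shows which quantities are involved.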

Summary of numerical variables

To summarise the numeric variables, you can use the following R code from this package:

```R
# Summary statistics: overall
ExpNumStat(Carseats,by="A",gp=NULL,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)

# Summary statistics: overall, with correlation
ExpNumStat(Carseats,by="A",gp="Price",Qnt=seq(0,1,0.1),MesofShape=1,Outlier=TRUE,round=2)

# Summary statistics: by category
ExpNumStat(Carseats,by="GA",gp="Urban",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
```
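As a rough illustration of the quantities behind a numeric summary, the base R sketch below computes deciles (the same grid as `Qnt = seq(0, 1, 0.1)`) and moment-based shape measures for one variable. The helper `shape_sketch` is hypothetical and uses one common definition of skewness and excess kurtosis; SmartEDA may use a different estimator. The built-in `mtcars` data is used so the example runs without extra packages.

```R
# Hypothetical helper: moment-based shape measures for one numeric vector.
shape_sketch <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))              # population standard deviation
  c(
    skewness = mean((x - m)^3) / s^3,
    kurtosis = mean((x - m)^4) / s^4 - 3  # excess kurtosis
  )
}

# Deciles of one variable
deciles <- quantile(mtcars$mpg, probs = seq(0, 1, 0.1))
shape   <- shape_sketch(mtcars$mpg)

print(round(deciles, 2))
print(round(shape, 2))
```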

Weighted summary for numerical variables

```R
ExpNumStat(mtcars,by="A",round=2, weight = "wt")
```
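To show what a weight column changes, the following base R sketch compares the unweighted and weighted mean of one variable, using `wt` as the weight as in the call above. It only illustrates the idea of weighting and is not taken from SmartEDA's source.

```R
# Weighted vs. unweighted mean of mpg, using car weight (wt) as the weight.
unweighted <- mean(mtcars$mpg)
weighted   <- weighted.mean(mtcars$mpg, w = mtcars$wt)

# The same weighted mean written out explicitly
manual <- sum(mtcars$mpg * mtcars$wt) / sum(mtcars$wt)

print(c(unweighted = unweighted, weighted = weighted, manual = manual))
```

Because heavier cars tend to have lower `mpg`, giving them more weight pulls the weighted mean below the unweighted one.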

Graphical representation of all numeric features

```R
# Generate Boxplot by category
ExpNumViz(mtcars,target="gear",type=2,nlim=25,fname = file.path(tempdir(),"Mtcars2"),Page = c(2,2))

# Generate Density plot
ExpNumViz(mtcars,target=NULL,type=3,nlim=25,fname = file.path(tempdir(),"Mtcars3"),Page = c(2,2))

# Generate Scatter plot
ExpNumViz(mtcars,target="carb",type=3,nlim=25,fname = file.path(tempdir(),"Mtcars4"),Page = c(2,2))
```

Summary of Categorical variables

```R
# Frequency or custom tables for categorical variables
ExpCTable(Carseats,Target=NULL,margin=1,clim=10,nlim=5,round=2,bin=NULL,per=TRUE)
ExpCTable(Carseats,Target="Price",margin=1,clim=10,nlim=NULL,round=2,bin=4,per=FALSE)
ExpCTable(Carseats,Target="Urban",margin=1,clim=10,nlim=NULL,round=2,bin=NULL,per=FALSE)

# Summary statistics of categorical variables
ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=10,nlim=5,Pclass="Yes")

# Information value and odds value
ExpCatStat(Carseats,Target="Urban",result = "IV",clim=10,nlim=5,Pclass="Yes")
```

Weighted count for categorical variables

```R
ExpCTable(mtcars, margin = 1, clim = 10, nlim = 3, bin = NULL, per = FALSE, weight = "wt")
```
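A weighted count can be read as the sum of a weight column within each category, rather than the number of rows. The base R sketch below shows that interpretation for `cyl` with `wt` as the weight; it is an assumption about what "weighted count" means here, not SmartEDA's code.

```R
# Unweighted counts: number of cars per cylinder class
counts <- table(mtcars$cyl)

# Weighted counts: sum of the weight column (wt) within each class
weighted_counts <- xtabs(wt ~ cyl, data = mtcars)

print(counts)
print(weighted_counts)
```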

Graphical representation of all categorical variables

```R
# Column chart
ExpCatViz(Carseats,target="Urban",fname=NULL,clim=10,col=NULL,margin=2,Page = c(2,1),sample=2)

# Stacked bar graph
ExpCatViz(Carseats,target="Urban",fname=NULL,clim=10,col=NULL,margin=2,Page = c(2,1),sample=2)

# Variable importance graph using information values
ExpCatStat(Carseats,Target="Urban",result="Stat",Pclass="Yes",plot=TRUE,top=20,Round=2)
```

Variable importance based on Information value

```R
ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=10,nlim=5,bins=10,Pclass="Yes",plot=TRUE,top=10,Round=2)
```
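For readers unfamiliar with information value (IV), here is a minimal base R sketch of the textbook definition for a categorical predictor and a binary target: the sum over levels of (share of events minus share of non-events) times the log of their ratio. The helper `iv_sketch` is hypothetical, it does not handle levels with a zero count in either class, and SmartEDA's own binning and calculation may differ.

```R
# Hypothetical helper: information value of a categorical predictor x
# for a binary target y, where `positive` marks the event class.
iv_sketch <- function(x, y, positive) {
  event <- factor(y == positive, levels = c(FALSE, TRUE))
  tab   <- table(x, event)
  p_event    <- tab[, "TRUE"]  / sum(tab[, "TRUE"])   # share of events per level
  p_nonevent <- tab[, "FALSE"] / sum(tab[, "FALSE"])  # share of non-events per level
  woe <- log(p_event / p_nonevent)                    # weight of evidence per level
  sum((p_event - p_nonevent) * woe)
}

# Example with built-in data: how informative is cyl about am (1 = manual)?
iv_cyl <- iv_sketch(mtcars$cyl, mtcars$am, positive = 1)
print(iv_cyl)
```

A predictor whose levels have the same distribution in both classes gets an IV of zero; larger values indicate stronger separation.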

Create HTML EDA report

Create an exploratory data analysis report in HTML format.

```R
ExpReport(Carseats,Target="Urban",label=NULL,op_file="test.html",op_dir=getwd(),sc=2,sn=2,Rc="Yes")
```

Quantile-quantile plot for numeric variables

```R
# Assumption: CData refers to the Carseats data loaded above
CData <- Carseats
ExpOutQQ(CData,nlim=10,fname=NULL,Page=c(2,2),sample=4)
```
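For comparison, a normal quantile-quantile plot for a handful of numeric variables can be drawn with base graphics alone. The sketch below lays out four plots on a 2 x 2 page, mirroring `Page=c(2,2)`; the choice of variables and the use of `mtcars` are illustrative assumptions.

```R
# Normal Q-Q plots for four numeric variables on one 2 x 2 page.
num_vars <- c("mpg", "disp", "hp", "wt")

old_par <- par(mfrow = c(2, 2))
qq <- lapply(num_vars, function(v) {
  res <- qqnorm(mtcars[[v]], main = v)  # returns theoretical and sample quantiles
  qqline(mtcars[[v]])                   # reference line through the quartiles
  res
})
par(old_par)
names(qq) <- num_vars
```

Points that bend away from the reference line suggest departures from normality in that variable.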

Parallel Co-ordinate plots

```R
# CData is assumed to hold the Carseats data, e.g. CData <- Carseats

# Default ExpParcoord function
ExpParcoord(CData,Group=NULL,Stsize=NULL,Nvar=c("Price","Income","Advertising","Population","Age","Education"))

# With stratified rows and selected columns only
ExpParcoord(CData,Group="ShelveLoc",Stsize=c(10,15,20),Nvar=c("Price","Income"),Cvar=c("Urban","US"))

# Without stratification
ExpParcoord(CData,Group="ShelveLoc",Nvar=c("Price","Income"),Cvar=c("Urban","US"),scale=NULL)

# Scale change
ExpParcoord(CData,Group="US",Nvar=c("Price","Income"),Cvar=c("ShelveLoc"),scale="std")

# Selected numeric variables
ExpParcoord(CData,Group="ShelveLoc",Stsize=c(10,15,20),Nvar=c("Price","Income","Advertising","Population","Age","Education"))

# Selected categorical variables
ExpParcoord(CData,Group="US",Stsize=c(15,50),Cvar=c("ShelveLoc","Urban"))
```

Two independent plots side by side for the same variable

Plot the same variable twice, side by side: once without a target (target = NULL) and once with a categorical target (binary or multi-class).

```R
target <- "gear"
categoricalfeatures <- c("vs", "am", "carb")
numericalfeatures <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")

num1 <- ExpTwoPlots(mtcars,
                    plottype = "numeric",
                    ivvariables = numericalfeatures,
                    target = "gear",
                    lparglist = list(alpha=0.5, color = "red", fill= "white", binwidth=1),
                    lpgeomtype = 'histogram',
                    rparglist = list(alpha=0.5, fill = c("red", "orange", "pink"), binwidth=1),
                    rpgeomtype = 'histogram',
                    fname = "dub2.pdf",
                    page = c(2,1),
                    theme = "Default")
```

Univariate Outlier analysis

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Identifying outliers: There are several methods we can use to identify outliers. ExpOutliers uses two of them: (1) the boxplot method and (2) the standard deviation method.


```R
# Identifying outliers, method: boxplot
ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "boxplot", capping = c(0.1, 0.9))

# Identifying outliers, method: 3 standard deviations
ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev", capping = c(0.1, 0.9))

# Identifying outliers, method: 2 standard deviations
ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "2xStDev", capping = c(0.1, 0.9))

# Create outlier flag (1,0) if there are any outliers
ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev", capping = c(0.1, 0.9), outflag = TRUE)

# Impute outlier values by the mean or median value
ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev", treatment = "mean", capping = c(0.1, 0.9), outflag = TRUE)
```
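The two detection rules can be sketched in a few lines of base R. The helper `flag_outliers` below is hypothetical and assumes the conventional definitions: the boxplot rule flags values beyond 1.5 times the interquartile range from the quartiles, and the standard deviation rule flags values more than `k` standard deviations from the mean. SmartEDA's exact thresholds and treatment options are not reproduced here.

```R
# Hypothetical helper: return a 1/0 outlier flag for a numeric vector.
# method = "boxplot" uses the 1.5 * IQR fences; "sd" uses mean +/- k * sd.
flag_outliers <- function(x, method = c("boxplot", "sd"), k = 3) {
  method <- match.arg(method)
  if (method == "boxplot") {
    q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    fence <- 1.5 * (q[2] - q[1])
    lower <- q[1] - fence
    upper <- q[2] + fence
  } else {
    lower <- mean(x, na.rm = TRUE) - k * sd(x, na.rm = TRUE)
    upper <- mean(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
  }
  as.integer(x < lower | x > upper)
}

x <- c(10, 12, 11, 13, 12, 11, 95)
flag_box <- flag_outliers(x, "boxplot")
flag_sd2 <- flag_outliers(x, "sd", k = 2)
flag_sd3 <- flag_outliers(x, "sd", k = 3)
```

In this small sample the value 95 is flagged by the boxplot rule and by the 2 standard deviation rule, but not by the 3 standard deviation rule, because the extreme value inflates the standard deviation itself. This is one reason to compare methods before treating outliers.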

Exploratory analysis - Custom tables, summary statistics

Descriptive summary of all input variables for each level or combination of levels of the grouping variables. Rows (cases) of the data can also be filtered while running the analysis.

```R
# Categorical variables only
ExpCustomStat(Carseats,Cvar=c("US","Urban","ShelveLoc"),gpby=FALSE)
ExpCustomStat(Carseats,Cvar=c("US","Urban"),gpby=TRUE,filt=NULL)
ExpCustomStat(Carseats,Cvar=c("US","Urban","ShelveLoc"),gpby=TRUE,filt=NULL)
ExpCustomStat(Carseats,Cvar=c("US","Urban"),gpby=TRUE,filt="Population>150")
ExpCustomStat(Carseats,Cvar=c("US","ShelveLoc"),gpby=TRUE,filt="Urban=='Yes' & Population>150")

# Numeric variables only
# (data_sam is not defined in this README; substitute your own data frame)
ExpCustomStat(Carseats,Nvar=c("Population","Sales","CompPrice","Income"),stat = c('Count','mean','sum','var','min','max'))
ExpCustomStat(Carseats,Nvar=c("Population","Sales","CompPrice","Income"),stat = c('min','p0.25','median','p0.75','max'))
ExpCustomStat(Carseats,Nvar=c("Population","Sales","CompPrice","Income"),stat = c('Count','mean','sum','var'),filt="Urban=='Yes'")
ExpCustomStat(Carseats,Nvar=c("Population","Sales","CompPrice","Income"),stat = c('Count','mean','sum'),filt="Urban=='Yes' & Population>150")
ExpCustomStat(data_sam,Nvar=c("Population","Sales","CompPrice","Income"),stat = c('Count','mean','sum','min'),filt="All %ni% c(999,-9)")
ExpCustomStat(Carseats,Nvar=c("Population","Sales","CompPrice","Education","Income"),stat = c('Count','mean','sum','var','sd','IQR','median'),filt=c("ShelveLoc=='Good'^Urban=='Yes'^Price>=150^ ^US=='Yes'"))

# Categorical and numeric variables together
ExpCustomStat(Carseats,Cvar = c("Urban","ShelveLoc"), Nvar=c("Population","Sales"), stat = c('Count','Prop','mean','min','P0.25','median','p0.75','max'),gpby=FALSE)
ExpCustomStat(Carseats,Cvar = c("Urban","US","ShelveLoc"), Nvar=c("CompPrice","Income"), stat = c('Count','Prop','mean','sum','PS','min','max','IQR','sd'), gpby = TRUE)
ExpCustomStat(Carseats,Cvar = c("Urban","US","ShelveLoc"), Nvar=c("CompPrice","Income"), stat = c('Count','Prop','mean','sum','PS','P0.25','median','p0.75'), gpby = TRUE,filt="Urban=='Yes'")
ExpCustomStat(data_sam,Cvar = c("Urban","US","ShelveLoc"), Nvar=c("Sales","CompPrice","Income"), stat = c('Count','Prop','mean','sum','PS'), gpby = TRUE,filt="All %ni% c(888,999)")
ExpCustomStat(Carseats,Cvar = c("Urban","US"), Nvar=c("Population","Sales","CompPrice"), stat = c('Count','Prop','mean','sum','var','min','max'), filt=c("ShelveLoc=='Good'^Urban=='Yes'^Price>=150"))
```
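The pattern behind these calls (filter rows, group by categorical variables, compute several statistics of a numeric variable) can be sketched with base R's `subset` and `aggregate`. The example uses the built-in `mtcars` data and an arbitrary filter `hp > 100`; both are illustrative assumptions, and the output layout differs from what `ExpCustomStat` returns.

```R
# Step 1: filter rows (comparable in spirit to the filt argument)
filtered <- subset(mtcars, hp > 100)

# Step 2: summarise mpg for each combination of am and gear
by_group <- aggregate(
  mpg ~ am + gear,
  data = filtered,
  FUN  = function(v) c(count = length(v), mean = mean(v), max = max(v))
)

# aggregate() stores the statistics in a matrix column; flatten it
by_group <- do.call(data.frame, by_group)
print(by_group)
```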

Articles

See article wiki page.

JOSS Publication

SmartEDA: An R Package for Automated Exploratory Data Analysis
Published
September 04, 2019
Volume 4, Issue 41, Page 1509
Authors
Sayan Putatunda
VMware Software India Pvt ltd.
Dayananda Ubrangala
VMware Software India Pvt ltd.
Kiran Rama
VMware Software India Pvt ltd.
Ravi Kondapalli
VMware Software India Pvt ltd.
Editor
Melissa Gymrek
Tags
Exploratory Data Analysis Data Mining

GitHub Events

Total
  • Watch event: 4
  • Fork event: 1
Last Year
  • Watch event: 4
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 33
  • Total Committers: 3
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.182
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
daya6489 d****9@g****m 27
Dayananda Ubrangala d****a@v****m 3
Dayananda Ubrangala d****a@m****m 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 12
  • Total pull requests: 0
  • Average time to close issues: 5 months
  • Average time to close pull requests: N/A
  • Total issue authors: 12
  • Total pull request authors: 0
  • Average comments per issue: 2.17
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • fkohrt (1)
  • lewishounkpevi (1)
  • sausagesky (1)
  • mafaldavs (1)
  • bvittrant (1)
  • jpiversen (1)
  • Vaishnavikumara (1)
  • tlilburn (1)
  • SugarRayLua (1)
  • araikes (1)
  • pfv07 (1)
  • wahid18benz (1)

Packages

  • Total packages: 2
  • Total downloads:
    • cran 831 last-month
  • Total dependent packages: 1
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 14
  • Total maintainers: 1
cran.r-project.org: SmartEDA

Summarize and Explore the Data

  • Versions: 13
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 831 Last month
Rankings
Downloads: 3.0%
Forks count: 6.3%
Stargazers count: 8.3%
Average: 11.9%
Dependent packages count: 18.2%
Dependent repos count: 23.9%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-smarteda
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 41.8%
Average: 42.0%
Forks count: 42.1%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.3.0 depends
  • GGally * imports
  • ISLR >= 1.0 imports
  • data.table * imports
  • ggplot2 * imports
  • gridExtra * imports
  • qpdf * imports
  • rmarkdown * imports
  • sampling * imports
  • scales * imports
  • DataExplorer * suggests
  • covr * suggests
  • knitr * suggests
  • psych * suggests
  • testthat * suggests