correlationfunnel

Speed Up Exploratory Data Analysis (EDA)

https://github.com/business-science/correlationfunnel

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Keywords

correlation exploratory-analysis exploratory-data-analysis exploratory-data-visualizations r-package tidyverse
Last synced: 6 months ago · JSON representation

Repository

Speed Up Exploratory Data Analysis (EDA)

Basic Info
Statistics
  • Stars: 139
  • Watchers: 9
  • Forks: 29
  • Open Issues: 7
  • Releases: 0
Topics
correlation exploratory-analysis exploratory-data-analysis exploratory-data-visualizations r-package tidyverse
Created over 6 years ago · Last pushed about 2 years ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 300,
  message = F,
  warning = F
)

devtools::load_all()
library(tidyverse)
```


# correlationfunnel 
_by [Business Science](https://www.business-science.io/)_

[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
[![Travis build status](https://travis-ci.org/business-science/correlationfunnel.svg?branch=master)](https://travis-ci.org/business-science/correlationfunnel)
[![Coverage status](https://codecov.io/gh/business-science/correlationfunnel/branch/master/graph/badge.svg)](https://codecov.io/github/business-science/correlationfunnel?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/correlationfunnel)](https://cran.r-project.org/package=correlationfunnel)
![](http://cranlogs.r-pkg.org/badges/correlationfunnel?color=brightgreen)
![](http://cranlogs.r-pkg.org/badges/grand-total/correlationfunnel?color=brightgreen)

> Speed Up Exploratory Data Analysis (EDA)

The goal of `correlationfunnel` is to speed up Exploratory Data Analysis (EDA). Here's how to use it. 

## Installation

You can install the latest stable (CRAN) version of `correlationfunnel` with:

``` r
install.packages("correlationfunnel")
```


You can install the development version of `correlationfunnel` from [GitHub](https://github.com/business-science/) with:

``` r
devtools::install_github("business-science/correlationfunnel")
```

## Correlation Funnel in 2-Minutes

__Problem__:
Exploratory data analysis (EDA) involves looking at feature-target relationships independently. This process is very time consuming even for small data sets. ___Rather than search for relationships, what if we could let the relationships come to us?___



__Solution:__ 
Enter `correlationfunnel`. The package provides a __succinct workflow__ and __interactive visualization tools__ for understanding which features have relationships to target (response). 

__Main Benefits__:

1. __Speeds Up Exploratory Data Analysis__ 

2. __Improves Feature Selection__ 

3. __Gets You To Business Insights Faster__ 

## Example - Bank Marketing Campaign

The following example showcases the power of __fast exploratory correlation analysis__. The goal of the analysis is to determine which features relate to the bank's marketing campaign goal of having customers opt into a TERM DEPOSIT (financial product). 

We will see that using __3 functions__, we can quickly:

1. Transform the data into a binary format with `binarize()`

2. Perform correlation analysis using `correlate()`

3. Visualize the highest correlation features using `plot_correlation_funnel()`

__Result__: Rather than spend hours looking at individual plots of capaign features and comparing them to which customers opted in to the TERM DEPOSIT product, in seconds we can discover which groups of customers have enrolled, drastically speeding up EDA. 

### Getting Started

First, load the libraries.

```{r example}
library(correlationfunnel)
library(dplyr)
```

Next, collect data to analyze. We'll use Marketing Campaign Data for a Bank that was popularized by the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). We can load the data with `data("marketing_campaign_tbl")`.

```{r}
# Use ?marketing_campagin_tbl to get a description of the marketing campaign features
data("marketing_campaign_tbl")

marketing_campaign_tbl %>% glimpse()
```

### Response & Predictor Relationships

Modeling and Machine Learning problems often involve a response (Enrolled in `TERM_DEPOSIT`, yes/no) and many predictors (AGE, JOB, MARITAL, etc). Our job is to determine which predictors are related to the response. We can do this through __Binary Correlation Analysis__.

### Binary Correlation Analysis

Binary Correlation Analysis is the process of converting continuous (numeric) and categorical (character/factor) data to binary features. We can then perform a correlation analysis to see if there is predictive value between the features and the response (target).

#### Step 1: Convert to Binary Format

The first step is converting the continuous and categorical data into binary (0/1) format. We de-select any non-predictive features. The `binarize()` function then converts the features into binary features. 

- __Numeric Features:__ Are binned into ranges or if few unique levels are binned by their value, and then converted to binary features via one-hot encoding

- __Categorical Features__: Are binned by one-hot encoding

The result is a data frame that has only binary data with columns representing the bins that the observations fall into. Note that the output is shown in the `glimpse()` format. THere are now 80 columns that are binary (0/1).

```{r}
marketing_campaign_binarized_tbl <- marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize(n_bins = 4, thresh_infreq = 0.01)

marketing_campaign_binarized_tbl %>% glimpse()
```

#### Step 2: Perform Correlation Analysis

The second step is to perform a correlation analysis between the response (target = TERM_DEPOSIT_yes) and the rest of the features. This returns a specially formatted tibble with the feature, the bin, and the bin's correlation to the target. The format is exactly what we need for the next step - Producing the __Correlation Funnel__

```{r}
marketing_campaign_correlated_tbl <- marketing_campaign_binarized_tbl %>%
    correlate(target = TERM_DEPOSIT__yes)

marketing_campaign_correlated_tbl
```

#### Step 3: Visualize the Correlation Funnel

A __Correlation Funnel__ is an tornado plot that lists the highest correlation features (based on absolute magnitude) at the top of the and the lowest correlation features at the bottom. The resulting visualization looks like a Funnel. 

To produce the __Correlation Funnel__, use `plot_correlation_funnel()`. Try setting `interactive = TRUE` to get an interactive plot that can be zoomed in on.

```{r, fig.height=8}
marketing_campaign_correlated_tbl %>%
    plot_correlation_funnel(interactive = FALSE)
```


### Examining the Results

The most important features are towards the top. We can investigate these.

```{r, fig.height=3}
marketing_campaign_correlated_tbl %>%
    filter(feature %in% c("DURATION", "POUTCOME", "PDAYS", 
                          "PREVIOUS", "CONTACT", "HOUSING")) %>%
    plot_correlation_funnel(interactive = FALSE, limits = c(-0.4, 0.4))
```

We can see that the following prospect groups have a much greater correlation with enrollment in the TERM DEPOSIT product:

- When the DURATION, the amount of time a prospect is engaged in marketing campaign material, is 319 seconds or longer.

- When POUTCOME, whether or not a prospect has previously enrolled in a product, is "success".

- When CONTACT, the medium used to contact the person, is "cellular"

- When HOUSING, whether or not the contact has a HOME LOAN is "no"


## Other Great EDA Packages in R

The main addition of `correlationfunnel` is to quickly expose feature relationships to semi-processed data meaning missing (`NA`) values have been treated, date or date-time features have been feature engineered, and data is in a "clean" format (numeric data and categorical data are ready to be correlated to a Yes/No response). 

Here are several great EDA packages that can help you understand data issues (cleanliness) and get data preprared for Correlation Analysis!

- [Data Explorer](https://boxuancui.github.io/DataExplorer/) - Automates Exploration and Data Treatment. Amazing for investigating features quickly and efficiently including by data type, missing data, feature engineering, and identifying relationships. 

- [naniar](http://naniar.njtierney.com/) - For understanding missing data.

- [UpSetR](https://github.com/hms-dbmi/UpSetR) - For generating upset plots

- [GGally](https://ggobi.github.io/ggally/) - The `ggpairs()` function is one of my all-time favorites for visualizing many features quickly.


## Using Correlation Funnel? You Might Be Interested in Applied Business Education

[___Business Science___](https://www.business-science.io/) teaches students how to apply data science for business. The entire curriculum is crafted around business consulting with data science.  _Correlation Analysis_ is one of the many techniques that we teach in our curriculum. 
__Learn from our data science application experience with real-world business projects.__
   

### Learn from Real-World Business Projects

Students learn by solving real world projects using our repeatable project-management framework along with cutting-edge tools like the Correlation Analysis, Automated Machine Learning, and Feature Explanation as part of our ROI-Driven Data Science Curriculum.



- [__Learn Data Science Foundations (DS4B 101-R)__](https://university.business-science.io/p/ds4b-101-r-business-analysis-r): Learn the entire `tidyverse` (`dplyr`, `ggplot2`, `rmarkdown`, & more) and `parsnip` - Solve 2 Projects - Customer Segmentation and Price Optimization projects

- [__Learn Advanced Machine Learning & Business Consulting (DS4B 201-R)__](https://university.business-science.io/p/hr201-using-machine-learning-h2o-lime-to-predict-employee-turnover/): Churn Project solved with Correlation Analysis, `H2O` AutoML, `LIME` Feature Explanation, and ROI-driven Analysis / Recommendation Systems

- [__Learn Predictive Web Application Development (DS4B 102-R)__](https://university.business-science.io/p/ds4b-102-r-shiny-web-application-business-level-1/): Build 2 Predictive `Shiny` Web Apps - Sales Dashboard with Demand Forecasting & Price Prediction App

Owner

  • Name: Business Science
  • Login: business-science
  • Kind: organization
  • Email: info@business-science.io
  • Location: United States of America

Applying data science to business & financial analysis, tw: @bizScienc

GitHub Events

Total
  • Watch event: 7
  • Issue comment event: 1
  • Fork event: 3
Last Year
  • Watch event: 7
  • Issue comment event: 1
  • Fork event: 3

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 53
  • Total Committers: 2
  • Avg Commits per committer: 26.5
  • Development Distribution Score (DDS): 0.019
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Matt Dancho m****o@g****m 52
olivroy 5****y 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 2
  • Average time to close issues: 19 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 8
  • Total pull request authors: 2
  • Average comments per issue: 0.88
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ddalevi (1)
  • AnikMallick (1)
  • bokov (1)
  • UTexas80 (1)
  • paullevchuk (1)
  • ahmedkhancgt (1)
  • sfd99 (1)
  • Bryan-Hutchinson (1)
Pull Request Authors
  • h4rvey-g (2)
  • olivroy (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 588 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 2
  • Total maintainers: 1
cran.r-project.org: correlationfunnel

Speed Up Exploratory Data Analysis (EDA) with the Correlation Funnel

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 588 Last month
Rankings
Stargazers count: 3.4%
Forks count: 3.5%
Average: 14.7%
Downloads: 18.8%
Dependent repos count: 19.2%
Dependent packages count: 28.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.1 depends
  • cli * imports
  • crayon * imports
  • dplyr >= 1.0.0 imports
  • forcats * imports
  • ggplot2 * imports
  • ggrepel * imports
  • magrittr * imports
  • plotly * imports
  • purrr * imports
  • recipes * imports
  • rlang * imports
  • rstudioapi * imports
  • stats * imports
  • stringr * imports
  • tibble * imports
  • tidyr >= 1.0.0 imports
  • utils * imports
  • covr * suggests
  • knitr * suggests
  • lubridate * suggests
  • rmarkdown * suggests
  • scales * suggests
  • testthat >= 2.1.0 suggests