Introducing Correlation Funnel - Customer Churn Example

Speed Up Exploratory Data Analysis (EDA) with correlationfunnel

The goal of correlationfunnel is to help data scientists speed up Exploratory Data Analysis (EDA). EDA can be an incredibly time-consuming process.

Problem

Traditional approaches to EDA are labor-intensive: the data scientist reviews each of the features (predictors) in the data set for a relationship to the target (i.e. the goal or response). This process of manually building many visualizations and searching for relationships can take hours.

Solution

Correlation Analysis on data that has been preprocessed (more on this shortly) can drastically speed up EDA by identifying key features that relate to the target. The key is getting the features into the “right format”. This is where correlationfunnel helps.

The correlationfunnel package includes a streamlined 3-step process for preparing data and performing visual Correlation Analysis. The visualization produced uncovers insights by elevating high-correlation features and lowering low-correlation features. The shape looks like a funnel (hence the name “Correlation Funnel”), making it very efficient to see which features are most likely to provide business insights and lend themselves well to a machine learning model.

Main Benefits

  1. Speeds Up Exploratory Data Analysis - You can drastically increase the speed at which you perform Exploratory Data Analysis (EDA) by using Correlation Analysis to focus on key features (rather than investigating all features).

  2. Improves Feature Selection - Correlation helps you determine whether you have good features before spending significant time developing Machine Learning models.

  3. Gets You To Business Insights Faster - Understanding how features are related to a target variable can help you develop the story in the data (aka business insights).

Correlation Funnel Process

The Correlation Funnel process uses 3 functions:

  1. Transform the data into a binary format with binarize() - This step converts semi-processed data into a binary format, which is optimal for correlation analysis

  2. Perform correlation analysis using correlate() - This step correlates the “binarized” data (binary features) with the target

  3. Visualize the feature-target relationships using plot_correlation_funnel() - This step produces the visualization from which we can get business insights

Example - Customer Churn

We’ll step through an example of understanding which features are related to Customer Churn.

Load the necessary libraries.

library(correlationfunnel)
library(dplyr)

Get the customer_churn_tbl dataset. The dataset contains a number of features related to a telecommunications company’s customer base and whether or not each customer has churned. The target is “Churn”.

data("customer_churn_tbl")

customer_churn_tbl %>% glimpse()
#> Rows: 7,043
#> Columns: 21
#> $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOC…
#> $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female"…
#> $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Ye…
#> $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No…
#> $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, …
#> $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No",…
#> $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", …
#> $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber op…
#> $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", …
#> $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "…
#> $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "…
#> $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
#> $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Y…
#> $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Ye…
#> $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One…
#> $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No",…
#> $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", …
#> $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.…
#> $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 194…
#> $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "…

Step 1 - Prepare Data as Binary Features

We use the binarize() function to produce a feature set of binary (0/1) variables. Numeric data are binned (using n_bins) into categorical data, then all categorical data are one-hot encoded to produce binary features. To prevent low-frequency categories (common in high-cardinality features) from increasing the dimensionality (the width of the resulting data frame), we use thresh_infreq = 0.01 and name_infreq = "OTHER" to group the categories that fall below the threshold into a single "OTHER" category.

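customer_churn_binarized_tbl <- customer_churn_tbl %>%
  # Drop the ID column: it identifies rows and is not a feature
  select(-customerID) %>%
  # Fill missing TotalCharges values with MonthlyCharges
  mutate(TotalCharges = ifelse(is.na(TotalCharges), MonthlyCharges, TotalCharges)) %>%
  # Bin numeric columns into 5 bins, group rare categories, then one-hot encode
  binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE)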

customer_churn_binarized_tbl %>% glimpse()
#> Rows: 7,043
#> Columns: 60
#> $ gender__Female                             <dbl> 1, 0, 0, 0, 1, 1, 0, 1, 1,…
#> $ gender__Male                               <dbl> 0, 1, 1, 1, 0, 0, 1, 0, 0,…
#> $ SeniorCitizen__0                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ SeniorCitizen__1                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Partner__No                                <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0,…
#> $ Partner__Yes                               <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ Dependents__No                             <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1,…
#> $ Dependents__Yes                            <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
#> $ `tenure__-Inf_6`                           <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0,…
#> $ tenure__6_20                               <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0,…
#> $ tenure__20_40                              <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1,…
#> $ tenure__40_60                              <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ tenure__60_Inf                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ PhoneService__No                           <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ PhoneService__Yes                          <dbl> 0, 1, 1, 0, 1, 1, 1, 0, 1,…
#> $ MultipleLines__No                          <dbl> 0, 1, 1, 0, 1, 0, 0, 0, 0,…
#> $ MultipleLines__No_phone_service            <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ MultipleLines__Yes                         <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1,…
#> $ InternetService__DSL                       <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0,…
#> $ InternetService__Fiber_optic               <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1,…
#> $ InternetService__No                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ OnlineSecurity__No                         <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1,…
#> $ OnlineSecurity__No_internet_service        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ OnlineSecurity__Yes                        <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 0,…
#> $ OnlineBackup__No                           <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1,…
#> $ OnlineBackup__No_internet_service          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ OnlineBackup__Yes                          <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0,…
#> $ DeviceProtection__No                       <dbl> 1, 0, 1, 0, 1, 0, 1, 1, 0,…
#> $ DeviceProtection__No_internet_service      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ DeviceProtection__Yes                      <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 1,…
#> $ TechSupport__No                            <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0,…
#> $ TechSupport__No_internet_service           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ TechSupport__Yes                           <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1,…
#> $ StreamingTV__No                            <dbl> 1, 1, 1, 1, 1, 0, 0, 1, 0,…
#> $ StreamingTV__No_internet_service           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ StreamingTV__Yes                           <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1,…
#> $ StreamingMovies__No                        <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0,…
#> $ StreamingMovies__No_internet_service       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ StreamingMovies__Yes                       <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1,…
#> $ `Contract__Month-to-month`                 <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 1,…
#> $ Contract__One_year                         <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,…
#> $ Contract__Two_year                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ PaperlessBilling__No                       <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 0,…
#> $ PaperlessBilling__Yes                      <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1,…
#> $ `PaymentMethod__Bank_transfer_(automatic)` <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ `PaymentMethod__Credit_card_(automatic)`   <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
#> $ PaymentMethod__Electronic_check            <dbl> 1, 0, 0, 0, 1, 1, 0, 0, 1,…
#> $ PaymentMethod__Mailed_check                <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0,…
#> $ `MonthlyCharges__-Inf_25.05`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ MonthlyCharges__25.05_58.83                <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0,…
#> $ MonthlyCharges__58.83_79.1                 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ MonthlyCharges__79.1_94.25                 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
#> $ MonthlyCharges__94.25_Inf                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1,…
#> $ `TotalCharges__-Inf_265.32`                <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0,…
#> $ TotalCharges__265.32_939.78                <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0,…
#> $ TotalCharges__939.78_2043.71               <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0,…
#> $ TotalCharges__2043.71_4471.44              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,…
#> $ TotalCharges__4471.44_Inf                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ Churn__No                                  <dbl> 1, 1, 0, 1, 0, 0, 1, 1, 0,…
#> $ Churn__Yes                                 <dbl> 0, 0, 1, 0, 1, 1, 0, 0, 1,…
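
To make the transformations in this step concrete, the sketch below reproduces the idea in base R on a small made-up sample. It is not how binarize() is implemented; the vectors `tenure` and `payment`, the helper `one_hot()`, the two-bin split, and the 20% threshold are all assumptions chosen for illustration.

```r
# Toy inputs (hypothetical values, not taken from customer_churn_tbl)
tenure  <- c(1, 34, 2, 45, 2, 8, 22, 10, 28, 62)
payment <- c("check", "check", "card", "check", "card",
             "card", "check", "card", "check", "wire")

# 1. Bin the numeric vector at its quantiles (2 bins here for readability),
#    opening the outer edges so every value falls into a bin.
breaks <- unname(quantile(tenure, probs = c(0, 0.5, 1)))
breaks[c(1, length(breaks))] <- c(-Inf, Inf)
tenure_bin <- cut(tenure, breaks = breaks)

# 2. Lump categories that occur in fewer than 20% of rows into "OTHER".
freq <- table(payment) / length(payment)
rare <- names(freq)[freq < 0.20]
payment_lumped <- ifelse(payment %in% rare, "OTHER", payment)

# 3. One-hot encode: one 0/1 column per level (hypothetical helper).
one_hot <- function(x, prefix) {
  lv  <- sort(unique(as.character(x)))
  out <- sapply(lv, function(l) as.integer(x == l))
  colnames(out) <- paste0(prefix, "__", lv)
  out
}

# Combine the encoded columns into one binary matrix
binary <- cbind(one_hot(tenure_bin, "tenure"),
                one_hot(payment_lumped, "payment"))
binary
```

Each row ends up with exactly one 1 per original variable, which is the same pattern visible in the glimpse() output above (for example, gender__Female and gender__Male).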

Step 2 - Correlate to the Target

Next, we use correlate() to correlate the binary features to a target (in our case Customer Churn).

customer_churn_corr_tbl <- customer_churn_binarized_tbl %>%
  correlate(Churn__Yes)

customer_churn_corr_tbl
#> # A tibble: 60 x 3
#>    feature         bin              correlation
#>    <fct>           <chr>                  <dbl>
#>  1 Churn           No                    -1    
#>  2 Churn           Yes                    1    
#>  3 Contract        Month-to-month         0.405
#>  4 OnlineSecurity  No                     0.343
#>  5 TechSupport     No                     0.337
#>  6 tenure          -Inf_6                 0.309
#>  7 InternetService Fiber_optic            0.308
#>  8 Contract        Two_year              -0.302
#>  9 PaymentMethod   Electronic_check       0.302
#> 10 OnlineBackup    No                     0.268
#> # … with 50 more rows
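
Conceptually, this step computes a correlation between each 0/1 feature column and the 0/1 target. The sketch below does the same on a tiny made-up table with base R's cor(), which uses Pearson correlation by default; whether that matches the default method of correlate() is an assumption here, so check the function's documentation. The column names and values are hypothetical.

```r
# Toy binarized data (hypothetical values, not the real churn data)
binary_tbl <- data.frame(
  contract__monthly  = c(1, 0, 1, 0, 1, 1, 1, 1),
  contract__two_year = c(0, 1, 0, 1, 0, 0, 0, 0),
  support__no        = c(1, 0, 1, 1, 1, 0, 0, 1),
  churn__yes         = c(0, 0, 1, 0, 1, 1, 0, 0)
)

target   <- "churn__yes"
features <- setdiff(names(binary_tbl), target)

# Pearson correlation of each binary feature with the binary target
correlation <- vapply(
  binary_tbl[features],
  function(col) cor(col, binary_tbl[[target]]),
  numeric(1)
)

# Long-format result, strongest relationships (by absolute value) first
corr_tbl <- data.frame(feature = features, correlation = unname(correlation))
corr_tbl <- corr_tbl[order(-abs(corr_tbl$correlation)), ]
corr_tbl
```

Because the two contract columns in this toy table are complements of each other, their correlations with the target are equal in magnitude and opposite in sign.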

Step 3 - Plot the Correlation Funnel

Finally, we visualize the correlation using the plot_correlation_funnel() function.

customer_churn_corr_tbl %>%
  plot_correlation_funnel()
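
The funnel shape comes from ordering: features whose bins reach the largest absolute correlation sit at the top, and weaker features sit below. The base R sketch below illustrates that ordering idea on made-up values. It is not the package's plotting code; the helper `plot_funnel_sketch()` and the feature names, bins, and correlations are hypothetical.

```r
# Toy correlation table (hypothetical features and values)
corr_tbl <- data.frame(
  feature     = c("feature_a", "feature_a", "feature_b", "feature_b",
                  "feature_c", "feature_c"),
  bin         = c("low", "high", "no", "yes", "x", "y"),
  correlation = c(0.40, -0.30, 0.15, -0.15, 0.02, -0.02)
)

# Hypothetical helper: draws one point per bin and returns the feature order
plot_funnel_sketch <- function(tbl) {
  # Rank features by the largest absolute correlation among their bins
  max_abs <- vapply(split(abs(tbl$correlation), tbl$feature), max, numeric(1))
  feature_order <- names(sort(max_abs))  # weakest first = bottom of the plot

  # Map each row to a y position; the strongest feature gets the highest value
  y_pos <- match(tbl$feature, feature_order)

  # x = correlation, y = feature, with each point labelled by its bin
  plot(tbl$correlation, y_pos, xlim = c(-1, 1), yaxt = "n", pch = 19,
       xlab = "correlation", ylab = "")
  axis(2, at = seq_along(feature_order), labels = feature_order, las = 1)
  abline(v = 0, lty = 2)
  text(tbl$correlation, y_pos, labels = tbl$bin, pos = 3, cex = 0.8)

  invisible(feature_order)
}
```

Calling plot_funnel_sketch(corr_tbl) draws the points for feature_a spread widest at the top and those for feature_c clustered near zero at the bottom, which is what gives the plot its funnel outline.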

Business Insights

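We can see that the following features are positively correlated with Churn (values from the Step 2 output): a Month-to-month Contract (0.405), no OnlineSecurity (0.343), no TechSupport (0.337), tenure in the lowest bin, -Inf_6 (0.309), Fiber optic InternetService (0.308), payment by Electronic check (0.302), and no OnlineBackup (0.268).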

Features with a negative correlation are associated with Staying (No Churn). Among the top rows of the Step 2 output, this is the Two-year Contract (-0.302).

We can then develop a strategy to retain high-risk customers:

Conclusion

The correlationfunnel package provides a 3-step workflow that streamlines the EDA process, helps with feature selection, and improves the ease of obtaining Business Insights.

More Information

To learn about the inner workings of correlationfunnel and the key considerations for using it, please read the Key Considerations and FAQs.