Lending Club Loan Data Analysis and Prediction

Project Outlines:

1. Define goal

a. Identify the most important attributes separating bad loans from good loans
b. Build an XGBoost model to make predictions

2. EDA

a. Data first impressions
  - available predictor variables
  - data distribution
  - missing values
b. Split dataset into training and test sets
c. Correlation analysis and removal of multicollinearity
d. Data visualization to explore relationships between the target and predictor variables

3. Data Cleansing and Feature Engineering

a. Handle missing values
b. Transform character or categorical variables into numeric
c. Create new features from existing features

4. Prepare dataset for modeling:

- Standard scaling
- Handle dataset imbalance via upsampling

5. Model Training and Evaluation

a. Logistic regression with L1 regularization
b. Random Forest model
c. XGBoost model
    - hyperparameter tuning (a minimal pipeline sketch of steps 4-6 follows this outline)

6. Feature Selection Considerations

- Remove variables according to correlation analysis
- Logistic regression with L1 regularization (keep variables with non-zero coefficients)
- Random Forest model built-in feature importance

7. Assess any additional feature engineering or feature selection opportunities based on model results

8. Choose the best model and evaluate predictions on the test dataset

9. Areas of improvement
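
As a rough preview of steps 4-6, the sketch below shows what the scaling-plus-models pipeline could look like. X_toy and y_toy are placeholder toy data standing in for the cleaned, encoded Lending Club features produced in the later notebooks, and the hyperparameter grid is illustrative only.

# Minimal sketch of the modeling pipeline planned in steps 4-6 (toy data only)
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 5))                 # stands in for cleaned numeric features
y_toy = rng.binomial(1, 0.2, size=200)            # imbalanced binary target (bad-loan flag)

X_scaled = StandardScaler().fit_transform(X_toy)  # step 4: standard scaling

models = {
    'logit_l1': LogisticRegression(penalty='l1', solver='liblinear'),          # step 5a / 6: L1 keeps only non-zero coefficients
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0), # step 5b / 6: built-in feature importance
    'xgboost': GridSearchCV(XGBClassifier(), {'max_depth': [3, 5]},
                            cv=3, scoring='roc_auc'),                          # step 5c: hyperparameter tuning
}
for name, model in models.items():
    model.fit(X_scaled, y_toy)
    print(name, 'trained')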

Part-1

The project is split into three notebooks. This notebook is part one, focusing on exploratory data analysis (EDA).

Data Source

Link to data

In [2]:
# Standard library
import os
import re
import math
import pickle
import warnings

# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.offsetbox import AnchoredText
import seaborn as sns

# Modeling libraries
from sklearn import metrics
from sklearn.linear_model import (Ridge, Lasso, RidgeCV, ElasticNet, LassoCV,
                                  LassoLarsCV, LogisticRegression)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
import xgboost as xgb
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE, SMOTENC, RandomOverSampler

# Notebook display settings
matplotlib.rcParams.update({'font.size': 10})
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
In [4]:
# load data
data = pd.read_csv("./902_370089_bundle_archive/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv")
In [5]:
data.shape
Out[5]:
(2260701, 151)
In [6]:
data.head()
Out[6]:
id member_id loan_amnt funded_amnt funded_amnt_inv term int_rate installment grade sub_grade ... hardship_payoff_balance_amount hardship_last_payment_amount disbursement_method debt_settlement_flag debt_settlement_flag_date settlement_status settlement_date settlement_amount settlement_percentage settlement_term
0 68407277 NaN 3600.0 3600.0 3600.0 36 months 13.99 123.03 C C4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
1 68355089 NaN 24700.0 24700.0 24700.0 36 months 11.99 820.28 C C1 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
2 68341763 NaN 20000.0 20000.0 20000.0 60 months 10.78 432.66 B B4 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
3 66310712 NaN 35000.0 35000.0 35000.0 60 months 14.85 829.90 C C5 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN
4 68476807 NaN 10400.0 10400.0 10400.0 60 months 22.45 289.91 F F1 ... NaN NaN Cash N NaN NaN NaN NaN NaN NaN

5 rows × 151 columns

In [7]:
print(data.info(verbose =True, max_cols = True,null_counts=True))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Data columns (total 151 columns):
id                                            2260701 non-null object
member_id                                     0 non-null float64
loan_amnt                                     2260668 non-null float64
funded_amnt                                   2260668 non-null float64
funded_amnt_inv                               2260668 non-null float64
term                                          2260668 non-null object
int_rate                                      2260668 non-null float64
installment                                   2260668 non-null float64
grade                                         2260668 non-null object
sub_grade                                     2260668 non-null object
emp_title                                     2093699 non-null object
emp_length                                    2113761 non-null object
home_ownership                                2260668 non-null object
annual_inc                                    2260664 non-null float64
verification_status                           2260668 non-null object
issue_d                                       2260668 non-null object
loan_status                                   2260668 non-null object
pymnt_plan                                    2260668 non-null object
url                                           2260668 non-null object
desc                                          126067 non-null object
purpose                                       2260668 non-null object
title                                         2237343 non-null object
zip_code                                      2260667 non-null object
addr_state                                    2260668 non-null object
dti                                           2258957 non-null float64
delinq_2yrs                                   2260639 non-null float64
earliest_cr_line                              2260639 non-null object
fico_range_low                                2260668 non-null float64
fico_range_high                               2260668 non-null float64
inq_last_6mths                                2260638 non-null float64
mths_since_last_delinq                        1102166 non-null float64
mths_since_last_record                        359156 non-null float64
open_acc                                      2260639 non-null float64
pub_rec                                       2260639 non-null float64
revol_bal                                     2260668 non-null float64
revol_util                                    2258866 non-null float64
total_acc                                     2260639 non-null float64
initial_list_status                           2260668 non-null object
out_prncp                                     2260668 non-null float64
out_prncp_inv                                 2260668 non-null float64
total_pymnt                                   2260668 non-null float64
total_pymnt_inv                               2260668 non-null float64
total_rec_prncp                               2260668 non-null float64
total_rec_int                                 2260668 non-null float64
total_rec_late_fee                            2260668 non-null float64
recoveries                                    2260668 non-null float64
collection_recovery_fee                       2260668 non-null float64
last_pymnt_d                                  2258241 non-null object
last_pymnt_amnt                               2260668 non-null float64
next_pymnt_d                                  915358 non-null object
last_credit_pull_d                            2260596 non-null object
last_fico_range_high                          2260668 non-null float64
last_fico_range_low                           2260668 non-null float64
collections_12_mths_ex_med                    2260523 non-null float64
mths_since_last_major_derog                   580775 non-null float64
policy_code                                   2260668 non-null float64
application_type                              2260668 non-null object
annual_inc_joint                              120710 non-null float64
dti_joint                                     120706 non-null float64
verification_status_joint                     115730 non-null object
acc_now_delinq                                2260639 non-null float64
tot_coll_amt                                  2190392 non-null float64
tot_cur_bal                                   2190392 non-null float64
open_acc_6m                                   1394538 non-null float64
open_act_il                                   1394539 non-null float64
open_il_12m                                   1394539 non-null float64
open_il_24m                                   1394539 non-null float64
mths_since_rcnt_il                            1350744 non-null float64
total_bal_il                                  1394539 non-null float64
il_util                                       1191818 non-null float64
open_rv_12m                                   1394539 non-null float64
open_rv_24m                                   1394539 non-null float64
max_bal_bc                                    1394539 non-null float64
all_util                                      1394320 non-null float64
total_rev_hi_lim                              2190392 non-null float64
inq_fi                                        1394539 non-null float64
total_cu_tl                                   1394538 non-null float64
inq_last_12m                                  1394538 non-null float64
acc_open_past_24mths                          2210638 non-null float64
avg_cur_bal                                   2190322 non-null float64
bc_open_to_buy                                2185733 non-null float64
bc_util                                       2184597 non-null float64
chargeoff_within_12_mths                      2260523 non-null float64
delinq_amnt                                   2260639 non-null float64
mo_sin_old_il_acct                            2121597 non-null float64
mo_sin_old_rev_tl_op                          2190391 non-null float64
mo_sin_rcnt_rev_tl_op                         2190391 non-null float64
mo_sin_rcnt_tl                                2190392 non-null float64
mort_acc                                      2210638 non-null float64
mths_since_recent_bc                          2187256 non-null float64
mths_since_recent_bc_dlq                      519701 non-null float64
mths_since_recent_inq                         1965233 non-null float64
mths_since_recent_revol_delinq                740359 non-null float64
num_accts_ever_120_pd                         2190392 non-null float64
num_actv_bc_tl                                2190392 non-null float64
num_actv_rev_tl                               2190392 non-null float64
num_bc_sats                                   2202078 non-null float64
num_bc_tl                                     2190392 non-null float64
num_il_tl                                     2190392 non-null float64
num_op_rev_tl                                 2190392 non-null float64
num_rev_accts                                 2190391 non-null float64
num_rev_tl_bal_gt_0                           2190392 non-null float64
num_sats                                      2202078 non-null float64
num_tl_120dpd_2m                              2107011 non-null float64
num_tl_30dpd                                  2190392 non-null float64
num_tl_90g_dpd_24m                            2190392 non-null float64
num_tl_op_past_12m                            2190392 non-null float64
pct_tl_nvr_dlq                                2190237 non-null float64
percent_bc_gt_75                              2185289 non-null float64
pub_rec_bankruptcies                          2259303 non-null float64
tax_liens                                     2260563 non-null float64
tot_hi_cred_lim                               2190392 non-null float64
total_bal_ex_mort                             2210638 non-null float64
total_bc_limit                                2210638 non-null float64
total_il_high_credit_limit                    2190392 non-null float64
revol_bal_joint                               108020 non-null float64
sec_app_fico_range_low                        108021 non-null float64
sec_app_fico_range_high                       108021 non-null float64
sec_app_earliest_cr_line                      108021 non-null object
sec_app_inq_last_6mths                        108021 non-null float64
sec_app_mort_acc                              108021 non-null float64
sec_app_open_acc                              108021 non-null float64
sec_app_revol_util                            106184 non-null float64
sec_app_open_act_il                           108021 non-null float64
sec_app_num_rev_accts                         108021 non-null float64
sec_app_chargeoff_within_12_mths              108021 non-null float64
sec_app_collections_12_mths_ex_med            108021 non-null float64
sec_app_mths_since_last_major_derog           35942 non-null float64
hardship_flag                                 2260668 non-null object
hardship_type                                 10917 non-null object
hardship_reason                               10917 non-null object
hardship_status                               10917 non-null object
deferral_term                                 10917 non-null float64
hardship_amount                               10917 non-null float64
hardship_start_date                           10917 non-null object
hardship_end_date                             10917 non-null object
payment_plan_start_date                       10917 non-null object
hardship_length                               10917 non-null float64
hardship_dpd                                  10917 non-null float64
hardship_loan_status                          10917 non-null object
orig_projected_additional_accrued_interest    8651 non-null float64
hardship_payoff_balance_amount                10917 non-null float64
hardship_last_payment_amount                  10917 non-null float64
disbursement_method                           2260668 non-null object
debt_settlement_flag                          2260668 non-null object
debt_settlement_flag_date                     34246 non-null object
settlement_status                             34246 non-null object
settlement_date                               34246 non-null object
settlement_amount                             34246 non-null float64
settlement_percentage                         34246 non-null float64
settlement_term                               34246 non-null float64
dtypes: float64(113), object(38)
memory usage: 2.5+ GB
None
In [8]:
# Loading in the data dictionary
xls = pd.ExcelFile('LCDataDictionary.xlsx')
dict1 = pd.read_excel(xls, 'LoanStats')
dict2 = pd.read_excel(xls, 'browseNotes')
In [9]:
# Column Names in the data dictionary
print(dict1.columns.tolist())
print(dict2.columns.tolist())
['LoanStatNew', 'Description']
['BrowseNotesFile', 'Description']
In [10]:
dict1.loc[dict1['LoanStatNew'].isnull(),:]
Out[10]:
LoanStatNew Description
151 NaN NaN
152 NaN * Employer Title replaces Employer Name for al...
In [11]:
dict2.loc[dict2['BrowseNotesFile'].isnull(),:]
Out[11]:
BrowseNotesFile Description
120 NaN NaN
121 NaN * Employer Title replaces Employer Name for al...
In [12]:
# Remove NaN from data dictionary LoanStatNew
dict1.dropna(axis=0, subset=['LoanStatNew'], inplace= True)
In [13]:
# Remove NaN from data dictionary LoanStatNew
dict2.dropna(axis=0, subset=['BrowseNotesFile'], inplace= True)
In [14]:
print('Number of Features in LoanStatNew:' ,dict1.shape[0])
print('Number of Features in BrowseNotesFile:', dict2.shape[0])
Number of Features in LoanStatNew: 151
Number of Features in BrowseNotesFile: 120
In [15]:
# available columns in data dictionary LoanStatNew
dict1['LoanStatNew'].tolist() 
Out[15]:
['acc_now_delinq',
 'acc_open_past_24mths',
 'addr_state',
 'all_util',
 'annual_inc',
 'annual_inc_joint',
 'application_type',
 'avg_cur_bal',
 'bc_open_to_buy',
 'bc_util',
 'chargeoff_within_12_mths',
 'collection_recovery_fee',
 'collections_12_mths_ex_med',
 'delinq_2yrs',
 'delinq_amnt',
 'desc',
 'dti',
 'dti_joint',
 'earliest_cr_line',
 'emp_length',
 'emp_title',
 'fico_range_high',
 'fico_range_low',
 'funded_amnt',
 'funded_amnt_inv',
 'grade',
 'home_ownership',
 'id',
 'il_util',
 'initial_list_status',
 'inq_fi',
 'inq_last_12m',
 'inq_last_6mths',
 'installment',
 'int_rate',
 'issue_d',
 'last_credit_pull_d',
 'last_fico_range_high',
 'last_fico_range_low',
 'last_pymnt_amnt',
 'last_pymnt_d',
 'loan_amnt',
 'loan_status',
 'max_bal_bc',
 'member_id',
 'mo_sin_old_il_acct',
 'mo_sin_old_rev_tl_op',
 'mo_sin_rcnt_rev_tl_op',
 'mo_sin_rcnt_tl',
 'mort_acc',
 'mths_since_last_delinq',
 'mths_since_last_major_derog',
 'mths_since_last_record',
 'mths_since_rcnt_il',
 'mths_since_recent_bc',
 'mths_since_recent_bc_dlq',
 'mths_since_recent_inq',
 'mths_since_recent_revol_delinq',
 'next_pymnt_d',
 'num_accts_ever_120_pd',
 'num_actv_bc_tl',
 'num_actv_rev_tl',
 'num_bc_sats',
 'num_bc_tl',
 'num_il_tl',
 'num_op_rev_tl',
 'num_rev_accts',
 'num_rev_tl_bal_gt_0',
 'num_sats',
 'num_tl_120dpd_2m',
 'num_tl_30dpd',
 'num_tl_90g_dpd_24m',
 'num_tl_op_past_12m',
 'open_acc',
 'open_acc_6m',
 'open_il_12m',
 'open_il_24m',
 'open_act_il',
 'open_rv_12m',
 'open_rv_24m',
 'out_prncp',
 'out_prncp_inv',
 'pct_tl_nvr_dlq',
 'percent_bc_gt_75',
 'policy_code',
 'pub_rec',
 'pub_rec_bankruptcies',
 'purpose',
 'pymnt_plan',
 'recoveries',
 'revol_bal',
 'revol_util',
 'sub_grade',
 'tax_liens',
 'term',
 'title',
 'tot_coll_amt',
 'tot_cur_bal',
 'tot_hi_cred_lim',
 'total_acc',
 'total_bal_ex_mort',
 'total_bal_il',
 'total_bc_limit',
 'total_cu_tl',
 'total_il_high_credit_limit',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_int',
 'total_rec_late_fee',
 'total_rec_prncp',
 'total_rev_hi_lim \xa0',
 'url',
 'verification_status',
 'verified_status_joint',
 'zip_code',
 'revol_bal_joint ',
 'sec_app_fico_range_low ',
 'sec_app_fico_range_high ',
 'sec_app_earliest_cr_line ',
 'sec_app_inq_last_6mths ',
 'sec_app_mort_acc ',
 'sec_app_open_acc ',
 'sec_app_revol_util ',
 'sec_app_open_act_il',
 'sec_app_num_rev_accts ',
 'sec_app_chargeoff_within_12_mths ',
 'sec_app_collections_12_mths_ex_med ',
 'sec_app_mths_since_last_major_derog ',
 'hardship_flag',
 'hardship_type',
 'hardship_reason',
 'hardship_status',
 'deferral_term',
 'hardship_amount',
 'hardship_start_date',
 'hardship_end_date',
 'payment_plan_start_date',
 'hardship_length',
 'hardship_dpd',
 'hardship_loan_status',
 'orig_projected_additional_accrued_interest',
 'hardship_payoff_balance_amount',
 'hardship_last_payment_amount',
 'disbursement_method',
 'debt_settlement_flag',
 'debt_settlement_flag_date',
 'settlement_status',
 'settlement_date',
 'settlement_amount',
 'settlement_percentage',
 'settlement_term']
In [16]:
# available columns in data dictionary LoanStatNew
dict2['BrowseNotesFile'].tolist() 
Out[16]:
['acceptD',
 'accNowDelinq',
 'accOpenPast24Mths',
 'addrState',
 'all_util',
 'annual_inc_joint',
 'annualInc',
 'application_type',
 'avg_cur_bal',
 'bcOpenToBuy',
 'bcUtil',
 'chargeoff_within_12_mths',
 'collections_12_mths_ex_med',
 'creditPullD',
 'delinq2Yrs',
 'delinqAmnt',
 'desc',
 'dti',
 'dti_joint',
 'earliestCrLine',
 'effective_int_rate',
 'emp_title',
 'empLength',
 'expD',
 'expDefaultRate',
 'ficoRangeHigh',
 'ficoRangeLow',
 'fundedAmnt',
 'grade',
 'homeOwnership',
 'id',
 'il_util',
 'ils_exp_d',
 'initialListStatus',
 'inq_fi',
 'inq_last_12m',
 'inqLast6Mths',
 'installment',
 'intRate',
 'isIncV',
 'listD',
 'loanAmnt',
 'max_bal_bc',
 'memberId',
 'mo_sin_old_rev_tl_op',
 'mo_sin_rcnt_rev_tl_op',
 'mo_sin_rcnt_tl',
 'mortAcc',
 'msa',
 'mths_since_last_major_derog',
 'mths_since_oldest_il_open',
 'mths_since_rcnt_il',
 'mthsSinceLastDelinq',
 'mthsSinceLastRecord',
 'mthsSinceMostRecentInq',
 'mthsSinceRecentBc',
 'mthsSinceRecentLoanDelinq',
 'mthsSinceRecentRevolDelinq',
 'num_accts_ever_120_pd',
 'num_actv_bc_tl',
 'num_actv_rev_tl',
 'num_bc_sats',
 'num_bc_tl',
 'num_il_tl',
 'num_op_rev_tl',
 'num_rev_accts',
 'num_rev_tl_bal_gt_0',
 'num_sats',
 'num_tl_120dpd_2m',
 'num_tl_30dpd',
 'num_tl_90g_dpd_24m',
 'num_tl_op_past_12m',
 'open_acc_6m',
 'open_il_12m',
 'open_il_24m',
 'open_act_il',
 'open_rv_12m',
 'open_rv_24m',
 'openAcc',
 'pct_tl_nvr_dlq',
 'percentBcGt75',
 'pub_rec_bankruptcies',
 'pubRec',
 'purpose',
 'reviewStatus',
 'reviewStatusD',
 'revolBal',
 'revolUtil',
 'serviceFeeRate',
 'subGrade',
 'tax_liens',
 'term',
 'title',
 'tot_coll_amt',
 'tot_cur_bal',
 'tot_hi_cred_lim',
 'total_bal_il',
 'total_cu_tl',
 'total_il_high_credit_limit',
 'total_rev_hi_lim \xa0',
 'totalAcc',
 'totalBalExMort',
 'totalBcLimit',
 'url',
 'verified_status_joint',
 'zip_code',
 'revol_bal_joint ',
 'sec_app_fico_range_low ',
 'sec_app_fico_range_high ',
 'sec_app_earliest_cr_line ',
 'sec_app_inq_last_6mths ',
 'sec_app_mort_acc ',
 'sec_app_open_acc ',
 'sec_app_revol_util ',
 'sec_app_open_act_il',
 'sec_app_num_rev_accts ',
 'sec_app_chargeoff_within_12_mths ',
 'sec_app_collections_12_mths_ex_med ',
 'sec_app_mths_since_last_major_derog ',
 'disbursement_method']
In [17]:
#  All columns in the dataset
data.columns.tolist()
Out[17]:
['id',
 'member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'verification_status',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'url',
 'desc',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'earliest_cr_line',
 'fico_range_low',
 'fico_range_high',
 'inq_last_6mths',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'initial_list_status',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'next_pymnt_d',
 'last_credit_pull_d',
 'last_fico_range_high',
 'last_fico_range_low',
 'collections_12_mths_ex_med',
 'mths_since_last_major_derog',
 'policy_code',
 'application_type',
 'annual_inc_joint',
 'dti_joint',
 'verification_status_joint',
 'acc_now_delinq',
 'tot_coll_amt',
 'tot_cur_bal',
 'open_acc_6m',
 'open_act_il',
 'open_il_12m',
 'open_il_24m',
 'mths_since_rcnt_il',
 'total_bal_il',
 'il_util',
 'open_rv_12m',
 'open_rv_24m',
 'max_bal_bc',
 'all_util',
 'total_rev_hi_lim',
 'inq_fi',
 'total_cu_tl',
 'inq_last_12m',
 'acc_open_past_24mths',
 'avg_cur_bal',
 'bc_open_to_buy',
 'bc_util',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'mo_sin_old_il_acct',
 'mo_sin_old_rev_tl_op',
 'mo_sin_rcnt_rev_tl_op',
 'mo_sin_rcnt_tl',
 'mort_acc',
 'mths_since_recent_bc',
 'mths_since_recent_bc_dlq',
 'mths_since_recent_inq',
 'mths_since_recent_revol_delinq',
 'num_accts_ever_120_pd',
 'num_actv_bc_tl',
 'num_actv_rev_tl',
 'num_bc_sats',
 'num_bc_tl',
 'num_il_tl',
 'num_op_rev_tl',
 'num_rev_accts',
 'num_rev_tl_bal_gt_0',
 'num_sats',
 'num_tl_120dpd_2m',
 'num_tl_30dpd',
 'num_tl_90g_dpd_24m',
 'num_tl_op_past_12m',
 'pct_tl_nvr_dlq',
 'percent_bc_gt_75',
 'pub_rec_bankruptcies',
 'tax_liens',
 'tot_hi_cred_lim',
 'total_bal_ex_mort',
 'total_bc_limit',
 'total_il_high_credit_limit',
 'revol_bal_joint',
 'sec_app_fico_range_low',
 'sec_app_fico_range_high',
 'sec_app_earliest_cr_line',
 'sec_app_inq_last_6mths',
 'sec_app_mort_acc',
 'sec_app_open_acc',
 'sec_app_revol_util',
 'sec_app_open_act_il',
 'sec_app_num_rev_accts',
 'sec_app_chargeoff_within_12_mths',
 'sec_app_collections_12_mths_ex_med',
 'sec_app_mths_since_last_major_derog',
 'hardship_flag',
 'hardship_type',
 'hardship_reason',
 'hardship_status',
 'deferral_term',
 'hardship_amount',
 'hardship_start_date',
 'hardship_end_date',
 'payment_plan_start_date',
 'hardship_length',
 'hardship_dpd',
 'hardship_loan_status',
 'orig_projected_additional_accrued_interest',
 'hardship_payoff_balance_amount',
 'hardship_last_payment_amount',
 'disbursement_method',
 'debt_settlement_flag',
 'debt_settlement_flag_date',
 'settlement_status',
 'settlement_date',
 'settlement_amount',
 'settlement_percentage',
 'settlement_term']

Data leakage in Machine Learning

What is data leakage in Machine Learning?

Data leakage occurs when some of the predictors (features/independent variables) used to train a predictive model contain information about the target variable that would not be available at prediction time. See more here.

Note that some of the columns/features in the dataset are not available at the moment of investing; they only become available after the loan has been issued. These features should be removed, otherwise they would leak information from the future. We can restrict ourselves to the variables listed in the data dictionary BrowseNotesFile, which contains only the information available to investors when they are deciding whether to invest in a loan.

Note that some features are named slightly differently in the data dictionary sheet than in the dataset's columns even though they refer to the same thing, so we need to match the naming format between these column names.

In [18]:
# Note format differences between the data dictionary BrowseNotesFile and the column names in the dataset
index2 = [x not in dict2['BrowseNotesFile'].tolist() for x in data.columns.tolist()]
pd.Series(data.columns.tolist())[index2].head()
Out[18]:
1          member_id
2          loan_amnt
3        funded_amnt
4    funded_amnt_inv
6           int_rate
dtype: object
In [19]:
# Match the format of the data dictionary names to the column names in the dataset (camelCase -> snake_case)
dictionary = dict2['BrowseNotesFile'].dropna().values
dictionary = [re.sub('(?<![0-9_])(?=[A-Z0-9])', '_', x).lower().strip() for x in dictionary]
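
For illustration, the substitution above inserts an underscore before each uppercase letter or digit (unless it already follows a digit or underscore) and then lowercases and strips the name, so camelCase names from BrowseNotesFile line up with the dataset's snake_case columns. A quick sanity check on a few names from the dictionary listing above:

# Sanity check of the camelCase -> snake_case conversion used above
import re

def to_snake(name):
    return re.sub('(?<![0-9_])(?=[A-Z0-9])', '_', name).lower().strip()

for name in ['ficoRangeHigh', 'accOpenPast24Mths', 'delinq2Yrs']:
    print(name, '->', to_snake(name))
# ficoRangeHigh -> fico_range_high
# accOpenPast24Mths -> acc_open_past_24mths
# delinq2Yrs -> delinq_2yrs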
In [20]:
# Find features that are in the dataset but not in BrowseNotesFile
index3 = [x not in dictionary for x in data.columns.tolist()]
pd.Series(data.columns.tolist())[index3].head()
Out[20]:
4         funded_amnt_inv
14    verification_status
15                issue_d
16            loan_status
17             pymnt_plan
dtype: object
In [21]:
# Features that are in BrowseNotesFile but not in the dataset
index4 = [x not in data.columns.tolist()for x in dictionary]
pd.Series(dictionary)[index4]
Out[21]:
0                           accept_d
13                     credit_pull_d
20                effective_int_rate
23                             exp_d
24                  exp_default_rate
32                         ils_exp_d
39                          is_inc_v
40                            list_d
48                               msa
50         mths_since_oldest_il_open
54        mths_since_most_recent_inq
56     mths_since_recent_loan_delinq
84                     review_status
85                   review_status_d
88                  service_fee_rate
104            verified_status_joint
dtype: object
In [22]:
# Manually match the spelling between column names in dataset and data dictionary 'BrowseNote'
uni_dict = ['is_inc_v','mths_since_oldest_il_open','mths_since_most_recent_inq',
             'mths_since_recent_loan_delinq','verified_status_joint']

uni_dataset = ['verification_status', 'mo_sin_old_il_acct','mths_since_recent_inq',
           'mths_since_recent_bc_dlq', 'verification_status_joint']


dictionary = np.setdiff1d(dictionary, uni_dict)
dictionary = np.append(dictionary,uni_dataset)

Create a list containing all features that are available in the dataset and also available to investors at the moment of investing

In [23]:
dictionary = np.intersect1d(dictionary, data.columns.tolist())
In [24]:
# Skim through the available features 
dictionary
Out[24]:
array(['acc_now_delinq', 'acc_open_past_24mths', 'addr_state', 'all_util',
       'annual_inc', 'annual_inc_joint', 'application_type',
       'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
       'chargeoff_within_12_mths', 'collections_12_mths_ex_med',
       'delinq_2yrs', 'delinq_amnt', 'desc', 'disbursement_method', 'dti',
       'dti_joint', 'earliest_cr_line', 'emp_length', 'emp_title',
       'fico_range_high', 'fico_range_low', 'funded_amnt', 'grade',
       'home_ownership', 'id', 'il_util', 'initial_list_status', 'inq_fi',
       'inq_last_12m', 'inq_last_6mths', 'installment', 'int_rate',
       'loan_amnt', 'max_bal_bc', 'member_id', 'mo_sin_old_il_acct',
       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
       'mort_acc', 'mths_since_last_delinq',
       'mths_since_last_major_derog', 'mths_since_last_record',
       'mths_since_rcnt_il', 'mths_since_recent_bc',
       'mths_since_recent_bc_dlq', 'mths_since_recent_inq',
       'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd',
       'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl',
       'num_il_tl', 'num_op_rev_tl', 'num_rev_accts',
       'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m',
       'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m',
       'open_acc', 'open_acc_6m', 'open_act_il', 'open_il_12m',
       'open_il_24m', 'open_rv_12m', 'open_rv_24m', 'pct_tl_nvr_dlq',
       'percent_bc_gt_75', 'pub_rec', 'pub_rec_bankruptcies', 'purpose',
       'revol_bal', 'revol_bal_joint', 'revol_util',
       'sec_app_chargeoff_within_12_mths',
       'sec_app_collections_12_mths_ex_med', 'sec_app_earliest_cr_line',
       'sec_app_fico_range_high', 'sec_app_fico_range_low',
       'sec_app_inq_last_6mths', 'sec_app_mort_acc',
       'sec_app_mths_since_last_major_derog', 'sec_app_num_rev_accts',
       'sec_app_open_acc', 'sec_app_open_act_il', 'sec_app_revol_util',
       'sub_grade', 'tax_liens', 'term', 'title', 'tot_coll_amt',
       'tot_cur_bal', 'tot_hi_cred_lim', 'total_acc', 'total_bal_ex_mort',
       'total_bal_il', 'total_bc_limit', 'total_cu_tl',
       'total_il_high_credit_limit', 'total_rev_hi_lim', 'url',
       'verification_status', 'verification_status_joint', 'zip_code'],
      dtype='<U42')

Note that issue_d is the only column in the dataset that reveals when the loan was originated. Although it is the date the loan was issued (technically information from the future), we will extract the year and month of the issue date, which represent the time period when investors were making their investment decisions about those loans.
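
A minimal sketch of that extraction (only the year version is actually created on the training set in the EDA section below; the month series here is illustrative):

# Sketch: derive the origination year and month from issue_d (values like "Dec-2015")
issue_dates = pd.to_datetime(data['issue_d'])
issue_year = issue_dates.dt.year
issue_month = issue_dates.dt.month     # illustrative; only the year is used later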

In [25]:
# Adding back the target variable 'loan status' and 'issue_d' to the dictionary list
dictionary = np.append(dictionary, np.array(['loan_status', 'issue_d' ]))
In [26]:
# Drop features that are not available at the moment of investing, except for loan_status (target variable) and issue_d
cols_to_remove = [x not in dictionary for x in data.columns.tolist()]
cols_to_remove = pd.Series(data.columns.tolist())[cols_to_remove]
data.drop(columns=cols_to_remove, inplace = True)

After removing the features that leak information from the future, we now have a dataset containing only features that are available to investors at the moment of investing.

In [27]:
data.shape
Out[27]:
(2260701, 111)

Check missing values in the target variable

In [28]:
# Check if there are any missing values in the target variable, loan_status
data.loan_status.isnull().sum()
Out[28]:
33
In [29]:
# Check rows where loan_status is NaN. Note that there is no loan information in any of these rows.
data.loc[data.loan_status.isnull(),:].head()
Out[29]:
id member_id loan_amnt funded_amnt term int_rate installment grade sub_grade emp_title ... sec_app_inq_last_6mths sec_app_mort_acc sec_app_open_acc sec_app_revol_util sec_app_open_act_il sec_app_num_rev_accts sec_app_chargeoff_within_12_mths sec_app_collections_12_mths_ex_med sec_app_mths_since_last_major_derog disbursement_method
421095 Total amount funded in policy code 1: 6417608175 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
421096 Total amount funded in policy code 2: 1944088810 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
528961 Total amount funded in policy code 1: 1741781700 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
528962 Total amount funded in policy code 2: 564202131 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
651664 Total amount funded in policy code 1: 1791201400 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 111 columns

In [30]:
# remove rows where loan status is missing 
data.dropna(axis=0, subset=['loan_status'], inplace= True)
In [31]:
# Let's check the distribution of target variable, loan status 
data.loan_status.value_counts()
Out[31]:
Fully Paid                                             1076751
Current                                                 878317
Charged Off                                             268559
Late (31-120 days)                                       21467
In Grace Period                                           8436
Late (16-30 days)                                         4349
Does not meet the credit policy. Status:Fully Paid        1988
Does not meet the credit policy. Status:Charged Off        761
Default                                                     40
Name: loan_status, dtype: int64
In [32]:
# Loan status distribution
plt.figure(figsize = (10,6))
ax = sns.countplot(x="loan_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Loan Status", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Status Types Distribution", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x()+p.get_width()/2.,
            height + 4,
            '{:1.2%}'.format(height/len(data)),
            ha="center", fontsize=10) 

Redefine the target variable (to be binary)

There are 9 loan statuses:

  • Fully Paid
  • Current
  • Charged Off
  • In Grace Period
  • Late (31-120 days)
  • Late (16-30 days)
  • Default
  • Does not meet the credit policy. Status:Fully Paid
  • Does not meet the credit policy. Status:Charged Off

We consider "Fully Paid", "Current", and "Does not meet the credit policy. Status:Fully Paid" as good-condition loans and everything else as bad-condition loans.

In [33]:
# define bad_indicators
bad_indicators  =  ["Charged Off ",
                    "Charged Off",
                    "Default",
                    "Does not meet the credit policy. Status:Charged Off",
                    "In Grace Period",
                    "Default Receiver",
                    "Late (16-30 days)",
                    "Late (31-120 days)"]
In [34]:
# Create a new target variable (1 = bad loan, 0 = good loan)
data['target'] = data['loan_status'].isin(bad_indicators).astype(int)
In [35]:
# "Good" and "Bad" Loans Distribution
ax = sns.countplot(x="target", data=data)
ax.set_xlabel("Loan Status Categories", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Loan Status Types Distribution", fontsize=18)

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 4,
            '{:1.2%}'.format(height/len(data)),
            ha="center", fontsize=10) 

Note that the dataset is imbalanced: the minority group (bad loans) makes up 13.43% of the population, compared to 86.57% for the majority group.

Split Train Test Dataset

In [36]:
y = data['target']

# Split the dataset into training and test dataset with random sampling
xtrain, xtest, ytrain, ytest = train_test_split(data, y, test_size=0.20, 
                                                random_state=1, stratify=y)
In [37]:
# Make sure the training and test sets have a similar distribution of the majority and minority groups
ytrain.mean() 
Out[37]:
0.13430214748520072
In [38]:
ytest.mean()
Out[38]:
0.13430089309806384
In [93]:
# save train, test sets for later use
xtrain.to_csv("xtrain.csv", index=False)
xtest.to_csv("xtest.csv", index=False)
ytrain.to_csv("ytrain.csv", index=False)
ytest.to_csv("ytest.csv", index=False)
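
The class imbalance noted above will be handled during modeling, which is why the raw split is saved before any resampling. Below is a minimal sketch, on toy data, of the planned upsampling step; in the modeling notebook it would be applied only to the training split so the test set keeps its original class balance.

# Sketch: random oversampling of the minority class, applied to training data only
import pandas as pd
from imblearn.over_sampling import RandomOverSampler

toy_X = pd.DataFrame({'feature': range(10)})
toy_y = pd.Series([0] * 8 + [1] * 2)              # imbalanced toy target

ros = RandomOverSampler(random_state=1)
X_res, y_res = ros.fit_resample(toy_X, toy_y)
print(pd.Series(y_res).value_counts())            # both classes now have 8 rows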

EDA

Overview of Loans

Let's explore the data to see how features relate to loan condition. I will focus on the following:

  • Loan amount vs funded amount

  • Interest rate

  • Loan terms

  • Loan issue year

  • Borrower's address (state)

  • Borrower's annual income

  • Borrower's employment length

  • Loan grade

  • Home ownership

  • Debt-to-income ratio

  • Loan purpose

Loan Amount vs Funded Amount

  • The majority of loan amounts fall between $10,000 and $20,000, with an average of $15,044.
  • The distributions of loan amount and funded amount are very similar, implying that once a loan passed approval, the borrower was highly likely to be funded for the full requested amount.
In [39]:
# Compare Loan Amount and Funded Amount
f, axes = plt.subplots(1, 2, figsize=(10, 4),  sharey=True)
sns.distplot(xtrain['loan_amnt'], hist=True, kde=True, 
             bins=int(180/5),  color="b",
             kde_kws={'linewidth': 2}, ax = axes[0])
axes[0].set_title('Loan Amount (in 1000s)')
axes[0].set_xlabel('loan_amnt')


sns.distplot(xtrain['funded_amnt'], hist=True, kde=True, 
             bins=int(180/5),  color="g",
             kde_kws={'linewidth': 2}, ax = axes[1])
axes[1].set_title('Funded Amount (in 1000s)')
axes[1].set_xlabel('funded_amnt')
plt.show()

print("Loan Amount")
print("Skewness: %f" % xtrain['loan_amnt'].skew())
print("Kurtosis: %f" % xtrain['loan_amnt'].kurt())

print("Funded Amount")
print("Skewness: %f" % xtrain['funded_amnt'].skew())
print("Kurtosis: %f" % xtrain['funded_amnt'].kurt())

print("Average Loan Amount : " + str(round(xtrain.loan_amnt.mean(),2)))
Loan Amount
Skewness: 0.777848
Kurtosis: -0.119390
Funded Amount
Skewness: 0.778844
Kurtosis: -0.116950
Average Loan Amount : 15044.23

Interest Rate

  • The majority of interest rates fall between 10% and 20%.
  • The average interest rate of good-condition loans (12.68%) is about 3 percentage points lower than that of bad-condition loans (15.69%), suggesting that the higher the interest rate (and therefore the monthly interest payment), the more likely a loan is to default.
In [40]:
# Check distribution of loan interest rate
f, axes = plt.subplots(1, 2, figsize=(10, 4),  sharey=True)


sns.distplot(xtrain['int_rate'], hist=True, kde=True, 
             bins=int(180/5), 
             kde_kws={'linewidth': 2},ax = axes[0])
axes[0].set_xlabel('int_rate')
axes[0].set_title('int_rate')



sns.distplot(xtrain.loc[xtrain['target']==0,'int_rate'], hist=True, kde=True, 
             bins=int(180/5), label='0',
             kde_kws={'linewidth': 2},ax = axes[1])
sns.distplot(xtrain.loc[xtrain['target']==1,'int_rate'], hist=True, kde=True, 
             bins=int(180/5),  label='1',
             kde_kws={'linewidth': 2},ax = axes[1])
axes[1].set_xlabel('int_rate')
axes[1].set_title('int_rate by loan status')



plt.legend()
plt.show()

print("Skewness: %f" % xtrain['int_rate'].skew())
print("Kurtosis: %f" % xtrain['int_rate'].kurt())
Skewness: 0.769178
Kurtosis: 0.597467
In [41]:
print('Average Loan Interest Rate : ')
print('Good Loans:', xtrain.loc[xtrain['target']==0,'int_rate'].mean())
print('Bad Loans:', xtrain.loc[xtrain['target']==1,'int_rate'].mean())
Average Loan Interest Rate : 
Good Loans: 12.689932877459434
Bad Loans: 15.694198443769201

Based on the distribution plot and the mean statistics, bad loans tend to have a higher interest rate than good loans at Lending Club.

In [42]:
# Loan interest rate by loan grade
ax = sns.boxplot(x="grade", y="int_rate", data=xtrain, order=["A","B","C","D","E","F","G"])
                 
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("Loan Grade Categories", fontsize=10)
ax.set_ylabel("Loan IR", fontsize=10)
ax.set_title("Loan IR by Loan Grade", fontsize=15)
plt.show()

The plot shows that the better the loan grade, the lower the interest rate.

Loans Terms

  • Loan term is either 36 months or 60 months (categorical variable)
  • A loan with the shorter term (36 months compared to 60 months) has a higher probability of turning into a bad loan
In [43]:
# Loan term by loan status
ax = sns.countplot( x="term",   hue="target", data=xtrain)               
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("Loan Status Categories", fontsize=10)
ax.set_ylabel("Loan Term", fontsize=10)
ax.set_title("Loan Term by Loan Status", fontsize=15)

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=10) 

The plot above shows that a loan with the shorter term (36 months compared to 60 months) is more likely to turn into a bad loan.

Loan Issue Year

  • From 2007 to 2018, the number of loans issued each year grew.
In [44]:
# Extract the issue year from the issue date as a separate variable
xtrain["issue_year"]= pd.to_datetime(xtrain.issue_d).dt.year
#xtrain['issue_y'] = pd.to_datetime(issue_d).dt.year.dropna().astype('int32')
In [45]:
# Total loans issued by year
plt.figure(figsize=(10,6))
ax = sns.countplot(x = 'issue_year', data=xtrain)
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("Years", fontsize=10)
ax.set_ylabel("Count Loans", fontsize=10)
ax.set_title("Loans Issued by Years ", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=12) 
ax.set_ylim(0, max(sizes) * 1.10)
plt.show()

Overview of Borrowers

  • Address (state)

  • Income

  • Average length of employment

  • Loan grade

  • Home ownership

  • Debt-to-income

  • Loan purpose

Borrower's Address (State)

  • Top 3 states by number of loans issued: California, New York, and Texas
  • Top 3 states by percentage of issued loans in bad condition: Iowa, Alabama, and Arkansas
In [46]:
# Total loans by state
plt.figure(figsize=(26,14))
ax = sns.countplot(x='addr_state', data=xtrain, 
                   order = xtrain['addr_state'].value_counts().index)
                   
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("States", fontsize=10)
ax.set_ylabel("Count Loans", fontsize=10)
ax.set_title("Total Loans by Borrower's Address(States)", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            '{:1.1f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=12) 
ax.set_ylim(0, max(sizes) * 1.10)
plt.show()
In [47]:
# Count good and bad loans by state
state_df = xtrain.groupby(['addr_state', 'target']).size().reset_index().pivot(
    columns='target', index='addr_state', values=0).sort_values(by=1, ascending = False)
In [48]:
#Total Loans by States - Good vs Bad

ax = state_df.plot.bar(stacked=True ,figsize= (18, 12))
                             
#ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("States", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Total Loans by Borrower's Address(States) - Good vs Bad Loans", fontsize=15)
#sizes=[]


for i in range(51):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.0f}%'.format(height/(height + ax.patches[i+51].get_height()) *100),
            ha="center", fontsize=10)
    
for i in range(51,102):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.0f}%'.format(height/(height + ax.patches[i-51].get_height()) *100),
            ha="center", fontsize=10)
plt.show()
In [49]:
state_df['default_rate'] =state_df[1]/(state_df[0]+state_df[1])
In [50]:
# Top 10 states with the lowest average default rate
state_df.sort_values(by='default_rate').head(10)
Out[50]:
target 0 1 default_rate
addr_state
ME 3701 288 0.072199
ID 3138 308 0.089379
VT 3578 364 0.092339
WV 6028 651 0.097470
NH 8012 873 0.098255
DC 3894 426 0.098611
OR 19270 2139 0.099911
SC 19961 2393 0.107050
CO 34375 4125 0.107143
ND 2549 314 0.109675
In [51]:
# Top 10 states with the highest average default rate
state_df.sort_values(by='default_rate', ascending = False).head(10)
Out[51]:
target 0 1 default_rate
addr_state
IA 11 3 0.214286
AL 18313 3546 0.162222
AR 11431 2155 0.158619
MS 8481 1591 0.157963
LA 17377 3246 0.157397
OK 14004 2538 0.153428
NY 126976 22138 0.148464
NV 22291 3883 0.148353
FL 110911 18781 0.144812
HI 7318 1238 0.144694

Both the bar plots and the summary tables (default rate by state) show that some states have higher default rates than others.

Borrower's Annual Income

  • The higher the borrower's income level, the less likely the loan is to default.
In [52]:
# Check distribution of Borrower's Annual Income
# These plots are hard to read because of skewness and outliers
f, axes = plt.subplots(1, 2, figsize=(10, 4),  sharey=True)

sns.distplot(xtrain['annual_inc'], hist=True, kde=True, bins=int(180/5), 
             kde_kws={'linewidth': 2},ax = axes[0])
axes[0].set_xlabel('annual income')
axes[0].set_title('Distribution of Annual Income')


sns.distplot(xtrain.loc[xtrain['target']==0,'annual_inc'], hist=True, kde=True, 
             bins=int(180/5), label='0', kde_kws={'linewidth': 2},ax = axes[1])
sns.distplot(xtrain.loc[xtrain['target']==1,'annual_inc'], hist=True, kde=True, 
             bins=int(180/5),  label='1', kde_kws={'linewidth': 2},ax = axes[1])
axes[1].set_xlabel('annual_inc')
axes[1].set_title('Annual Income by loan status')

plt.legend()
plt.show()

print("Skewness: %f" % xtrain['annual_inc'].skew())
print("Kurtosis: %f" % xtrain['annual_inc'].kurt())
Skewness: 495.671570
Kurtosis: 411316.825794
In [53]:
# Check distribution of Borrower's Annual Income, filtered for annual_inc < $2,000,000
f, axes = plt.subplots(1, 2, figsize=(10, 4),  sharey=True)

sns.distplot(xtrain.loc[xtrain['annual_inc'] < 2000000,'annual_inc'], hist=True, kde=True, 
             bins=int(180/5), kde_kws={'linewidth': 2},ax = axes[0])
axes[0].set_xlabel('annual income')
axes[0].set_title('Distribution of Annual Income ( < $2,000,000)')

sns.distplot(xtrain.loc[(xtrain['target']==0) & (xtrain['annual_inc'] < 2000000),'annual_inc'], hist=True, kde=True, 
             bins=int(180/5), label='0',kde_kws={'linewidth': 2},ax = axes[1])
sns.distplot(xtrain.loc[(xtrain['target']==1) & (xtrain['annual_inc'] < 2000000),'annual_inc'], hist=True, kde=True, 
             bins=int(180/5),  label='1',kde_kws={'linewidth': 2},ax = axes[1])
axes[1].set_xlabel('annual_inc')
axes[1].set_title('Annual Income by loan status ( < $2,000,000)')

plt.legend()
plt.show()

print("Skewness: %f" % xtrain['annual_inc'].skew())
print("Kurtosis: %f" % xtrain['annual_inc'].kurt())
Skewness: 495.671570
Kurtosis: 411316.825794
In [54]:
print('Annual Income Quantiles:')
print("Low Income (below 25%): Annual_Inc <",  np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.25) )
print("Median Income (between 25% and 75%):", np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.25),
       "< Annual_Inc <",
     np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.75))
print("High Income (above 75% below 95%): ", np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.75),
       "< Annual_Inc <",
     np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.95))
print("Extreme high income (Above 95%):  Annual_Inc >",
      np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.95))
Annual Income Quantiles:
Low Income (below 25%): Annual_Inc < 46000.0
Median Income (between 25% and 75%): 46000.0 < Annual_Inc < 93000.0
High Income (above 75% below 95%):  93000.0 < Annual_Inc < 160000.0
Extreme high income (Above 95%):  Annual_Inc > 160000.0
In [55]:
# Categorize borrowers by annual income level
income_df = xtrain[['annual_inc', 'target']]
income_df.loc[income_df['annual_inc'] < 46000.0,'inc_cat'] = 'low_inc'
income_df.loc[(income_df['annual_inc'] >= 46000.0)
              & (income_df['annual_inc'] < 93000.0),'inc_cat'] = 'median_inc'
income_df.loc[(income_df['annual_inc'] >= 93000.0)
              & (income_df['annual_inc'] < 160000.0),'inc_cat'] = 'high_inc'
income_df.loc[income_df['annual_inc'] >= 160000.0,'inc_cat'] = 'xhigh_inc'
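
As a side note, the same binning could be written more compactly with pd.cut using the quantile cutoffs printed above (right=False reproduces the >=/< boundaries); this is just an alternative sketch, not the version used below.

# Alternative sketch: bin annual income with pd.cut and the quantile cutoffs
inc_bins = [-np.inf, 46000.0, 93000.0, 160000.0, np.inf]
inc_labels = ['low_inc', 'median_inc', 'high_inc', 'xhigh_inc']
income_df['inc_cat'] = pd.cut(income_df['annual_inc'], bins=inc_bins,
                              labels=inc_labels, right=False)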
In [56]:
#Good vs Bad Loans by Income Level 
income_table = income_df.groupby(['inc_cat', 'target']).size().reset_index().pivot(
    columns='target', index='inc_cat', values=0).sort_values(by=1, ascending = False)
ax = income_table.plot.bar(stacked=True ,figsize= (8, 6))
                             
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Income Level", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Loans by Borrower's Income Level - Good vs Bad Loans", fontsize=15)
#sizes=[]

for i in range(4):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i+4].get_height()) *100),
            ha="center", fontsize=10)
    
for i in range(4,8):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i-4].get_height()) *100),
            ha="center", fontsize=10)
plt.show()

We can see from the plot above that the higher the income level, the lower the loan default rate.

Borrower's employment length

  • There is no clear pattern between loan default and employment length.
In [57]:
#Good vs Bad Loans by emp_length
emp_length_table = xtrain.groupby(['emp_length', 'target']).size().reset_index().pivot(
    columns='target', index='emp_length', values=0).sort_values(by=1, ascending = False)
ax = emp_length_table.plot.bar(stacked=True ,figsize= (8, 6))
                             
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("employment length", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Loans by Borrower's employment length - Good vs Bad Loans", fontsize=15)
#sizes=[]

n = len(ax.patches)
for i in range(int(n/2)):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i+ int(n/2)].get_height()) *100),
            ha="center", fontsize=10)
    
for i in range(int(n/2),n):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i- int(n/2)].get_height()) *100),
            ha="center", fontsize=10)
plt.show()
In [58]:
# The employment lengths of the two largest groups of borrowers
print('Percentage of borrowers ')
print('emp_length > 10+ ys: ', round((xtrain.emp_length.value_counts()/len(xtrain)).values[0],4))
print('emp_length <= 3 ys: ', round(sum((xtrain.emp_length.value_counts()/len(xtrain)).values[1:4]),4))
Percentage of borrowers 
emp_length > 10+ ys:  0.3308
emp_length <= 3 ys:  0.2541

There is no clear pattern between loan default and employment length. Note that borrowers with long employment length (10+ years) and those with short employment length (<= 3 years) make up the two largest groups of borrowers.

Loan Grade

  • The better the loan grade (Grade A as opposed to Grade G), the lower the probability that a loan turns into a bad-condition loan.
In [59]:
# Loan grade distribution 
plt.figure(figsize = (8,4))
ax = sns.countplot(x="grade", data=xtrain, order= xtrain.grade.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),horizontalalignment='right')
ax.set_xlabel("Loan Grade", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Grade Distribution", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=10) 

The majority of the loans are graded A, B, or C.

In [60]:
#Loan Grade between 'good' and 'bad' loans
plt.figure(figsize = (10,6))
ax = sns.countplot(x="grade",hue="target", data=xtrain, order = xtrain.grade.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),horizontalalignment='right')
ax.set_xlabel("Loan Grade", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Grade Distribution", fontsize=15)
sizes=[]
n = len(ax.patches)
for i in range(int(n/2)):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i+ int(n/2)].get_height()) *100),
            ha="center", fontsize=10)
    
for i in range(int(n/2),n):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i- int(n/2)].get_height()) *100),
            ha="center", fontsize=10)
plt.show()

The plot above shows that the better the loan grade, the lower the ratio of bad loans.

Home ownership

  • The relationship between home ownership and bad loans is unclear.
  • On average, 13.5% of homeowners in the dataset have a bad loan, compared with 15.6% of borrowers who rent and 11.7% of mortgage holders.
In [61]:
#Homeownership and Loan Status
plt.figure(figsize = (10,6))
ax = sns.countplot(x="home_ownership",hue="target", data=xtrain)
ax.set_xticklabels(ax.get_xticklabels(),horizontalalignment='right')
ax.set_xlabel("home_ownership", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Home ownership by loan status", fontsize=15)
sizes=[]
n = len(ax.patches)
for i in range(int(n/2)):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i+ int(n/2)].get_height()) *100),
            ha="center", fontsize=10)
    
for i in range(int(n/2),n):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i- int(n/2)].get_height()) *100),
            ha="center", fontsize=10)
plt.show()

Debt-to-income

  • The higher the debt-to-income ratio, the higher the chance that the loan is in bad condition.
  • Debt-to-income (dti) is the borrower's total monthly payments on their debt obligations divided by the borrower's self-reported monthly income; it reflects the borrower's ability to repay their loans out of income. For example, a borrower paying $1,500 per month toward debt on $6,000 of monthly income has a dti of 25%.
In [62]:
# Check distribution of debt-to-income ratio
f, axes = plt.subplots(2, 2, figsize=(10, 6),  sharey=True)
plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.5, wspace=0.4)
# Histogram 1 - debt-to-income ratio
sns.distplot(xtrain['dti'], hist=True, kde=True, 
             bins=int(180/5), 
             kde_kws={'linewidth': 2},ax = axes[0, 0])
axes[0,0].set_xlabel('debt-to-income ratio')
axes[0,0].set_title('Distribution of debt-to-income ratio')

# Histogram 2 - debt-to-income ratio between bad and good loans
sns.distplot(xtrain.loc[xtrain['target']==0,'dti'], hist=True, kde=True, 
             bins=int(180/5), label='0',
             kde_kws={'linewidth': 2},ax = axes[0,1 ])
sns.distplot(xtrain.loc[xtrain['target']==1,'dti'], hist=True, kde=True, 
             bins=int(180/5),  label='1',
             kde_kws={'linewidth': 2},ax = axes[0, 1])
axes[0,1].set_xlabel('debt-to-income')
axes[0,1].set_title('debt-to-income by loan status')

# Histogram 3 - debt to income ratio filtered for less-than-100-percent
sns.distplot(xtrain.loc[xtrain['dti']<100,"dti"], hist=True, kde=True, 
             bins=int(180/5), 
             kde_kws={'linewidth': 2},ax = axes[1,0])
axes[1,0].set_xlabel('debt-to-income ratio')
axes[1,0].set_title('Debt-to-income ratio (filtered <100%)')

#Histogram 4 - debt-to-income ratio between good and bad loans filtered for less-than-100-percent
sns.distplot(xtrain.loc[(xtrain['dti']<100) & (xtrain['target']==0),'dti'], 
             hist=True, kde=True, 
             bins=int(180/5), label='0',
             kde_kws={'linewidth': 2},ax = axes[1,1])
sns.distplot(xtrain.loc[(xtrain['dti']<100) & (xtrain['target']==1),'dti'], 
            hist=True, kde=True, 
             bins=int(180/5),  label='1',
             kde_kws={'linewidth': 2},ax = axes[1,1])
axes[1,1].set_xlabel('debt-to-income')
axes[1,1].set_title('debt-to-income by loan status(filtered <100%)')


plt.legend()
plt.show()

print("Skewness: %f" % xtrain['dti'].skew())
print("Kurtosis: %f" % xtrain['dti'].kurt())
Skewness: 29.252373
Kurtosis: 1750.309492
In [63]:
print("Debt-to-income ratio")
print("Good loans", xtrain.loc[xtrain['target']==0,'dti'].median())
print("Bad loans", xtrain.loc[xtrain['target']==1,'dti'].median())
Debt-to-income ratio
Good loans 17.56
Bad loans 19.65

Since the debt-to-income ratio is highly right-skewed, I filtered for records with a debt-to-income ratio below 100% to get a closer look at the bulk of the data. Within that range, the distribution is close to normal between 0 and 40%. On average, the debt-to-income ratio is higher for bad loans than for good loans.

In [64]:
print('Debt-to-income(dti) Quantiles:')
print("Low dti (below 25%): dti <",  np.quantile(np.array(xtrain['dti'].dropna()), 0.25) )
print("Median dti (between 25% and 75%):", np.quantile(np.array(xtrain['dti'].dropna()), 0.25),
       "<= dti <",  np.quantile(np.array(xtrain['dti'].dropna()), 0.75))
print("High dti (above 75% below 95%): ", np.quantile(np.array(xtrain['dti'].dropna()), 0.75),
       "<= dti <", np.quantile(np.array(xtrain['dti'].dropna()), 0.95))
print("Extreme high dti (Above 95%): dti >",
      np.quantile(np.array(xtrain['dti'].dropna()), 0.95))
Debt-to-income(dti) Quantiles:
Low dti (below 25%): dti < 11.89
Median dti (between 25% and 75%): 11.89 <= dti < 24.48
High dti (above 75% below 95%):  24.48 <= dti < 33.88
Extreme high dti (Above 95%): dti > 33.88
In [65]:
# Bin loans into dti categories based on the quantiles above
dti_df = xtrain[['dti', 'target']]
dti_df.loc[dti_df['dti'] < 11.89,'dti_cat'] = 'low'
dti_df.loc[(dti_df['dti'] >= 11.89)
              & (dti_df['dti'] < 24.48),'dti_cat'] = 'median'
dti_df.loc[(dti_df['dti'] >= 24.48)
              & (dti_df['dti'] < 33.88),'dti_cat'] = 'high'
dti_df.loc[dti_df['dti'] >= 33.88,'dti_cat'] = 'xhigh'
In [66]:
#Good vs Bad Loans by Debt-to-Income Level
dti_table = dti_df.groupby(['dti_cat', 'target']).size().reset_index().pivot(
    columns='target', index='dti_cat', values=0).sort_values(by=1, ascending = False)
ax = dti_table.plot.bar(stacked=True ,figsize= (8, 6))
                             
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Debt-to-Income Level", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Borrower's Debt-to-Income Level by Loan status", fontsize=15)
# ax.patches holds the 4 segments of one target level followed by the 4 segments of the other;
# pairing segment i with segment i +/- 4 (same dti category) gives that segment's share of the category.
for i in range(4):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i+4].get_height()) *100),
            ha="center", fontsize=10)
    
# Annotate the other target level's segments the same way
for i in range(4,8):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i-4].get_height()) *100),
            ha="center", fontsize=10)
plt.show()

Dividing loans into 4 groups based on the borrowers' debt-to-income ratio:

  • Low dti (below 25%): dti < 11.89
  • Median dti (between 25% and 75%): 11.89 <= dti < 24.48
  • High dti (above 75% below 95%): 24.48 <= dti < 33.88
  • Extreme high dti (Above 95%): dti > 33.88

For the high and extreme-high debt-to-income groups, the percentage of loans in bad condition (label = 1) is the highest, above 16%. Loans whose borrowers have a median debt-to-income ratio turn bad about 13% of the time, and borrowers with a low debt-to-income ratio are the least likely to produce bad loans (10.6%).

In short, the plot above shows that the higher the debt-to-income ratio, the higher the chance that a loan is in bad condition (a compact cross-check is sketched below).
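
The same per-bucket bad-loan rates can be cross-checked more compactly with pd.cut and groupby; a minimal sketch reusing the cut points above (pd.cut treats bin edges as right-inclusive, a negligible difference from the manual binning above):

# Sketch: bad-loan rate per dti bucket, same cut points as the manual binning
bins = [-np.inf, 11.89, 24.48, 33.88, np.inf]
labels = ['low', 'median', 'high', 'xhigh']
dti_bucket = pd.cut(xtrain['dti'], bins=bins, labels=labels)
bad_rate = xtrain.groupby(dti_bucket)['target'].mean() * 100   # % of target=1 loans per bucket
print(bad_rate.round(1))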

Loan Purpose

  • Top 3 reasons people borrow on Lending Club: debt consolidation, credit card, and home improvement
In [67]:
# Loan purpose distribution
plt.figure(figsize = (10,6))
ax = sns.countplot(x="purpose", data=xtrain, order = xtrain.purpose.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Loan Purpose", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Purpose Distribution", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=10) 

Note that about 56% of the loans are used for debt_consolidation, 22.87% for credit_card, and 6.66% for home_improvement.

Collinearity

To detect collinearity among variables, we can simply build a correlation matrix and look for variable pairs with large absolute correlation coefficients. Multicollinearity, however, is much harder to detect this way because it involves more than two variables. For now we focus on identifying collinearity within the data; later, we will apply the Variance Inflation Factor (VIF) method to detect multicollinearity.

In [68]:
# Separate numeric variables from categorical variables
cols = xtrain.columns.tolist()
num_cols = xtrain._get_numeric_data().columns.tolist()
cat_cols = [x for x in cols if x not in num_cols] 

Correlation analysis with the Pearson method

In [69]:
#  Correlation analysis with pearson method
xtrain_pcorr_matrix = xtrain[num_cols].corr(method="pearson")
In [70]:
with open('xtrain_pcorr.pickle', 'wb') as f:
    pickle.dump(xtrain_pcorr_matrix, f)
# with open('xtrain_pcorr.pickle', 'rb') as f:
#     xtrain_pcorr_matrix = pickle.load(f)
In [71]:
# High correlation with target variables using pearson method
pcorr_target = xtrain_pcorr_matrix["target"].abs().sort_values(kind="quicksort", ascending=False)
pcorr_target.head(12)
Out[71]:
target                     1.000000
int_rate                   0.211981
sec_app_fico_range_low     0.138020
sec_app_fico_range_high    0.138019
fico_range_low             0.122275
fico_range_high            0.122273
issue_year                 0.116399
sec_app_inq_last_6mths     0.099210
acc_open_past_24mths       0.094778
inq_last_6mths             0.085028
num_tl_op_past_12m         0.083668
bc_open_to_buy             0.082026
Name: target, dtype: float64
In [72]:
# Unstack covariance matrix pairs
xtrain_pcorr_sorted = xtrain_pcorr_matrix.abs().unstack().sort_values(kind="quicksort",
                                                                      ascending=False)
In [73]:
# Table of variable pairs and correlation coefficients - pearson method
pcorr_df = pd.DataFrame([xtrain_pcorr_sorted[xtrain_pcorr_sorted != 1.0].drop_duplicates()]).T.reset_index()
# Sorted for highest correlated variable pairs
pcorr_df.head(10)
Out[73]:
level_0 level_1 0
0 sec_app_fico_range_high sec_app_fico_range_low 1.000000
1 fico_range_low fico_range_high 1.000000
2 funded_amnt loan_amnt 0.999757
3 num_sats open_acc 0.998965
4 num_actv_rev_tl num_rev_tl_bal_gt_0 0.983665
5 tot_hi_cred_lim tot_cur_bal 0.975460
6 total_bal_il total_il_high_credit_limit 0.952008
7 funded_amnt installment 0.945992
8 installment loan_amnt 0.945652
9 total_bal_ex_mort total_bal_il 0.898994
In [74]:
# Build two sets of drop candidates: one for pairs with correlation >= 0.7, one for pairs with correlation >= 0.9.
# For each high-correlation pair, flag the variable whose average absolute correlation with all other variables is higher.
pcorr9_cols=set()
pcorr7_cols=set()
for i in range(len(pcorr_df)):
    if pcorr_df[0][i] >= 0.7:
        if xtrain_pcorr_matrix[pcorr_df.level_0[i]].abs().mean()> xtrain_pcorr_matrix[pcorr_df.level_1[i]].abs().mean():
            pcorr7_cols.add(pcorr_df.level_0[i])     
        else :
            pcorr7_cols.add(pcorr_df.level_1[i])
            
    if pcorr_df[0][i] >= 0.9:
        if xtrain_pcorr_matrix[pcorr_df.level_0[i]].abs().mean()> xtrain_pcorr_matrix[pcorr_df.level_1[i]].abs().mean():
            pcorr9_cols.add(pcorr_df.level_0[i])     
        else :
            pcorr9_cols.add(pcorr_df.level_1[i])
            
              
In [75]:
# variables with corr >= 0.7
pcorr7_cols
Out[75]:
{'acc_now_delinq',
 'acc_open_past_24mths',
 'annual_inc_joint',
 'bc_util',
 'fico_range_low',
 'funded_amnt',
 'loan_amnt',
 'mort_acc',
 'mths_since_last_delinq',
 'mths_since_recent_bc_dlq',
 'num_actv_rev_tl',
 'num_bc_sats',
 'num_op_rev_tl',
 'num_rev_accts',
 'num_rev_tl_bal_gt_0',
 'num_sats',
 'num_tl_op_past_12m',
 'open_acc',
 'open_il_24m',
 'open_rv_24m',
 'revol_util',
 'sec_app_fico_range_high',
 'sec_app_open_acc',
 'tot_cur_bal',
 'tot_hi_cred_lim',
 'total_acc',
 'total_bal_ex_mort',
 'total_bc_limit',
 'total_il_high_credit_limit',
 'total_rev_hi_lim'}
In [76]:
# variables with corr >=0.9
pcorr9_cols
Out[76]:
{'fico_range_low',
 'funded_amnt',
 'loan_amnt',
 'num_actv_rev_tl',
 'open_acc',
 'sec_app_fico_range_high',
 'tot_hi_cred_lim',
 'total_il_high_credit_limit'}

Correlation analysis with the Spearman method

In [77]:
#  Correlation analysis with spearman method
xtrain_spcorr_matrix = xtrain[num_cols].corr(method="spearman")
xtrain_spcorr_sorted = xtrain_spcorr_matrix.abs().unstack().sort_values(kind="quicksort",
                                                                        ascending=False)
In [87]:
# save the correlation matrix
with open('xtrain_spearcorr.pickle', 'wb') as f:
    pickle.dump([xtrain_spcorr_matrix, xtrain_spcorr_sorted], f)
In [79]:
# with open('xtrain_spearcorr.pickle', 'rb') as f:
#     xtrain_spcorr_matrix, xtrain_spcorr_sorted = pickle.load(f)
In [80]:
# High correlation with target variables using spearman method
spcorr_target = xtrain_spcorr_matrix["target"].abs().sort_values(kind="quicksort", 
                                                                 ascending=False)
spcorr_target.head(10)
Out[80]:
target                     1.000000
int_rate                   0.210503
issue_year                 0.135719
sec_app_fico_range_high    0.131136
sec_app_fico_range_low     0.131136
fico_range_low             0.124864
fico_range_high            0.124864
bc_open_to_buy             0.096619
acc_open_past_24mths       0.095575
sec_app_inq_last_6mths     0.086528
Name: target, dtype: float64
In [81]:
# Table of variable pairs and correlation coefficients - spearman method
spcorr_df = pd.DataFrame([xtrain_spcorr_sorted[xtrain_spcorr_sorted != 1.0].drop_duplicates()]).T.reset_index()
# Sorted for highest correlated variable pairs
spcorr_df.head(10)
Out[81]:
level_0 level_1 0
0 loan_amnt funded_amnt 0.999777
1 num_sats open_acc 0.998739
2 num_actv_rev_tl num_rev_tl_bal_gt_0 0.991042
3 tot_cur_bal tot_hi_cred_lim 0.972331
4 installment funded_amnt 0.964788
5 loan_amnt installment 0.964446
6 total_il_high_credit_limit total_bal_il 0.956190
7 tot_cur_bal avg_cur_bal 0.925816
8 mths_since_recent_revol_delinq mths_since_recent_bc_dlq 0.886944
9 total_bal_ex_mort total_bal_il 0.882222
In [82]:
# For each high-correlation pair, flag the variable whose average absolute correlation with all other variables is higher.
spcorr7_cols=set()
for i in range(len(spcorr_df)):
    if spcorr_df[0][i]>=0.7: 
        if xtrain_spcorr_matrix[spcorr_df.level_0[i]].abs().mean()> xtrain_spcorr_matrix[spcorr_df.level_1[i]].abs().mean():
            spcorr7_cols.add(spcorr_df.level_0[i]) 
        else :
            spcorr7_cols.add(spcorr_df.level_1[i])
      
        
spcorr9_cols=set()
for i in range(len(spcorr_df)):
    if spcorr_df[0][i]>=0.9:
        if xtrain_spcorr_matrix[spcorr_df.level_0[i]].abs().mean()> xtrain_spcorr_matrix[spcorr_df.level_1[i]].abs().mean():
            spcorr9_cols.add(spcorr_df.level_0[i]) 
        else :
            spcorr9_cols.add(spcorr_df.level_1[i])

            
          
In [83]:
# variables with correlation coefficients >0.9 (spearman method)
spcorr9_cols
Out[83]:
{'funded_amnt',
 'loan_amnt',
 'num_actv_rev_tl',
 'num_sats',
 'tot_cur_bal',
 'tot_hi_cred_lim',
 'total_bal_il'}
In [84]:
# Combine the results of the Pearson and Spearman methods (corr >= 0.7)
corr7_vars = list(set(spcorr7_cols.union(pcorr7_cols)))
corr7_vars
Out[84]:
['open_acc_6m',
 'mths_since_last_delinq',
 'num_sats',
 'num_tl_op_past_12m',
 'total_bal_il',
 'loan_amnt',
 'total_il_high_credit_limit',
 'annual_inc_joint',
 'fico_range_low',
 'open_acc',
 'bc_open_to_buy',
 'delinq_amnt',
 'delinq_2yrs',
 'num_actv_rev_tl',
 'open_il_24m',
 'sec_app_open_acc',
 'mort_acc',
 'mths_since_recent_bc_dlq',
 'tot_cur_bal',
 'bc_util',
 'acc_now_delinq',
 'acc_open_past_24mths',
 'num_bc_sats',
 'pub_rec',
 'num_il_tl',
 'num_rev_tl_bal_gt_0',
 'mo_sin_rcnt_rev_tl_op',
 'funded_amnt',
 'total_bc_limit',
 'sec_app_fico_range_high',
 'num_op_rev_tl',
 'mths_since_recent_revol_delinq',
 'total_rev_hi_lim',
 'total_acc',
 'open_rv_12m',
 'revol_bal',
 'mo_sin_rcnt_tl',
 'tot_hi_cred_lim',
 'open_rv_24m',
 'mths_since_rcnt_il',
 'num_rev_accts',
 'revol_util',
 'total_bal_ex_mort']
In [85]:
# Variables with correlation >= 0.9 (combining the Pearson and Spearman methods)
corr9_vars = list(set(spcorr9_cols.union(pcorr9_cols)))
corr9_vars
Out[85]:
['num_sats',
 'tot_hi_cred_lim',
 'total_bal_il',
 'funded_amnt',
 'loan_amnt',
 'num_actv_rev_tl',
 'total_il_high_credit_limit',
 'sec_app_fico_range_high',
 'fico_range_low',
 'tot_cur_bal',
 'open_acc']

Dimension Reduction Consideration

Correlation analysis gives us a good starting point for understanding how closely variables are related to one another. Data redundancy exists when a dataset contains variables that are highly correlated with other variables. Performing dimension reduction based on the correlation analysis not only removes that redundancy and speeds up processing, but also reduces model complexity and helps prevent overfitting. Since we will also investigate multicollinearity with the Variance Inflation Factor (VIF) method, I will perform the dimension reduction later, combining the results of the correlation analysis and the VIF analysis; a VIF sketch follows below.
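
A minimal sketch of the VIF calculation with statsmodels, shown on a small illustrative set of numeric columns (the column list and the simple dropna handling here are assumptions, not the final feature set):

# Sketch: Variance Inflation Factor (VIF) for a few numeric predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

vif_cols = ['loan_amnt', 'int_rate', 'dti', 'annual_inc', 'open_acc']  # illustrative subset only
X_vif = add_constant(xtrain[vif_cols].dropna())

vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                index=X_vif.columns)
print(vif.drop('const').sort_values(ascending=False))  # VIF above ~5-10 suggests multicollinearity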

The Pearson and Spearman methods only measure correlation between numeric variables. For pairs of categorical variables, we can test independence with a Chi-squared test of independence; for a categorical variable versus a numeric variable, we can run a one-way ANOVA test, comparing the between-group variance with the within-group variance. A sketch of both tests follows.
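
A minimal sketch of both tests with scipy.stats; the columns grade, home_ownership, and dti are used purely for illustration:

from scipy import stats

# Categorical vs categorical: Chi-squared test of independence
contingency = pd.crosstab(xtrain['grade'], xtrain['home_ownership'])
chi2, p_chi, dof, _ = stats.chi2_contingency(contingency)
print('Chi-squared p-value: {:.4f}'.format(p_chi))

# Categorical vs numeric: one-way ANOVA (does mean dti differ across grades?)
groups = [g['dti'].dropna().values for _, g in xtrain.groupby('grade')]
f_stat, p_anova = stats.f_oneway(*groups)
print('ANOVA p-value: {:.4f}'.format(p_anova))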