a. Identify the most important attributes separating bad loans from good loans
b. Build an XGBoost model to make predictions
a. Data first impression
- available predicting variables
- data distribution
- missing values
b. Split dataset into training and test set
c. Correlation analysis and removal of multicollinearity
d. Data visualization to explore relationships between the target and predicting variables
a. Handle missing values
b. Transform character/categorical variables into numeric
c. Create new features from existing features
- Standard scaling
- Handle dataset imbalance (upsampling)
1) Logistic regression with L1 regularization
2) Random Forest model
3) XGBoost model
- Hyperparameter tuning
- Remove variables according to correlation analysis
- Logistic regression with L1 regularization (non-zero coefficients)
- Random Forest built-in feature importance
The project is split into 3 notebooks. This notebook is part one, focusing on exploratory data analysis.
import os
import re
import math
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.offsetbox import AnchoredText
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import (Ridge, Lasso, RidgeCV, LassoCV, LassoLarsCV,
                                  ElasticNet, LogisticRegression)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

import xgboost as xgb
from xgboost import XGBClassifier

from imblearn.over_sampling import RandomOverSampler, SMOTE, SMOTENC

matplotlib.rcParams.update({'font.size': 10})
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
# load data
data = pd.read_csv("./902_370089_bundle_archive/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv")
data.shape
data.head()
data.info(verbose=True, max_cols=True, null_counts=True)
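As a quick complement to .info(), we can rank columns by their share of missing values (a small sketch; missing values themselves are handled in the preprocessing notebook):
# Share of missing values per column, highest first
missing_share = data.isnull().mean().sort_values(ascending=False)
missing_share.head(20)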
# Loading in the data dictionary
xls = pd.ExcelFile('LCDataDictionary.xlsx')
dict1 = pd.read_excel(xls, 'LoanStats')
dict2 = pd.read_excel(xls, 'browseNotes')
# Column Names in the data dictionary
print(dict1.columns.tolist())
print(dict2.columns.tolist())
dict1.loc[dict1['LoanStatNew'].isnull(),:]
dict2.loc[dict2['BrowseNotesFile'].isnull(),:]
# Remove NaN from data dictionary LoanStatNew
dict1.dropna(axis=0, subset=['LoanStatNew'], inplace= True)
# Remove NaN from data dictionary BrowseNotesFile
dict2.dropna(axis=0, subset=['BrowseNotesFile'], inplace= True)
print('Number of Features in LoanStatNew:' ,dict1.shape[0])
print('Number of Features in BrowseNotesFile:', dict2.shape[0])
# available columns in data dictionary LoanStatNew
dict1['LoanStatNew'].tolist()
# available columns in data dictionary BrowseNotesFile
dict2['BrowseNotesFile'].tolist()
# All columns in the dataset
data.columns.tolist()
What is data leakage in machine learning?
Data leakage occurs when some of the predictors (features/independent variables) used to train a predictive model contain information about the target variable that would not be available at prediction time.
Note that some of the columns/features in the dataset are not available at the moment of investing; they only exist after the loans have been issued. These features should be removed, otherwise they would leak information from the future. We can restrict ourselves to the variables listed in the data dictionary sheet BrowseNotesFile, which contains only the information available to investors when they are deciding whether to invest in a loan.
Also note that some features are named slightly differently in the data dictionary sheet than in the dataset's column names even though they refer to the same thing, so we need to match the naming formats.
# Noted format differences between data dictionary BrowseNote and columns name in the dataset
index2 = [x not in dict2['BrowseNotesFile'].tolist() for x in data.columns.tolist()]
pd.Series(data.columns.tolist())[index2].head()
# Match the format of data dictionary to the columns names in the dataset
dictionary = dict2['BrowseNotesFile'].dropna().values
dictionary = [re.sub('(?<![0-9_])(?=[A-Z0-9])', '_', x).lower().strip() for x in dictionary]
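As an illustrative sanity check (these example names are for demonstration only), the substitution converts the dictionary's camelCase names into the dataset's snake_case convention:
# Example conversions: camelCase dictionary names -> snake_case column names
print(re.sub('(?<![0-9_])(?=[A-Z0-9])', '_', 'annualInc').lower().strip())          # annual_inc
print(re.sub('(?<![0-9_])(?=[A-Z0-9])', '_', 'mthsSinceRecentInq').lower().strip()) # mths_since_recent_inq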
# Find features that are in dataset but not in the BrowseNote
index3 = [x not in dictionary for x in data.columns.tolist()]
pd.Series(data.columns.tolist())[index3].head()
# Features that are in the BrowseNotesFile but not in the dataset
index4 = [x not in data.columns.tolist()for x in dictionary]
pd.Series(dictionary)[index4]
# Manually match the spelling between column names in dataset and data dictionary 'BrowseNote'
uni_dict = ['is_inc_v','mths_since_oldest_il_open','mths_since_most_recent_inq',
'mths_since_recent_loan_delinq','verified_status_joint']
uni_dataset = ['verification_status', 'mo_sin_old_il_acct','mths_since_recent_inq',
'mths_since_recent_bc_dlq', 'verification_status_joint']
dictionary = np.setdiff1d(dictionary, uni_dict)
dictionary = np.append(dictionary,uni_dataset)
Create a list containing all features that are available in the dataset and also available to investors at the moment of investing
dictionary = np.intersect1d(dictionary, data.columns.tolist())
# Skim through the available features
dictionary
Note that issue_d is the only column in the dataset that reveals when a loan was originated. Although it is the date when the loan was issued (technically information from the future), we extract the year and month of the issue date, which represent the time period in which investors were making decisions about those loans.
# Adding back the target variable 'loan status' and 'issue_d' to the dictionary list
dictionary = np.append(dictionary, np.array(['loan_status', 'issue_d' ]))
# Drop features that are not available at the moment of investing, except for loan_status (target variable) and issue_d
cols_to_remove = [x not in dictionary for x in data.columns.tolist()]
cols_to_remove = pd.Series(data.columns.tolist())[cols_to_remove]
data.drop(columns=cols_to_remove, inplace = True)
Now the dataset contains only features that are available to investors at the moment of investing; the features that leak information from the future have been removed.
data.shape
# Check if there is any missing values in the target variable, loan status
data.loan_status.isnull().sum()
# Check rows where loan status is NaN. Note that these rows contain no loan information at all.
data.loc[data.loan_status.isnull(),:].head()
# remove rows where loan status is missing
data.dropna(axis=0, subset=['loan_status'], inplace= True)
# Let's check the distribution of target variable, loan status
data.loan_status.value_counts()
# Loan status distribution
plt.figure(figsize = (10,6))
ax = sns.countplot(x="loan_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Loan Status", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Status Types Distribution", fontsize=15)
sizes = []
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x() + p.get_width()/2.,
            height + 4,
            '{:1.2%}'.format(height/len(data)),
            ha="center", fontsize=10)
There are 9 loan statuses in the dataset (listed in the value counts above).
We consider "Fully Paid", "Current", and "Does not meet the credit policy. Status:Fully Paid" to be good-condition loans and everything else to be bad-condition loans.
# define bad_indicators
bad_indicators = ["Charged Off ",
"Charged Off",
"Default",
"Does not meet the credit policy. Status:Charged Off",
"In Grace Period",
"Default Receiver",
"Late (16-30 days)",
"Late (31-120 days)"]
# Create a new target variable (1 = bad loan, 0 = good loan)
data['target'] = data['loan_status'].isin(bad_indicators).astype(int)
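As an optional sanity check, we can confirm that each loan status maps to the intended label (since the mapping is deterministic, each group mean should be exactly 0 or 1):
# Each loan_status should map to exactly one target value
data.groupby('loan_status')['target'].mean()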
# "Good" and "Bad" Loans Distribution
ax = sns.countplot(x="target", data=data)
ax.set_xlabel("Loan Status Categories", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Loan Status Types Distribution", fontsize=18)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 4,
            '{:1.2%}'.format(height/len(data)),
            ha="center", fontsize=10)
Note that the dataset is imbalanced: the minority group makes up 13.43% of the population, compared to 86.57% for the majority group.
y = data['target']
# Split the dataset into training and test sets with stratified random sampling
xtrain, xtest, ytrain, ytest = train_test_split(data, y, test_size=0.20,
random_state=1, stratify=y)
# Make sure the training and test sets have similar proportions of the majority and minority groups
ytrain.mean()
ytest.mean()
# save train, test sets for later use
xtrain.to_csv("xtrain.csv", index=False)
xtest.to_csv("xtest.csv", index=False)
ytrain.to_csv("ytrain.csv", index=False)
ytest.to_csv("ytest.csv", index=False)
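Handling the class imbalance is left to the preprocessing notebook. As a preview, here is a minimal sketch of random oversampling applied to the training split only (never the test split), using the imblearn imports above; this assumes a recent imblearn version that accepts DataFrames containing missing values:
# Preview only: oversample the minority class in the training set
ros = RandomOverSampler(random_state=1)
xtrain_res, ytrain_res = ros.fit_resample(xtrain, ytrain)
print(pd.Series(ytrain_res).value_counts(normalize=True))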
Let's explore the data to see how features relate to loan condition. I will focus on the following:
Loan amount vs funded amount
Interest rate
Loan term
Loan issue year
Borrower's address (state)
Borrower's annual income
Borrower's employment length
Loan grade
Home ownership
Debt-to-income ratio
Loan purpose
# Compare Loan Amount and Funded Amount
f, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.distplot(xtrain['loan_amnt'], hist=True, kde=True,
bins=int(180/5), color="b",
kde_kws={'linewidth': 2}, ax = axes[0])
axes[0].set_title('Loan Amount (in 1000s)')
axes[0].set_xlabel('loan_amnt')
sns.distplot(xtrain['funded_amnt'], hist=True, kde=True,
bins=int(180/5), color="g",
kde_kws={'linewidth': 2}, ax = axes[1])
axes[1].set_title('Funded Amount (in 1000s)')
axes[1].set_xlabel('funded_amnt')
plt.show()
print("Loan Amount")
print("Skewness: %f" % xtrain['loan_amnt'].skew())
print("Kurtosis: %f" % xtrain['loan_amnt'].kurt())
print("Funded Amount")
print("Skewness: %f" % xtrain['funded_amnt'].skew())
print("Kurtosis: %f" % xtrain['funded_amnt'].kurt())
print("Average Loan Amount : " + str(round(xtrain.loan_amnt.mean(),2)))
# Check distribution of loan interest rate
f, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.distplot(xtrain['int_rate'], hist=True, kde=True,
bins=int(180/5),
kde_kws={'linewidth': 2},ax = axes[0])
axes[0].set_xlabel('int_rate')
axes[0].set_title('int_rate')
sns.distplot(xtrain.loc[xtrain['target']==0,'int_rate'], hist=True, kde=True,
bins=int(180/5), label='0',
kde_kws={'linewidth': 2},ax = axes[1])
sns.distplot(xtrain.loc[xtrain['target']==1,'int_rate'], hist=True, kde=True,
bins=int(180/5), label='1',
kde_kws={'linewidth': 2},ax = axes[1])
axes[1].set_xlabel('int_rate')
axes[1].set_title('int_rate by loan status')
plt.legend()
plt.show()
print("Skewness: %f" % xtrain['int_rate'].skew())
print("Kurtosis: %f" % xtrain['int_rate'].kurt())
print('Average Loan Interest Rate : ')
print('Good Loans:', xtrain.loc[xtrain['target']==0,'int_rate'].mean())
print('Bad Loans:', xtrain.loc[xtrain['target']==1,'int_rate'].mean())
Based on the distribution plots and mean statistics, bad loans tend to have higher interest rates than good loans at Lending Club.
# Loan interest rate by loan grade
ax = sns.boxplot(x="grade", y="int_rate", data=xtrain, order=["A","B","C","D","E","F","G"])
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("Loan Grade Categories", fontsize=10)
ax.set_ylabel("Loan IR", fontsize=10)
ax.set_title("Loan IR by Loan Grade", fontsize=15)
plt.show()
The plot shows that the better the loan grade, the lower the interest rate.
# Loan term by loan status
ax = sns.countplot( x="term", hue="target", data=xtrain)
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("Loan Status Categories", fontsize=10)
ax.set_ylabel("Loan Term", fontsize=10)
ax.set_title("Loan Term by Loan Status", fontsize=15)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=10)
The plot above shows that loans with the shorter term (36 months, compared to 60 months) are more likely to turn into bad loans.
# Convert issue date to year and month separate variables
xtrain["issue_year"]= pd.to_datetime(xtrain.issue_d).dt.year
#xtrain['issue_y'] = pd.to_datetime(issue_d).dt.year.dropna().astype('int32')
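For completeness, the issue month can be extracted the same way (a sketch; only the year is used in the plots below):
# Extract the issue month as well (not used in the plots below)
xtrain["issue_month"] = pd.to_datetime(xtrain.issue_d).dt.month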
#Total Loans by Years - Good vs Bad
plt.figure(figsize=(10,6))
ax = sns.countplot(x = 'issue_year', data=xtrain)
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("Years", fontsize=10)
ax.set_ylabel("Count Loans", fontsize=10)
ax.set_title("Loans Issued by Years ", fontsize=15)
sizes = []
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x() + p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=12)
ax.set_ylim(0, max(sizes) * 1.10)
plt.show()
Address (state)
Income
Average length of employment
Loan grade
Home ownership
Debt-to-income ratio
Loan purpose
#Total Loans by States - Good vs Bad
plt.figure(figsize=(26,14))
ax = sns.countplot(x='addr_state', data=xtrain,
order = xtrain['addr_state'].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("States", fontsize=10)
ax.set_ylabel("Count Loans", fontsize=10)
ax.set_title("Total Loans by Borrower's Address(States)", fontsize=15)
sizes = []
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x() + p.get_width()/2.,
            height + 10,
            '{:1.1f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=12)
ax.set_ylim(0, max(sizes) * 1.10)
plt.show()
# Count good and bad loans by state
state_df = xtrain.groupby(['addr_state', 'target']).size().reset_index().pivot(
columns='target', index='addr_state', values=0).sort_values(by=1, ascending = False)
#Total Loans by States - Good vs Bad
ax = state_df.plot.bar(stacked=True ,figsize= (18, 12))
#ax.set_xticklabels(ax.get_xticklabels())
ax.set_xlabel("States", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Total Loans by Borrower's Address(States) - Good vs Bad Loans", fontsize=15)
#sizes=[]
for i in range(51):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.0f}%'.format(height/(height + ax.patches[i+51].get_height()) * 100),
            ha="center", fontsize=10)
for i in range(51, 102):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.0f}%'.format(height/(height + ax.patches[i-51].get_height()) * 100),
            ha="center", fontsize=10)
plt.show()
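The percentage-annotation pattern above is repeated for several stacked bar charts below. A small helper like the following (a sketch, not part of the original notebook) could remove the duplication; it assumes ax.patches holds the bars of the first stacked column followed by the bars of the second:
def annotate_stacked_pct(ax, n_groups, fmt='{:1.1f}%', fontsize=10):
    # Label each segment of a two-column stacked bar chart with its share of the bar total
    totals = [ax.patches[i].get_height() + ax.patches[i + n_groups].get_height()
              for i in range(n_groups)]
    for i, p in enumerate(ax.patches[:2 * n_groups]):
        x, y = p.get_xy()
        ax.text(x + p.get_width()/2., y + p.get_height()/2.,
                fmt.format(p.get_height()/totals[i % n_groups]*100),
                ha="center", fontsize=fontsize)

# Example usage: annotate_stacked_pct(ax, len(state_df))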
state_df['default_rate'] =state_df[1]/(state_df[0]+state_df[1])
# Top 10 states with the lowest average default rate
state_df.sort_values(by='default_rate').head(10)
# Top 10 states with the highest average default rate
state_df.sort_values(by='default_rate', ascending = False).head(10)
Both the bar plots and the summary tables (default rate by state) show that some states have higher default rates than others.
# Check distribution of Borrower's Annual Income
# These plots are hard to read because of skewness and outliers
f, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.distplot(xtrain['annual_inc'], hist=True, kde=True, bins=int(180/5),
kde_kws={'linewidth': 2},ax = axes[0])
axes[0].set_xlabel('annual income')
axes[0].set_title('Distribution of Annual Income')
sns.distplot(xtrain.loc[xtrain['target']==0,'annual_inc'], hist=True, kde=True,
bins=int(180/5), label='0', kde_kws={'linewidth': 2},ax = axes[1])
sns.distplot(xtrain.loc[xtrain['target']==1,'annual_inc'], hist=True, kde=True,
bins=int(180/5), label='1', kde_kws={'linewidth': 2},ax = axes[1])
axes[1].set_xlabel('annual_inc')
axes[1].set_title('Annual Income by loan status')
plt.legend()
plt.show()
print("Skewness: %f" % xtrain['annual_inc'].skew())
print("Kurtosis: %f" % xtrain['annual_inc'].kurt())
# Check distribution of Borrower's Annual Income, filtered for annual_inc < $2,000,000
f, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
sns.distplot(xtrain.loc[xtrain['annual_inc'] < 2000000,'annual_inc'], hist=True, kde=True,
bins=int(180/5), kde_kws={'linewidth': 2},ax = axes[0])
axes[0].set_xlabel('annual income')
axes[0].set_title('Distribution of Annual Income ( < $2,000,000)')
sns.distplot(xtrain.loc[(xtrain['target']==0) & (xtrain['annual_inc'] < 2000000),'annual_inc'], hist=True, kde=True,
bins=int(180/5), label='0',kde_kws={'linewidth': 2},ax = axes[1])
sns.distplot(xtrain.loc[(xtrain['target']==1) & (xtrain['annual_inc'] < 2000000),'annual_inc'], hist=True, kde=True,
bins=int(180/5), label='1',kde_kws={'linewidth': 2},ax = axes[1])
axes[1].set_xlabel('annual_inc')
axes[1].set_title('Annual Income by loan status ( < $2,000,000)')
plt.legend()
plt.show()
print("Skewness: %f" % xtrain['annual_inc'].skew())
print("Kurtosis: %f" % xtrain['annual_inc'].kurt())
print('Annual Income Quantiles:')
print("Low Income (below 25%): Annual_Inc <", np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.25) )
print("Median Income (between 25% and 75%):", np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.25),
"< Annual_Inc <",
np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.75))
print("High Income (above 75% below 95%): ", np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.75),
"< Annual_Inc <",
np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.95))
print("Extreme high income (Above 95%): Annual_Inc >",
np.quantile(np.array(xtrain['annual_inc'].dropna()), 0.95))
# Bin borrowers into income levels based on the quantiles above
income_df = xtrain[['annual_inc', 'target']].copy()
income_df.loc[income_df['annual_inc'] < 46000.0, 'inc_cat'] = 'low_inc'
income_df.loc[(income_df['annual_inc'] >= 46000.0)
              & (income_df['annual_inc'] < 93000.0), 'inc_cat'] = 'median_inc'
income_df.loc[(income_df['annual_inc'] >= 93000.0)
              & (income_df['annual_inc'] < 160000.0), 'inc_cat'] = 'high_inc'
income_df.loc[income_df['annual_inc'] >= 160000.0, 'inc_cat'] = 'xhigh_inc'
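An equivalent way to build these bins is pd.cut with the quantile cut points printed above (a sketch; note that pd.cut uses right-closed intervals, which differs slightly from the .loc version at the exact boundaries):
# Same binning with pd.cut; np.inf covers incomes above the last cut point
income_df['inc_cat'] = pd.cut(income_df['annual_inc'],
                              bins=[-np.inf, 46000.0, 93000.0, 160000.0, np.inf],
                              labels=['low_inc', 'median_inc', 'high_inc', 'xhigh_inc'])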
#Good vs Bad Loans by Income Level
income_table = income_df.groupby(['inc_cat', 'target']).size().reset_index().pivot(
columns='target', index='inc_cat', values=0).sort_values(by=1, ascending = False)
ax = income_table.plot.bar(stacked=True ,figsize= (8, 6))
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Income Level", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Loans by Borrower's Income Level - Good vs Bad Loans", fontsize=15)
#sizes=[]
for i in range(4):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i+4].get_height()) * 100),
            ha="center", fontsize=10)
for i in range(4, 8):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i-4].get_height()) * 100),
            ha="center", fontsize=10)
plt.show()
We can see from the plot above that the higher the income level, the lower the loan default rate.
#Good vs Bad Loans by emp_length
emp_length_table = xtrain.groupby(['emp_length', 'target']).size().reset_index().pivot(
columns='target', index='emp_length', values=0).sort_values(by=1, ascending = False)
ax = emp_length_table.plot.bar(stacked=True ,figsize= (8, 6))
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("employment length", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Loans by Borrower's employment length - Good vs Bad Loans", fontsize=15)
#sizes=[]
n = len(ax.patches)
for i in range(int(n/2)):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i + int(n/2)].get_height()) * 100),
            ha="center", fontsize=10)
for i in range(int(n/2), n):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i - int(n/2)].get_height()) * 100),
            ha="center", fontsize=10)
plt.show()
# The employment length of largest two groups of borrowers
print('Percentage of borrowers ')
print('emp_length > 10+ ys: ', round((xtrain.emp_length.value_counts()/len(xtrain)).values[0],4))
print('emp_length <= 3 ys: ', round(sum((xtrain.emp_length.value_counts()/len(xtrain)).values[1:4]),4))
There is no clear pattern between loan default and employment length. Note that both people with long employment length (10+ years) and people with short employment length (<= 3 years) tend to borrow more.
# Loan grade distribution
plt.figure(figsize = (8,4))
ax = sns.countplot(x="grade", data=xtrain, order= xtrain.grade.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),horizontalalignment='right')
ax.set_xlabel("Loan Grade", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Grade Distribution", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x() + p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=10)
The majority of the loans are graded A, B, or C.
#Loan Grade between 'good' and 'bad' loans
plt.figure(figsize = (10,6))
ax = sns.countplot(x="grade",hue="target", data=xtrain, order = xtrain.grade.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(),horizontalalignment='right')
ax.set_xlabel("Loan Grade", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Grade Distribution", fontsize=15)
sizes=[]
n = len(ax.patches)
for i in range(int(n/2)):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i + int(n/2)].get_height()) * 100),
            ha="center", fontsize=10)
for i in range(int(n/2), n):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i - int(n/2)].get_height()) * 100),
            ha="center", fontsize=10)
plt.show()
From the plot above, note that the better the loan grade, the lower the bad-loan ratio.
#Homeownership and Loan Status
plt.figure(figsize = (10,6))
ax = sns.countplot(x="home_ownership",hue="target", data=xtrain)
ax.set_xticklabels(ax.get_xticklabels(),horizontalalignment='right')
ax.set_xlabel("home_ownership", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Home ownership by loan status", fontsize=15)
sizes=[]
n = len(ax.patches)
for i in range(int(n/2)):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i + int(n/2)].get_height()) * 100),
            ha="center", fontsize=10)
for i in range(int(n/2), n):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height,
            '{:1.1f}%'.format(height/(height + ax.patches[i - int(n/2)].get_height()) * 100),
            ha="center", fontsize=10)
plt.show()
# Check distribution of the debt-to-income ratio (dti)
f, axes = plt.subplots(2, 2, figsize=(10, 6), sharey=True)
plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.5, wspace=0.4)
# Histogram 1 - debt-to-income ratio
sns.distplot(xtrain['dti'], hist=True, kde=True,
bins=int(180/5),
kde_kws={'linewidth': 2},ax = axes[0, 0])
axes[0,0].set_xlabel('debt-to-income ratio')
axes[0,0].set_title('Distribution of debt-to-income ratio')
# Histogram 2 - debt-to-income ratio between bad and good loans
sns.distplot(xtrain.loc[xtrain['target']==0,'dti'], hist=True, kde=True,
bins=int(180/5), label='0',
kde_kws={'linewidth': 2},ax = axes[0,1 ])
sns.distplot(xtrain.loc[xtrain['target']==1,'dti'], hist=True, kde=True,
bins=int(180/5), label='1',
kde_kws={'linewidth': 2},ax = axes[0, 1])
axes[0,1].set_xlabel('debt-to-income')
axes[0,1].set_title('debt-to-income by loan status')
# Histogram 3 - debt to income ratio filtered for less-than-100-percent
sns.distplot(xtrain.loc[xtrain['dti']<100,"dti"], hist=True, kde=True,
bins=int(180/5),
kde_kws={'linewidth': 2},ax = axes[1,0])
axes[1,0].set_xlabel('debt-to-income ratio')
axes[1,0].set_title('Debt-to-income ratio (filtered <100%)')
#Histogram 4 - debt-to-income ratio between good and bad loans filtered for less-than-100-percent
sns.distplot(xtrain.loc[(xtrain['dti']<100) & (xtrain['target']==0),'dti'],
hist=True, kde=True,
bins=int(180/5), label='0',
kde_kws={'linewidth': 2},ax = axes[1,1])
sns.distplot(xtrain.loc[(xtrain['dti']<100) & (xtrain['target']==1),'dti'],
hist=True, kde=True,
bins=int(180/5), label='1',
kde_kws={'linewidth': 2},ax = axes[1,1])
axes[1,1].set_xlabel('debt-to-income')
axes[1,1].set_title('debt-to-income by loan status(filtered <100%)')
plt.legend()
plt.show()
print("Skewness: %f" % xtrain['dti'].skew())
print("Kurtosis: %f" % xtrain['dti'].kurt())
print("Debt-to-income ratio")
print("Good loans", xtrain.loc[xtrain['target']==0,'dti'].median())
print("Bad loans", xtrain.loc[xtrain['target']==1,'dti'].median())
Since the debt-to-income ratio is highly right-skewed, I filtered for debt-to-income ratios below 100% to get a closer look at the majority of the data. Within that range, the distribution is close to normal between 0 and 40%. On average, the debt-to-income ratio is higher for bad loans than for good loans.
print('Debt-to-income(dti) Quantiles:')
print("Low dti (below 25%): dti <", np.quantile(np.array(xtrain['dti'].dropna()), 0.25) )
print("Median dti (between 25% and 75%):", np.quantile(np.array(xtrain['dti'].dropna()), 0.25),
"<= dti <", np.quantile(np.array(xtrain['dti'].dropna()), 0.75))
print("High dti (above 75% below 95%): ", np.quantile(np.array(xtrain['dti'].dropna()), 0.75),
"<= dti <", np.quantile(np.array(xtrain['dti'].dropna()), 0.95))
print("Extreme high dti (Above 95%): dti >",
np.quantile(np.array(xtrain['dti'].dropna()), 0.95))
# Bin borrowers into debt-to-income levels based on the quantiles above
dti_df = xtrain[['dti', 'target']].copy()
dti_df.loc[dti_df['dti'] < 11.89, 'dti_cat'] = 'low'
dti_df.loc[(dti_df['dti'] >= 11.89)
           & (dti_df['dti'] < 24.48), 'dti_cat'] = 'median'
dti_df.loc[(dti_df['dti'] >= 24.48)
           & (dti_df['dti'] < 33.88), 'dti_cat'] = 'high'
dti_df.loc[dti_df['dti'] >= 33.88, 'dti_cat'] = 'xhigh'
#Good vs Bad Loans by Debt-to-Income Level
dti_table = dti_df.groupby(['dti_cat', 'target']).size().reset_index().pivot(
columns='target', index='dti_cat', values=0).sort_values(by=1, ascending = False)
ax = dti_table.plot.bar(stacked=True ,figsize= (8, 6))
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Debt-to-Income Level", fontsize=12)
ax.set_ylabel("Count Loans", fontsize=12)
ax.set_title("Borrower's Debt-to-Income Level by Loan status", fontsize=15)
#sizes=[]
for i in range(4):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i+4].get_height()) * 100),
            ha="center", fontsize=10)
for i in range(4, 8):
    p = ax.patches[i]
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x + width/2.,
            y + height/2,
            '{:1.2f}%'.format(height/(height + ax.patches[i-4].get_height()) * 100),
            ha="center", fontsize=10)
plt.show()
Dividing loans into 4 groups based on borrowers' debt-to-income ratio:
For the high and extreme-high debt-to-income groups, the share of loans in bad condition (label = 1) is the highest, above 16%. Loans whose borrowers have a median debt-to-income ratio turn bad about 13% of the time, while borrowers with a low debt-to-income ratio are the least likely to end up with a bad loan (10.6%).
In short, the plot above shows that the higher the debt-to-income ratio, the higher the chance that a loan ends up in bad condition.
# Loan purpose distribution
plt.figure(figsize = (10,6))
ax = sns.countplot(x="purpose", data=xtrain, order = xtrain.purpose.value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
ax.set_xlabel("Loan Purpose", fontsize=10)
ax.set_ylabel("Count", fontsize=10)
ax.set_title("Loan Purpose Distribution", fontsize=15)
sizes=[]
for p in ax.patches:
    height = p.get_height()
    sizes.append(height)
    ax.text(p.get_x() + p.get_width()/2.,
            height + 10,
            '{:1.2f}%'.format(height/len(xtrain)*100),
            ha="center", fontsize=10)
Note that about 56% of the loans are used for debt_consolidation, 22.87% for credit_card, and 6.66% for home_improvement.
To detect collinearity among variables, we can simply create a correlation matrix and look for variable pairs with large absolute correlations. Multicollinearity is much harder to detect this way because it can involve more than two variables. For now, we focus on identifying pairwise collinearity within the data; later, we will apply the Variance Inflation Factor (VIF) method to detect multicollinearity.
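As a preview of that later step, here is a minimal VIF sketch (illustrative only; the actual VIF analysis is deferred until missing values are handled, and this assumes statsmodels is installed):
# Sketch: Variance Inflation Factor for a set of numeric columns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(df, cols):
    # Use complete cases only and add a constant so VIFs are computed properly
    X = add_constant(df[cols].dropna())
    vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
    return pd.Series(vifs, index=cols).sort_values(ascending=False)

# Example usage on a hypothetical subset of numeric features:
# vif_table(xtrain, ['loan_amnt', 'int_rate', 'dti', 'annual_inc'])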
# Separate numeric variables and categorical variables
cols = xtrain.columns.tolist()
num_cols = xtrain._get_numeric_data().columns.tolist()
cat_cols = [x for x in cols if x not in num_cols]
# Correlation analysis with pearson method
xtrain_pcorr_matrix = xtrain[num_cols].corr(method="pearson")
with open('xtrain_pcorr.pickle', 'wb') as f:
    pickle.dump(xtrain_pcorr_matrix, f)
# with open('xtrain_pcorr.pickle', 'rb') as f:
# xtrain_pcorr_matrix = pickle.load(f)
# High correlation with target variables using pearson method
pcorr_target = xtrain_pcorr_matrix["target"].abs().sort_values(kind="quicksort", ascending=False)
pcorr_target.head(12)
# Unstack the correlation matrix into variable pairs
xtrain_pcorr_sorted = xtrain_pcorr_matrix.abs().unstack().sort_values(kind="quicksort",
ascending=False)
# Table of variable pairs and correlation coefficients - pearson method
pcorr_df = pd.DataFrame([xtrain_pcorr_sorted[xtrain_pcorr_sorted != 1.0].drop_duplicates()]).T.reset_index()
# Sorted for highest correlated variable pairs
pcorr_df.head(10)
# Create two lists, one with correlation higher than 0.7 , one with correlation higher than 0.9
# Compare the two variables in each high-correlation pairs, and choose the variable with the higher average correlations with all other variables within the dataset.
pcorr9_cols = set()
pcorr7_cols = set()
for i in range(len(pcorr_df)):
    if pcorr_df[0][i] >= 0.7:
        if xtrain_pcorr_matrix[pcorr_df.level_0[i]].abs().mean() > xtrain_pcorr_matrix[pcorr_df.level_1[i]].abs().mean():
            pcorr7_cols.add(pcorr_df.level_0[i])
        else:
            pcorr7_cols.add(pcorr_df.level_1[i])
    if pcorr_df[0][i] >= 0.9:
        if xtrain_pcorr_matrix[pcorr_df.level_0[i]].abs().mean() > xtrain_pcorr_matrix[pcorr_df.level_1[i]].abs().mean():
            pcorr9_cols.add(pcorr_df.level_0[i])
        else:
            pcorr9_cols.add(pcorr_df.level_1[i])
# variables with corr >= 0.7
pcorr7_cols
# variables with corr >=0.9
pcorr9_cols
# Correlation analysis with spearman method
xtrain_spcorr_matrix = xtrain[num_cols].corr(method="spearman")
xtrain_spcorr_sorted = xtrain_spcorr_matrix.abs().unstack().sort_values(kind="quicksort",
ascending=False)
# save the correlation matrix
with open('xtrain_spearcorr.pickle', 'wb') as f:
    pickle.dump([xtrain_spcorr_matrix, xtrain_spcorr_sorted], f)
# with open('xtrain_spearcorr.pickle', 'rb') as f:
#     xtrain_spcorr_matrix, xtrain_spcorr_sorted = pickle.load(f)
# High correlation with target variables using spearman method
spcorr_target = xtrain_spcorr_matrix["target"].abs().sort_values(kind="quicksort",
ascending=False)
spcorr_target.head(10)
# Table of variable pairs and correlation coefficients - spearman method
spcorr_df = pd.DataFrame([xtrain_spcorr_sorted[xtrain_spcorr_sorted != 1.0].drop_duplicates()]).T.reset_index()
# Sorted for highest correlated variable pairs
spcorr_df.head(10)
# Compare the two variables in each high-correlation pairs, and choose the variable with the higher average correlations with all other variables within the dataset.
spcorr7_cols = set()
for i in range(len(spcorr_df)):
    if spcorr_df[0][i] >= 0.7:
        if xtrain_spcorr_matrix[spcorr_df.level_0[i]].abs().mean() > xtrain_spcorr_matrix[spcorr_df.level_1[i]].abs().mean():
            spcorr7_cols.add(spcorr_df.level_0[i])
        else:
            spcorr7_cols.add(spcorr_df.level_1[i])
spcorr9_cols = set()
for i in range(len(spcorr_df)):
    if spcorr_df[0][i] >= 0.9:
        if xtrain_spcorr_matrix[spcorr_df.level_0[i]].abs().mean() > xtrain_spcorr_matrix[spcorr_df.level_1[i]].abs().mean():
            spcorr9_cols.add(spcorr_df.level_0[i])
        else:
            spcorr9_cols.add(spcorr_df.level_1[i])
# variables with correlation coefficients >0.9 (spearman method)
spcorr9_cols
# Combine the results of both pearson and spearman method (corr > 0.7)
corr7_vars = list(set(spcorr7_cols.union(pcorr7_cols)))
corr7_vars
# List of variables with correlation greater than 0.9(combining pearson and spearman method )
corr9_vars = list(set(spcorr9_cols.union(pcorr9_cols)))
corr9_vars
Correlation analysis gives us a good start in understanding how closely variables are related to each other. Data redundancy exists when a dataset contains variables that are highly correlated with other variables. Using the results of the correlation analysis for dimensionality reduction not only removes this redundancy and speeds up processing, but also reduces model complexity and helps prevent overfitting. Since we are also going to investigate multicollinearity with the Variance Inflation Factor (VIF) method, I will perform the dimensionality reduction later by combining the results of the correlation analysis and the VIF analysis.
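For illustration only, dropping the variables flagged at the 0.9 threshold would look like this (the actual reduction is deferred to the preprocessing notebook, where the VIF results are folded in):
# Sketch: drop the highly correlated (>= 0.9) variables identified above
cols_to_drop = [c for c in corr9_vars if c != 'target']
xtrain_reduced = xtrain.drop(columns=cols_to_drop)
print(xtrain.shape, '->', xtrain_reduced.shape)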
Pearson and Spearman correlations only measure association between numeric variables. For pairs of categorical variables, we can test independence with a chi-squared test of independence. For a categorical variable vs a numeric variable, we can perform a one-way ANOVA test, which compares the between-group variance to the within-group variance.
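For reference, here are minimal sketches of both tests (illustrative only; these checks are carried out in a later notebook), using home_ownership and int_rate as example features:
# Chi-squared test of independence: categorical feature vs target
from scipy.stats import chi2_contingency, f_oneway

contingency = pd.crosstab(xtrain['home_ownership'], xtrain['target'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print('home_ownership vs target: chi2 = %.1f, p = %.4f' % (chi2, p_value))

# One-way ANOVA: numeric feature compared across the two loan-status groups
groups = [g['int_rate'].dropna() for _, g in xtrain.groupby('target')]
f_stat, p_value = f_oneway(*groups)
print('int_rate by target: F = %.1f, p = %.4f' % (f_stat, p_value))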