Interesting findings:
a. Identify the most important attributes separating bad loans from good loans
b. Build an XGBoost model to make predictions

Data exploration (part 1):
a. Data first impressions
   - available predicting variables
   - data distribution
   - missing values
b. Split the dataset into training and test sets
c. Correlation analysis and removal of multicollinearity
d. Data visualization to explore the relationship between the target and the predicting variables

Data preprocessing and feature engineering (part 2):
a. Handle missing values
b. Transform character or categorical variables into numeric
c. Create new features from existing features

Modeling and evaluation (part 3, this notebook):
- Standard scaling
- Handling dataset imbalance issues
  * upsampling the minority group
  * downsampling the majority group
- Models:
  1) Logistic regression
  2) Random Forest model
  3) XGBoost model
- Hyperparameter tuning

Feature selection:
- Remove variables according to correlation analysis
- Logistic regression with L1 regularization (keep features with non-zero coefficients)
- Random Forest built-in feature importance
The project is split into three notebooks. This notebook is part three, focusing on model building and evaluation.
import os
import re
import math
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.offsetbox import AnchoredText
import seaborn as sns

from sklearn import metrics
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV, LassoLarsCV, ElasticNet, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import (train_test_split, cross_val_score, StratifiedKFold,
                                     GridSearchCV, RandomizedSearchCV)

import xgboost as xgb
from xgboost import XGBClassifier

from imblearn.over_sampling import RandomOverSampler, SMOTE, SMOTENC
from imblearn.under_sampling import RandomUnderSampler

matplotlib.rcParams.update({'font.size': 10})
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
# Load the train and test sets created in part 1
xtrain = pd.read_csv("xtrain_cleaned.csv")
xtest = pd.read_csv("xtest_cleaned.csv")
ytrain = pd.read_csv("ytrain.csv", header=None)
ytest = pd.read_csv("ytest.csv", header=None)
# Column 0 is the saved index; keep only the label column
ytrain = ytrain[1]
ytest = ytest[1]
# Drop the extra index column written by to_csv
xtrain = xtrain.drop("Unnamed: 0", axis=1)
xtest = xtest.drop("Unnamed: 0", axis=1)
xtrain.head(5)
# Collect all columns
all_cols = xtrain.columns.tolist()
# Compute skewness for every column
skew_dict = dict()
skew_dict["columns"] = []
skew_dict["skewness"] = []
for col in all_cols:
    skew_dict["columns"].append(col)
    skew_dict["skewness"].append(xtrain[col].skew())
skew_df = pd.DataFrame(skew_dict)
skew_df
# Select features whose skewness is greater than 1, then use the dtype to keep
# only the non-dummy (float) columns
skew_list = skew_df.loc[skew_df.skewness > 1, "columns"].values.tolist()
skew_dtype = pd.DataFrame({"skewed columns": skew_list,
                           "dtype": [xtrain[col].dtype for col in skew_list]})
to_tran = skew_dtype.loc[skew_dtype.dtype == "float64", "skewed columns"].tolist()
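The skewed float columns collected in to_tran are not transformed in this run; a minimal sketch of the log transform they were selected for (an assumption here: these columns are non-negative, as loan amounts and counts usually are) would be:
# Hypothetical log transform of the right-skewed float columns before scaling.
# np.log1p handles zeros; assumes no negative values in these columns.
xtrain_log = xtrain.copy()
xtest_log = xtest.copy()
xtrain_log[to_tran] = np.log1p(xtrain_log[to_tran])
xtest_log[to_tran] = np.log1p(xtest_log[to_tran])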
# Create the Scaler object
scaler = StandardScaler()
# Fit the scaler on the training data and transform it
sca_xtrain = scaler.fit_transform(xtrain)
sca_xtrain = pd.DataFrame(sca_xtrain, columns=all_cols)
# Apply the scaler to transform the test set
sca_xtest = scaler.transform(xtest)
# add back column names
sca_xtest = pd.DataFrame(sca_xtest, columns=all_cols)
# Create training set and validation set
x_tr, x_val, y_tr, y_val = train_test_split(sca_xtrain, ytrain, test_size=0.2, random_state=1, stratify=ytrain)
x_tr.head(5)
# Train Logistic Regression
logreg = LogisticRegression(random_state = 0)
logreg.fit(x_tr, y_tr)
# Run prediction with the validation dataset
logreg_yp = logreg.predict(x_val)
#Use the .predict_proba() and the .predict() methods to get predicted probabilities as well as predicted classes.
logreg_yproba = logreg.predict_proba(x_val)[:,1]
# Make confusion matrix
logreg_matrix = metrics.confusion_matrix(y_val, logreg_yp )
logreg_matrix
%matplotlib inline
sns.heatmap(logreg_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
Although the ROC-AUC score of the first logistic regression model on the validation set looks decent (about 73%), the recall for class 1 is only 4% versus 99% for class 0, which means the model is very poor at identifying bad-condition loans. Such a large gap in recall between classes is most likely due to dataset imbalance (class 1 accounts for only 13% of the observations).
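Before resampling, it is worth confirming the class ratio directly; scikit-learn's class_weight='balanced' option is also a lighter-weight alternative to resampling (a quick sketch, not the approach used below):
# Class ratio in the training split (class 1 = bad loans)
print(y_tr.value_counts(normalize=True))
# Alternative (sketch only): reweight classes inside the loss instead of resampling
logreg_bal = LogisticRegression(class_weight='balanced', random_state=0)
logreg_bal.fit(x_tr, y_tr)
print(classification_report(y_val, logreg_bal.predict(x_val)))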
# Calculate the classification report.
print(classification_report(y_val, logreg_yp))
# Calculate the AUC score
print(roc_auc_score(y_val,logreg_yproba ))
print("Accuracy:", round(metrics.accuracy_score(y_val, logreg_yp),4))
print("Precision:",round(metrics.precision_score(y_val, logreg_yp),4))
print("Recall/TPR:",round(metrics.recall_score(y_val, logreg_yp),4))
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, logreg_yproba)
auc = metrics.roc_auc_score(y_val, logreg_yproba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
log_l2_model = LogisticRegression(penalty ="l2", random_state = 0, solver ="saga")
log_l2_model.fit(x_tr, y_tr)
with open('log_l2_model.pickle', 'wb') as f:
pickle.dump(log_l2_model, f)
# with open('log_l2_model.pickle', 'rb') as f:
# log_l2_model = pickle.load(f)
# thetaLasso=log_l2_model.coef_[0]
# cols_lasso = thetaLasso.round(0) != 0
# cols_lasso = cols_lasso.tolist()
# # Select features with Logistic regression with L1 penalty
# filtered_list = x_tr.columns[cols_lasso].tolist()
# filtered_list
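The commented-out block above keeps features whose coefficients are non-zero; the outline's L1-regularization variant would actually drive uninformative coefficients to exactly zero. A minimal sketch (an assumption: the saga solver, which supports the L1 penalty; not run for the models below):
# Hypothetical L1 run for feature selection: features with zero coefficients are dropped.
log_l1_model = LogisticRegression(penalty='l1', solver='saga', random_state=0, max_iter=1000)
log_l1_model.fit(x_tr, y_tr)
l1_selected = x_tr.columns[log_l1_model.coef_[0] != 0].tolist()
print(len(l1_selected), "features kept by the L1 penalty")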
# Run prediction with the validation dataset
log_l2_pred = log_l2_model.predict(x_val)
# Make confusion matrix
log_l2_matrix = metrics.confusion_matrix(y_val, log_l2_pred)
log_l2_matrix
%matplotlib inline
sns.heatmap(log_l2_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
# Calculate the classification report.
print(classification_report(y_val, log_l2_pred))
print("Accuracy:", round(metrics.accuracy_score(y_val, log_l2_pred),4))
print("Precision:",round(metrics.precision_score(y_val, log_l2_pred),4))
print("Recall/TPR:",round(metrics.recall_score(y_val, log_l2_pred),4))
# Get the probability score for observations.
log_l2_proba = log_l2_model.predict_proba(x_val)[::,1]
# Calculate the AUC score
print(roc_auc_score(y_val,log_l2_proba))
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, log_l2_proba)
auc = metrics.roc_auc_score(y_val, log_l2_proba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
Again, the recall of class 1 for the second model is very low, at 0.04, meaning it correctly identifies only 4% of all bad loans in the data set. Neither of the first two models performs well on the validation set, so let's address the dataset imbalance next.
Here I use two resampling methods: random over-sampling of the minority class and random under-sampling of the majority class.
# Random over-samples the minority group to handle the imbalanced dataset
ros = RandomOverSampler(sampling_strategy =0.5,random_state=42)
x_res, y_res = ros.fit_resample(x_tr, y_tr)
# Check the proportion of bad loans in the resampled dataset
sum(y_res==1)/len(y_res)
logreg_2 = LogisticRegression(random_state = 0)
logreg_2.fit(x_res, y_res)
# Run prediction with the validation dataset
logreg_2_yp = logreg_2.predict(x_val)
#Use the .predict_proba() and the .predict() methods to get predicted probabilities as well as predicted classes.
logreg_2_yproba = logreg_2.predict_proba(x_val)[:,1]
# Make confusion matrix
logreg_2_matrix = metrics.confusion_matrix(y_val, logreg_2_yp )
logreg_2_matrix
%matplotlib inline
sns.heatmap(logreg_2_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
The recall score of class 1 is improved to 0.35
# Calculate the classification report.
print(classification_report(y_val, logreg_2_yp))
# Calculate the AUC score
print(roc_auc_score(y_val,logreg_2_yproba ))
print("Accuracy:", round(metrics.accuracy_score(y_val, logreg_2_yp),4))
print("Precision:",round(metrics.precision_score(y_val, logreg_2_yp),4))
print("Recall/TPR:",round(metrics.recall_score(y_val, logreg_2_yp),4))
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, logreg_2_yproba)
auc = metrics.roc_auc_score(y_val, logreg_2_yproba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
# Random under-samples the majority group to handle the imbalanced dataset
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
# fit and apply the transform
x_rus, y_rus = rus.fit_resample(x_tr, y_tr)
# # Another oversampling option is SMOTE, which generates synthetic minority samples.
# # Note: SMOTENC is the variant for mixed data and requires the categorical column indices.
# sm = SMOTENC(categorical_features=[...], random_state=42, sampling_strategy=0.5)
# x_sa, y_sa = sm.fit_resample(x_tr, y_tr)
# Check the proportion of bad loans in the under-sampled dataset
sum(y_rus == 1) / len(y_rus)
x_rus.shape
logreg_3 = LogisticRegression(random_state = 0)
logreg_3.fit(x_rus, y_rus)
# Run prediction with the training dataset
logreg_3_yp_rus = logreg_3.predict(x_rus)
# Calculate the classification report for prediction on training set
print(classification_report(y_rus, logreg_3_yp_rus))
# Run prediction with the validation dataset
logreg_3_yp = logreg_3.predict(x_val)
#Use the .predict_proba() and the .predict() methods to get predicted probabilities as well as predicted classes.
logreg_3_yproba = logreg_3.predict_proba(x_val)[:,1]
# Calculate the classification report for prediction on validation set
print(classification_report(y_val, logreg_3_yp))
# Make confusion matrix
logreg_3_matrix = metrics.confusion_matrix(y_val, logreg_3_yp )
logreg_3_matrix
%matplotlib inline
sns.heatmap(logreg_3_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
The recall score of class 1 is improved to 0.35
# Calculate the AUC score
print(roc_auc_score(y_val,logreg_3_yproba ))
print("Accuracy:", round(metrics.accuracy_score(y_val, logreg_3_yp),4))
print("Precision:",round(metrics.precision_score(y_val, logreg_3_yp),4))
print("Recall/TPR:",round(metrics.recall_score(y_val, logreg_3_yp),4))
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, logreg_3_yproba)
auc = metrics.roc_auc_score(y_val, logreg_3_yproba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
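The commented-out SMOTENC call earlier is the variant for mixed categorical/numeric data and needs the categorical column indices. Plain SMOTE treats every column as continuous, which is only a rough approximation for the scaled dummy columns here; still, as a sketch of synthetic oversampling (not used for the models below):
# Hypothetical synthetic oversampling: interpolate new minority samples instead of duplicating rows.
sm = SMOTE(sampling_strategy=0.5, random_state=42)
x_sm, y_sm = sm.fit_resample(x_tr, y_tr)
print(sum(y_sm == 1) / len(y_sm))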
Next, I over-sample the minority class all the way to a 1:1 ratio with the majority class:
# Random over-samples the minority group to be the same sample size as the majority group
ros_2 = RandomOverSampler(sampling_strategy =1,random_state=42)
x_res_2, y_res_2 = ros_2.fit_resample(x_tr, y_tr)
# Check the proportion of bad loans in the resampled dataset
sum(y_res_2==1)/len(y_res_2)
logreg_4 = LogisticRegression(random_state = 0)
logreg_4.fit(x_res_2, y_res_2)
# Run prediction with the validation dataset
logreg_4_yp = logreg_4.predict(x_val)
#Use the .predict_proba() and the .predict() methods to get predicted probabilities as well as predicted classes.
logreg_4_yproba = logreg_4.predict_proba(x_val)[:,1]
# Make confusion matrix
logreg_4_matrix = metrics.confusion_matrix(y_val, logreg_4_yp )
logreg_4_matrix
%matplotlib inline
sns.heatmap(logreg_4_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
The recall score of class 1 is improved to 0.67
# Calculate the classification report.
print(classification_report(y_val, logreg_4_yp))
# Calculate the AUC score
print(roc_auc_score(y_val,logreg_4_yproba ))
print("Accuracy:", round(metrics.accuracy_score(y_val, logreg_4_yp),4))
print("Precision:",round(metrics.precision_score(y_val, logreg_4_yp),4))
print("Recall/TPR:",round(metrics.recall_score(y_val, logreg_4_yp),4))
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, logreg_4_yproba)
auc = metrics.roc_auc_score(y_val, logreg_4_yproba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
We can use the random forest model's built-in feature importance for feature selection.
# Instantiate a RandomforestClassifier using default Gini method
rf = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the random forest to the over-sampled training set
rf.fit(x_res, y_res)
with open('rf_model.pickle', 'wb') as f:
pickle.dump(rf, f)
# with open('rf_model.pickle', 'rb') as f:
# rf = pickle.load(f)
# Run prediction with the validation dataset
rf_yp = rf.predict(x_val)
# Make confusion matrix
rf_matrix = metrics.confusion_matrix(y_val,rf_yp)
rf_matrix
%matplotlib inline
sns.heatmap(rf_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
# Calculate the classification report.
print(classification_report(y_val, rf_yp))
print("Accuracy:", round(metrics.accuracy_score(y_val, rf_yp),4))
print("Precision:",round(metrics.precision_score(y_val, rf_yp),4))
print("Recall/TPR:",round(metrics.recall_score(y_val, rf_yp),4))
# Get the probability score for observations.
rf_yproba = rf.predict_proba(x_val)[::,1]
# Calculate the AUC score
print(roc_auc_score(y_val,rf_yproba ))
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, rf_yproba)
auc = metrics.roc_auc_score(y_val, rf_yproba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
# Create a pd.Series of feature importances
importances_rf = pd.Series(rf.feature_importances_,
                           index=x_tr.columns)
# Sort importances_rf
sorted_importances_rf = importances_rf.sort_values(ascending=False)
sorted_importances_rf[:20]
print("The first 20 features account for " + str(round(sum(sorted_importances_rf[:20]) * 100, 2)) + "% of total importance")
rf_feat = sorted_importances_rf.index[:20].tolist()
print("Top 20 Features selected by random forest:")
print(rf_feat)
def plot_feature_importance(model, dataset):
    # Create arrays from the model's feature importances and the feature names
    feature_importance = np.array(model.feature_importances_)
    feature_names = np.array(dataset.columns)
    # Build a DataFrame from a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)
    # Sort the DataFrame by decreasing feature importance and keep the top 20
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
    fi_df_tp20 = fi_df[:20]
    # Define the size of the bar plot
    plt.figure(figsize=(10, 8))
    # Plot a seaborn bar chart
    sns.barplot(x=fi_df_tp20['feature_importance'], y=fi_df_tp20['feature_names'])
    # Add chart labels
    plt.title(type(model).__name__ + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
plot_feature_importance(rf, x_tr)
#data_dmatrix = xgb.DMatrix(data=x_res ,label=y_res)
xgb1 = XGBClassifier(
learning_rate =0.1,
n_estimators=1000,
max_depth=5,
min_child_weight=1,
gamma=0,
subsample=0.8,
colsample_bytree=0.8,
objective= 'binary:logistic',
scale_pos_weight=1,
eval_metric='auc',
seed=27)
# Fit the data into xgboost model-1
xgb1.fit(x_res, y_res)
# Predict on the validation set
xg_y_pred = xgb1.predict(x_val)
with open('xgb1_model.pickle', 'wb') as f:
pickle.dump(xgb1, f)
# with open('xgb1_model.pickle', 'rb') as f:
# xgb1 = pickle.load(f)
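With n_estimators fixed at 1000, a common refinement (not applied in this run) is early stopping against the validation split so the number of boosting rounds is chosen automatically. A sketch; note that where early_stopping_rounds is passed depends on the installed xgboost version:
# Hypothetical: stop adding trees once validation AUC stops improving.
# In xgboost < 2.0 early_stopping_rounds is a fit() argument; in >= 2.0 it moves to the constructor.
xgb_es = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5,
                       subsample=0.8, colsample_bytree=0.8,
                       objective='binary:logistic', eval_metric='auc', seed=27)
xgb_es.fit(x_res, y_res, eval_set=[(x_val, y_val)],
           early_stopping_rounds=50, verbose=False)
print(roc_auc_score(y_val, xgb_es.predict_proba(x_val)[:, 1]))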
xgb1_matrix = metrics.confusion_matrix(y_val, xg_y_pred)
%matplotlib inline
sns.heatmap(xgb1_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
# Calculate the classification report.
print(classification_report(y_val, xg_y_pred))
print("Accuracy (Validation set):", round(metrics.accuracy_score(y_val, xg_y_pred),4))
print("TestPrecision (Validation set):",round(metrics.precision_score(y_val, xg_y_pred),4))
print("Recall/TPR(Validation set):",round(metrics.recall_score(y_val, xg_y_pred),4))
# Get the probability score for observations.
xg1_y_pproba = xgb1.predict_proba(x_val)[::,1]
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(y_val, xg1_y_pproba)
auc = metrics.roc_auc_score(y_val, xg1_y_pproba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
# Create a pd.Series of feature importances
importances_xgb1 = pd.Series(xgb1.feature_importances_,
                             index=x_tr.columns)
# Sort importances_xgb1
sorted_importances_xgb1 = importances_xgb1.sort_values(ascending=False)
sorted_importances_xgb1[:30]
plot_feature_importance(xgb1, x_tr)
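feature_importances_ on the sklearn wrapper reports a single importance type; the underlying booster can also report gain-based importance, which often ranks features differently (a quick check, assuming xgb1 is the fitted model above):
# Gain-based importance from the raw booster.
# Keys are feature names if the model was fit on a DataFrame, otherwise f0, f1, ...
gain_importance = xgb1.get_booster().get_score(importance_type='gain')
sorted(gain_importance.items(), key=lambda kv: kv[1], reverse=True)[:20]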
# x_val = pd.DataFrame(x_val, columns=all_cols)
# x_res = pd.DataFrame(x_res, columns=all_cols)
# Parameters to tune
params_tune = {'max_depth':[3,4,6],
'gamma': [0.5, 1, 1.5],
'min_child_weight':[2,4,6]}
# Initialize the base estimator
model_estimator = xgb.XGBClassifier(learning_rate =0.01,
n_estimators=500,
objective= 'binary:logistic',
nthread=4,
scale_pos_weight=1,
seed=27)
# Build RandomSearch
clf = RandomizedSearchCV(model_estimator,
param_distributions=params_tune,
scoring="roc_auc",
n_jobs=4,
cv=3,
verbose=3,
random_state=42)
# Fit the RandomSearch Model
clf.fit(x_res, y_res)
print('\n All results:')
print(clf.cv_results_)
print('\n Best estimator:')
print(clf.best_estimator_)
print('\n Best hyperparameters:')
print(clf.best_params_)
results = pd.DataFrame(clf.cv_results_)
results.to_csv('xgb-random-search-results-01.csv', index=False)
best_estimator = clf.best_estimator_
best_estimator
# Save model
pickle.dump(clf.best_estimator_, open("xgb_rdm_best_model.pickle", "wb"))
plot_feature_importance(best_estimator, x_tr)
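GridSearchCV is imported above but only the randomized search is run; a narrow exhaustive grid around the randomized-search winner is a natural follow-up (a sketch; the values below are placeholders that would in practice bracket clf.best_params_):
# Hypothetical refinement: exhaustive search in a small neighbourhood of the best parameters.
refine_grid = {'max_depth': [3, 4, 5],
               'gamma': [1.0, 1.5, 2.0],
               'min_child_weight': [4, 5, 6]}
grid = GridSearchCV(model_estimator, param_grid=refine_grid,
                    scoring='roc_auc', cv=3, n_jobs=4, verbose=1)
grid.fit(x_res, y_res)
print(grid.best_params_, grid.best_score_)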
I used XGBClassifier, a scikit-learn wrapper for XGBoost. This lets us use scikit-learn's grid search with parallel processing, just as we would for GBM. We can also run the model through the native xgboost package, as sketched below.
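A minimal sketch with the native interface (the parameter values are illustrative placeholders mirroring the tuning grid above; the predictions below still come from best_estimator):
# Native xgboost interface: DMatrix plus xgb.cv / xgb.train.
data_dmatrix = xgb.DMatrix(data=x_res, label=y_res)
native_params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
                 'eta': 0.01, 'max_depth': 4, 'min_child_weight': 4, 'gamma': 1.5}
cv_results = xgb.cv(params=native_params, dtrain=data_dmatrix,
                    num_boost_round=500, nfold=3, stratified=True,
                    metrics='auc', seed=27, early_stopping_rounds=50)
print(cv_results['test-auc-mean'].iloc[-1])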
# Perform Prediction on the test set
ytest_pred = best_estimator.predict(sca_xtest)
xgb_2_matrix = metrics.confusion_matrix(ytest, ytest_pred)
print(xgb_2_matrix)
%matplotlib inline
sns.heatmap(xgb_2_matrix,annot=True,cmap="YlGnBu" ,fmt='g')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
class_names=[0,1] # name of classes
xlocs, xlabels = plt.xticks()
plt.xticks(xlocs,class_names)
ylocs, ylabels = plt.yticks()
plt.yticks(ylocs,class_names)
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
# Calculate the classification report.
print(classification_report(ytest, ytest_pred))
def true_negative_rate(tn, fp):
    """
    Args:
        tn: True Negatives (count)
        fp: False Positives (count)
    Returns: true_negative_rate
    """
    return round(tn / (tn + fp), 2)

def false_positive_rate(fp, tn):
    """
    Args:
        fp: False Positives (count)
        tn: True Negatives (count)
    Returns: false_positive_rate
    """
    return round(fp / (fp + tn), 2)

def false_negative_rate(fn, tp):
    """
    Args:
        fn: False Negatives (count)
        tp: True Positives (count)
    Returns: false_negative_rate
    """
    return round(fn / (fn + tp), 2)
tn, fp, fn, tp = metrics.confusion_matrix(ytest, ytest_pred).ravel()
#Sensitivity = TruePositive / (TruePositive + FalseNegative)
Sensitivity = tp/(tp + fn)
print("specificity(Test set):",round(Sensitivity,4))
# Sensitivity is also the recall
recall = Sensitivity
# Specificity = TrueNegative / (FalsePositive + TrueNegative)
Specificity = tn/(fp+tn)
print("Specificity(Test set)",round(Specificity,4))
# precision
precision = tp/(tp+fp)
print("precision(Test set)", round(precision,4 ))
f_measure = (2 * precision * recall) / (precision + recall)
print("f_measure(Test set)", round(f_measure,4 ))
print("Accuracy (Test set):", round(metrics.accuracy_score(ytest, ytest_pred),4))
print("Precision (Test set):",round(metrics.precision_score(ytest, ytest_pred),4))
print("Recall/TPR(Test set):",round(metrics.recall_score(ytest, ytest_pred),4))
# Get the probability score for observations.
y_pred_proba = best_estimator.predict_proba(sca_xtest)[::,1]
# Graph Receiver Operating Characteristic(ROC) curve and compute AUC (Area Under the Receiver Operating Characteristic Curve) from prediction scores.
fpr, tpr, _ = metrics.roc_curve(ytest, y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr,tpr,label="ROC,AUC="+str(round(auc,4)))
plt.title('ROC Curve', y=1.1)
plt.legend(loc=4)
plt.show()
average_precision = metrics.average_precision_score(ytest, y_pred_proba)
disp_prec = metrics.plot_precision_recall_curve(best_estimator, sca_xtest, ytest)
disp_prec.ax_.set_title('Precision-Recall curve' +
' : AP={0:0.2f}'.format(average_precision))
plt.show()
Finally, we use the best estimator from the randomized search of the XGBoost model to classify loan default (yes or no) on the test set. The overall accuracy is 0.81, meaning the model classifies 81% of the loans in the test set correctly. This is not bad.
However, there is more to consider, since the Lending Club loan data are imbalanced: the minority group (bad loans) accounts for only 13% of the population. Two groups of metrics are particularly useful for evaluating imbalanced classification because they let us focus on one class at a time, that is, assess how well the model predicts each class separately: sensitivity-specificity and precision-recall.
sensitivity-specificity
Sensitivity (also called recall or the true positive rate) summarizes how well the positive class is predicted: it measures the proportion of positives that are correctly identified. Specificity, on the other hand, measures the proportion of negatives that are correctly identified. In imbalanced classification (where the positive class is usually the minority class), we care more about sensitivity than specificity.
The sensitivity is 38.3% while the specificity is 87.9%. In other words, the model is good at identifying good loans but much weaker on bad loans: it identifies a good loan 88% of the time on average, while it correctly flags a defaulted loan only 38% of the time (roughly two out of three default loans are missed).
precision-recall
The precision-recall curve and average precision score plotted above focus on how well the minority (default) class is predicted. Another useful tool for evaluating imbalanced classification is the ROC (Receiver Operating Characteristic) curve and the ROC AUC (area under the ROC curve) score. A no-skill model has a ROC AUC of 0.5, whereas a perfect model scores 1.0. The ROC AUC of the XGBoost model on the test set is 0.75, which means it is a skilled model.
To improve this project, I would like to test XGBoost models trained on the features selected by Lasso regression and by the random forest, and to perform more extensive hyperparameter tuning for the XGBoost model.
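A minimal sketch of that follow-up with the random-forest selection, assuming rf_feat from above and that x_res/x_val are DataFrames with column names (illustrative only; no results are reported here):
# Hypothetical follow-up: refit the tuned XGBoost configuration on the 20 RF-selected features only.
xgb_sel = XGBClassifier(**clf.best_params_, learning_rate=0.01, n_estimators=500,
                        objective='binary:logistic', seed=27)
xgb_sel.fit(x_res[rf_feat], y_res)
print(roc_auc_score(y_val, xgb_sel.predict_proba(x_val[rf_feat])[:, 1]))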