Exercise IV: Logistic Regression¶
In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one. (Wikipedia)
In this exercise we will reproduce the credit card default example used in Chapter 4 of ISLR, as adapted from the ISLR-python repository.
Setup¶
import warnings
warnings.simplefilter("ignore")
import pandas as pd
URL = "https://github.com/JWarmenhoven/ISLR-python/raw/master/Notebooks/Data/Default.xlsx"
df = pd.read_excel(URL, index_col=0, true_values=["Yes"], false_values=["No"])
def color_booleans(value: bool) -> str:
    color = "green" if value else "red"
    return f"color: {color}"
BOOLEAN_COLUMNS = ["default", "student"]
df.sample(15).style.text_gradient(cmap="Blues").applymap(color_booleans, subset=BOOLEAN_COLUMNS)
|  | default | student | balance | income |
|---|---|---|---|---|
| 1097 | False | True | 1461.833249 | 19252.237291 | 
| 2192 | False | True | 1295.029713 | 21717.469549 | 
| 1393 | False | False | 468.412124 | 22943.125687 | 
| 7933 | False | True | 1273.488513 | 17105.867047 | 
| 7366 | False | False | 149.822550 | 33948.866290 | 
| 280 | False | False | 369.221942 | 47835.091045 | 
| 9352 | False | False | 0.000000 | 62160.286220 | 
| 5618 | False | False | 1267.001444 | 32752.193741 | 
| 5537 | False | False | 514.849025 | 37656.847872 | 
| 1100 | False | False | 1516.551152 | 39368.141873 | 
| 4184 | False | False | 202.070433 | 28814.175306 | 
| 6557 | False | False | 1135.187984 | 47075.032300 | 
| 4728 | False | True | 649.360117 | 10524.326101 | 
| 5343 | False | False | 1277.123098 | 42472.908266 | 
| 7496 | False | False | 717.667390 | 27956.273520 | 
Feature Scaling¶
import numpy as np
from sklearn.preprocessing import StandardScaler
numeric_features = df.select_dtypes(include=float)
scaler = StandardScaler()
df.loc[:, numeric_features.columns] = scaler.fit_transform(df.loc[:, numeric_features.columns])
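As a quick sanity check (a minimal sketch reusing the df and scaler objects from above), the standardized columns should now have a mean of roughly 0 and a standard deviation of roughly 1, since StandardScaler applies the usual z-score transform (x - mean) / std column-wise:
# After scaling, each numeric column should have mean ~0 and std ~1.
print(df[["balance", "income"]].mean())
print(df[["balance", "income"]].std())
# The original values remain recoverable via the fitted scaler.
original_values = scaler.inverse_transform(df[["balance", "income"]])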
Raw inspection¶
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  bool   
 1   student  10000 non-null  bool   
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: bool(2), float64(2)
memory usage: 253.9 KB
pd.set_option('float_format', '{:g}'.format)
df.describe()
|  | balance | income |
|---|---|---|
| count | 10000 | 10000 | 
| mean | -1.25056e-16 | -1.93623e-16 | 
| std | 1.00005 | 1.00005 | 
| min | -1.72708 | -2.45539 | 
| 25% | -0.731136 | -0.913058 | 
| 50% | -0.0242674 | 0.0776593 | 
| 75% | 0.684184 | 0.771653 | 
| max | 3.76056 | 3.0022 | 
Scatter plot¶
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(15, 12))
_ = sns.scatterplot(x="balance",
                    y="income",
                    hue="default",
                    style="student",
                    size="default",
                    sizes={
                        True: 100,
                        False: 40
                    },
                    alpha=0.6,
                    ax=ax,
                    data=df)
Violin plot¶
# Create a new figure with two horizontal subplots 
fig, ax = plt.subplots(ncols=2, figsize=(15, 6))
# Plot balance
sns.violinplot(x="student", y="balance", hue="default", split=True, legend=False, ax=ax[0], data=df)
ax[0].get_legend().remove()
ax[0].set_xlabel('')
# Plot income
sns.violinplot(x="student", y="income", hue="default", split=True, ax=ax[1], data=df)
ax[1].set_xlabel('')
# Add common label
_ = fig.text(0.5, 0.05, "student", ha='center')
Train/Test Split¶
from sklearn.model_selection import train_test_split 
FEATURE_NAMES = ["balance", "income", "student"]
TARGET_NAME = "default"
X = df[FEATURE_NAMES]
y = df[TARGET_NAME].values
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=0,
                                                    test_size=0.2)
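Note that defaults are rare (roughly 3% of the observations), so a plain random split can leave the training and test sets with slightly different default rates. As an optional variant, not used in the rest of the exercise, the split can be stratified on the target:
# Optional: stratify=y preserves the default/no-default proportions
# in both sets. Shown for comparison only.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=0, test_size=0.2, stratify=y)
print(y_train_s.mean(), y_test_s.mean())  # near-identical default rates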
Model Creation¶
sklearn¶
from sklearn.linear_model import LogisticRegression
# Unpenalized (plain maximum-likelihood) logistic regression, to match statsmodels below.
sk_model = LogisticRegression(random_state=0, penalty=None, solver="newton-cg")
_ = sk_model.fit(X_train, y_train)
statsmodels¶
import statsmodels.api as sm
# statsmodels requires boolean values to be converted to integers.
df["student"] = df["student"].astype(int)
df["default"] = df["default"].astype(int)
# R-style model formulation.
sm_model = sm.Logit.from_formula('default ~ balance + income + student', data=df)
Model Application¶
sklearn¶
We can obtain the probability estimate for each target class (in our case True or False) using the LogisticRegression class’s predict_proba() method:
default_probability = sk_model.predict_proba(X_test)
Or, we could directly return the predictions based on the maximal probabilities:
predictions = sk_model.predict(X_test)
# Manually returning the index of the maximal value
predictions_manual = default_probability.argmax(axis=1)
np.array_equal(predictions, predictions_manual)
True
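One detail worth keeping in mind (a small sketch based on the objects above): the columns of predict_proba follow the order of sk_model.classes_, which for this boolean target is [False, True], so column 1 holds the probability of default. Thresholding that column at 0.5 reproduces predict():
# Column order follows sk_model.classes_, i.e. [False, True] here,
# so default_probability[:, 1] is P(default=True).
print(sk_model.classes_)
threshold_predictions = default_probability[:, 1] >= 0.5
np.array_equal(predictions, threshold_predictions)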
statsmodels¶
sm_estimation = sm_model.fit()
Optimization terminated successfully.
         Current function value: 0.078577
         Iterations 10
Model Evaluation¶
sklearn¶
print(f"Intercept: {sk_model.intercept_}")
print(f"Coefficients: {sk_model.coef_}")
Intercept: [-5.93187127]
Coefficients: [[ 2.72019068 -0.02263331 -0.61746372]]
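These numbers are on the log-odds scale. To connect them back to the probabilities used earlier (a minimal sketch, assuming the X_test defined above), the predicted probability of default is the logistic function applied to the linear combination of the scaled features:
# P(default=True) = 1 / (1 + exp(-(intercept + X @ coef)));
# this should match sklearn's predict_proba on the test set.
log_odds = sk_model.intercept_ + X_test.to_numpy(dtype=float) @ sk_model.coef_.ravel()
manual_probabilities = 1 / (1 + np.exp(-log_odds))
np.allclose(manual_probabilities, sk_model.predict_proba(X_test)[:, 1])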
Confusion Matrix¶
Calculation¶
from sklearn.metrics import confusion_matrix
confusion_matrix_ = confusion_matrix(y_test, predictions)
print(confusion_matrix_)
[[1920    6]
 [  51   23]]
Visualization¶
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay.from_estimator(sk_model,
                                             X_test,
                                             y_test,
                                             cmap=plt.cm.Greens,
                                             normalize="true")
_ = disp.ax_.set_title("Confusion Matrix")
Classification Report¶
from sklearn.metrics import classification_report
report = classification_report(y_test, predictions)
print(report)
              precision    recall  f1-score   support
       False       0.97      1.00      0.99      1926
        True       0.79      0.31      0.45        74
    accuracy                           0.97      2000
   macro avg       0.88      0.65      0.72      2000
weighted avg       0.97      0.97      0.97      2000
Precision¶
true_positive = confusion_matrix_[1, 1]
false_positive = confusion_matrix_[0, 1]
true_positive / (true_positive + false_positive)
0.7931034482758621
Recall (Sensitivity)¶
false_negative = confusion_matrix_[1, 0]
true_positive / (true_positive + false_negative)
0.3108108108108108
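The F1-score reported above for the True class (0.45) is simply the harmonic mean of these two quantities; a quick check from the values already computed:
# F1 = harmonic mean of precision and recall for the positive (True) class.
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
2 * precision * recall / (precision + recall)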
Specificity¶
true_negative = confusion_matrix_[0, 0]
true_negative / (true_negative + false_positive)
0.9968847352024922
Accuracy Score¶
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)
0.9715
or:
true_predictions = confusion_matrix_[0, 0] + confusion_matrix_[1, 1]
true_predictions / len(X_test)
0.9715
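Given how imbalanced the classes are, accuracy on its own is not very informative: a trivial classifier that always predicts "no default" would already be correct for 1926 of the 2000 test observations. A quick comparison:
# Baseline accuracy of always predicting the majority class (no default).
# The fitted model only improves on this by catching some true defaults,
# which is why recall for the True class is the more telling metric here.
1 - y_test.mean()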
statsmodels¶
Confusion Matrix¶
prediction_table = sm_estimation.pred_table()
prediction_table
array([[9627.,   40.],
       [ 228.,  105.]])
row_sums = prediction_table.sum(axis=1, keepdims=True)
prediction_table / row_sums
array([[0.99586221, 0.00413779],
       [0.68468468, 0.31531532]])
Regression Report¶
sm_estimation.summary()
| Dep. Variable: | default | No. Observations: | 10000 | 
|---|---|---|---|
| Model: | Logit | Df Residuals: | 9996 | 
| Method: | MLE | Df Model: | 3 | 
| Date: | Wed, 15 Dec 2021 | Pseudo R-squ.: | 0.4619 | 
| Time: | 07:44:26 | Log-Likelihood: | -785.77 | 
| converged: | True | LL-Null: | -1460.3 | 
| Covariance Type: | nonrobust | LLR p-value: | 3.257e-292 | 
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -5.9752 | 0.194 | -30.849 | 0.000 | -6.355 | -5.596 | 
| balance | 2.7747 | 0.112 | 24.737 | 0.000 | 2.555 | 2.995 | 
| income | 0.0405 | 0.109 | 0.370 | 0.712 | -0.174 | 0.255 | 
| student | -0.6468 | 0.236 | -2.738 | 0.006 | -1.110 | -0.184 | 
Possibly complete quasi-separation: A fraction 0.15 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
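Since the coefficients are on the log-odds scale, a common follow-up (a small sketch using the fitted sm_estimation object) is to exponentiate them so they can be read as odds ratios; e.g. a one-standard-deviation increase in balance multiplies the odds of default by roughly e^2.77 ≈ 16:
# Exponentiated coefficients = multiplicative change in the odds of default
# per one standard deviation of each (scaled) predictor.
np.exp(sm_estimation.params)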