Exercise III: Linear Regression

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables) (Wikipedia).

Using linear regression with Python is as easy as running:

>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression()
>>> model.fit(X_train, y_train)
>>> predictions = model.predict(X_test)

Diabetes

First, let’s reproduce scikit-learn’s Linear Regression Example using the prepackaged diabetes dataset.

Loading the Dataset

import pandas as pd

from sklearn import datasets

# Read the dataset as a pandas DataFrame
dataset = datasets.load_diabetes(as_frame=True)

# Create observations matrix and target vector
X, y = dataset.data, dataset.target

# Create a unified DataFrame containing both
data = pd.concat([X, y], axis=1)

Basic Exploration

Dataset Description

print(dataset.DESCR)
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
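
The note above about feature scaling can be checked directly: every feature column should be mean centered and have a sum of squares of one. A quick sanity check (not part of the original example):

import numpy as np

# Each feature column is mean centered and scaled so that
# its sum of squares equals 1, as stated in the description.
print(np.allclose(X.mean(), 0))        # columns are mean centered
print(np.allclose((X ** 2).sum(), 1))  # sum of squares of each column is 1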

Raw Inspection

import numpy as np
import pandas as pd

# Select some sample indices
sample_indices = np.linspace(0, len(X) - 4, 4, dtype=int)
sample_indices = [index for i in sample_indices for index in range(i, i + 4)]

# Print data table (features and target)
data.iloc[sample_indices, :].style.set_properties(**{
    "text-align": "center",
}).set_properties(**{
    "border-left": "4px solid black"
}, subset=['target']).set_table_styles([
    dict(selector="th", props=[("font-size", "13px")]),
    dict(selector="td", props=[("font-size", "11px")]),
]).background_gradient().format(precision=3)
  age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038 0.051 0.062 0.022 -0.044 -0.035 -0.043 -0.003 0.020 -0.018 151.000
1 -0.002 -0.045 -0.051 -0.026 -0.008 -0.019 0.074 -0.039 -0.068 -0.092 75.000
2 0.085 0.051 0.044 -0.006 -0.046 -0.034 -0.032 -0.003 0.003 -0.026 141.000
3 -0.089 -0.045 -0.012 -0.037 0.012 0.025 -0.036 0.034 0.023 -0.009 206.000
146 -0.031 0.051 0.060 0.001 0.012 0.032 -0.043 0.034 0.015 0.007 178.000
147 -0.056 -0.045 0.093 -0.019 0.015 0.023 -0.029 0.025 0.026 0.040 128.000
148 -0.060 0.051 0.015 -0.019 0.037 0.048 0.019 -0.003 -0.031 -0.001 96.000
149 -0.049 0.051 -0.005 -0.047 -0.021 -0.020 -0.069 0.071 0.061 -0.038 126.000
292 0.009 -0.045 -0.022 -0.032 -0.050 -0.069 0.078 -0.071 -0.063 -0.038 84.000
293 -0.071 -0.045 0.093 0.013 0.020 0.043 0.001 0.000 -0.055 -0.001 200.000
294 0.024 0.051 -0.031 -0.006 -0.017 0.018 -0.032 -0.003 -0.074 -0.034 55.000
295 -0.053 0.051 0.039 -0.040 -0.006 -0.013 0.012 -0.039 0.016 0.003 85.000
438 -0.006 0.051 -0.016 -0.068 0.049 0.079 -0.029 0.034 -0.018 0.044 104.000
439 0.042 0.051 -0.016 0.017 -0.037 -0.014 -0.025 -0.011 -0.047 0.015 132.000
440 -0.045 -0.045 0.039 0.001 0.016 0.015 -0.029 0.027 0.045 -0.026 220.000
441 -0.045 -0.045 -0.073 -0.081 0.084 0.028 0.174 -0.039 -0.004 0.003 57.000
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB
pd.set_option('float_format', '{:g}'.format)
data.describe()
age sex bmi bp s1 s2 s3 s4 s5 s6 target
count 442 442 442 442 442 442 442 442 442 442 442
mean -3.63962e-16 1.30991e-16 -8.01395e-16 1.28982e-16 -9.04254e-17 1.30112e-16 -4.56397e-16 3.86317e-16 -3.8481e-16 -3.39849e-16 152.133
std 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 77.093
min -0.107226 -0.0446416 -0.0902753 -0.1124 -0.126781 -0.115613 -0.102307 -0.0763945 -0.126097 -0.137767 25
25% -0.0372993 -0.0446416 -0.0342291 -0.0366564 -0.0342478 -0.0303584 -0.0351172 -0.0394934 -0.0332488 -0.033179 87
50% 0.00538306 -0.0446416 -0.00728377 -0.00567061 -0.00432087 -0.00381907 -0.00658447 -0.00259226 -0.00194763 -0.0010777 140.5
75% 0.0380759 0.0506801 0.031248 0.0356438 0.028358 0.0298444 0.0293115 0.0343089 0.0324332 0.0279171 211.5
max 0.110727 0.0506801 0.170555 0.132044 0.153914 0.198788 0.181179 0.185234 0.133599 0.135612 346

Scatter Matrix

import matplotlib.pyplot as plt

# Extract numeric data
numeric_data = data.drop(["sex"], axis=1)

# Plot scatter matrix
axes = pd.plotting.scatter_matrix(numeric_data, figsize=(15, 15))

# Fix y-axis label formatting
new_labels = [
    round(float(i.get_text()), 2) for i in axes[0, 0].get_yticklabels()
]
_ = axes[0, 0].set_yticklabels(new_labels)
[Figure: scatter matrix of the numeric features and the target]

Feature Correlation

We can use seaborn’s heatmap function to inspect feature correlations.

import seaborn as sns

# Calculate correlation matrix using NumPy
correlation_matrix = np.corrcoef(data.values.T)

# Plot correlation matrix using seaborn
fig, ax = plt.subplots(figsize=(8, 8))
tick_labels = list(X.columns) + ['diabetes']
hm = sns.heatmap(
    correlation_matrix,
    ax=ax,
    cbar=True,  # Show colorbar
    cmap="vlag",  # Specify colormap
    vmin=-1,  # Min. value for colormapping
    vmax=1,  # Max. value for colormapping
    annot=True,  # Show the value of each cell
    square=True,  # Square aspect ratio in cell sizing
    fmt='.2f',  # Float formatting
    annot_kws={'size':
               12},  # Font size of the values displayed within the cells
    xticklabels=tick_labels,  # x-axis labels
    yticklabels=tick_labels)  # y-axis labels
plt.tight_layout()
plt.show()
[Figure: correlation matrix heatmap of the features and the target]

Simple Linear Regression

Simple linear regression is a linear regression model with a single explanatory variable. bmi seems to show a discernible linear relationship with the target variable, so let’s go with that one.
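
In this setting the fitted model has the form \( y = \beta_0 + \beta_1 x + \varepsilon \), where \( x \) is the standardized BMI score, \( \beta_0 \) the intercept, \( \beta_1 \) the slope, and \( \varepsilon \) the error term.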

Train/Test Split

from sklearn.model_selection import train_test_split

# Create a vector of the single predictor values
simple_X = X.bmi.to_numpy().reshape(len(X), 1)

# Split for simple linear regression
simple_X_train, simple_X_test, y_train, y_test = train_test_split(simple_X,
                                                                  y,
                                                                  random_state=0,
                                                                  test_size=0.2)

We now have a training dataset consisting of 353 observations and a test dataset with 89 observations.
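
These counts can be confirmed directly from the split created above:

# 442 observations split 80/20 into training and test sets
print(simple_X_train.shape)  # (353, 1)
print(simple_X_test.shape)   # (89, 1)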

Model Creation

from sklearn.linear_model import LinearRegression

model = LinearRegression()
_ = model.fit(simple_X_train, y_train)

Model Application

simple_y_pred = model.predict(simple_X_test)

Model Evaluation

First, we’ll plot our predicted values alongside the observed values, as well as the regression line estimated by our model.

# Create figure
fig, ax = plt.subplots(figsize=(15, 7))

# Plot real values scatter plot
_ = plt.scatter(simple_X_test, y_test, color="black", label="Real Values")

# Plot predicted values scatter plot
_ = plt.scatter(simple_X_test,
                simple_y_pred,
                color="red",
                label="Predicted Values")

# Plot regression line
_ = plt.plot(simple_X_test,
             simple_y_pred,
             color="blue",
             label="Regression Line")

# Show legend
_ = plt.legend()

# Set title
title = "Disease Progression by BMI"
plt.title(title)

# Set axis labels
ax.set_xlabel("Standardized BMI score")
_ = ax.set_ylabel("Disease Progression")
[Figure: Disease Progression by BMI, showing real values, predicted values, and the regression line]

We can also use the sklearn.metrics module to calculate the mean squared error (MSE, here on the test set) and the coefficient of determination (\(R^2\), here on the training set).

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, simple_y_pred)
r_squared = r2_score(y_train, model.predict(simple_X_train))
print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f'Mean Squared Error: {mse:.2f}')
print(f'Coefficient of Determination: {r_squared:.2f}')
Coefficient: 981.66
Intercept: 152.29
Mean Squared Error: 4150.68
Coefficient of Determination: 0.38

Multiple Linear Regression

General notation:

\( y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i \)

Calculating the coefficient vector of the least-squares hyperplane:

\( \hat{\beta} = (X^\top X)^{-1} X^\top y \)

Manual Implementation

While it is always best to use the standard tools Python offers, let’s be educationally paranoid for a minute and create our own implementation of ordinary least squares (OLS) multiple linear regression.

A simple implementation might look something like:

import numpy as np
import pandas as pd


def append_intercept(X: pd.DataFrame) -> np.ndarray:
    """
    Append a column of ones to the beginning of the dataset.

    Parameters
    ----------
    X : pd.DataFrame
        Observed features data

    Returns
    -------
    np.ndarray
        Design matrix
    """
    intercept = np.ones((len(X), 1))
    return np.hstack([intercept, X])


def estimate_coefficients(X: pd.DataFrame, y: np.ndarray) -> np.ndarray:
    """
    Returns the estimated beta hat vector.

    Parameters
    ----------
    X : pd.DataFrame
        Observed features data
    y : np.ndarray
        Target vector

    Returns
    -------
    np.ndarray
        Estimated coefficients
    """
    X = append_intercept(X)
    return np.linalg.inv(X.T @ X) @ (X.T @ y)


def calculate_predictions(X: pd.DataFrame, beta_hat: np.ndarray) -> np.ndarray:
    """
    Returns the predicted values for a test set of observations (X) using a
    vector of the estimated coefficients.

    Parameters
    ----------
    X : pd.DataFrame
        Test features data
    beta_hat : np.ndarray
        Estimated coefficients vector

    Returns
    -------
    np.ndarray
        Predicted values
    """
    X = append_intercept(X)
    return X @ beta_hat
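
As an aside (not part of the original exercise), explicitly inverting \( X^\top X \) can be numerically unstable when the design matrix is ill conditioned; a least-squares solver is usually preferred. A minimal sketch of the same estimate using np.linalg.lstsq (the helper name is ours):

def estimate_coefficients_lstsq(X: pd.DataFrame, y: np.ndarray) -> np.ndarray:
    """Estimate beta hat with a least-squares solver instead of an explicit inverse."""
    X = append_intercept(X)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat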

Another version, which better conforms with standard implementations in machine learning models, might look something like:

class ManualLinearRegression:
    """
    Manual implementation of an Ordinary Least Squares (OLS) linear regression
    model.
    """

    def __init__(self) -> None:
        """
        Initializes a new instance of the ManualLinearRegression model.
        """
        self.beta_hat = None

    def append_intercept(self, X: pd.DataFrame) -> np.ndarray:
        """
        Append a column of ones to the beginning of the dataset.

        Parameters
        ----------
        X : pd.DataFrame
            Observed features data

        Returns
        -------
        np.ndarray
            Design matrix
        """
        intercept = np.ones((len(X), 1))
        return np.hstack([intercept, X])

    def estimate_coefficients(
        self, X: pd.DataFrame, y: np.ndarray
    ) -> np.ndarray:
        """
        Returns the estimated beta hat vector.

        Parameters
        ----------
        X : pd.DataFrame
            Observed features data
        y : np.ndarray
            Target vector

        Returns
        -------
        np.ndarray
            Estimated coefficients
        """
        X = self.append_intercept(X)
        return np.linalg.inv(X.T @ X) @ (X.T @ y)

    def fit(self, X: pd.DataFrame, y: np.ndarray) -> None:
        """
        Calculate the estimated coefficients vector (beta hat).

        Parameters
        ----------
        X : pd.DataFrame
            Observed features data
        y : np.ndarray
            Target vector
        """
        self.beta_hat = self.estimate_coefficients(X, y)

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """
        Returns the predicted values for a test set of observations (X) using
        the coefficient vector estimated by fit().

        Parameters
        ----------
        X : pd.DataFrame
            Test features data

        Returns
        -------
        np.ndarray
            Predicted values
        """
        if self.beta_hat is not None:
            X = self.append_intercept(X)
            return X @ self.beta_hat
        raise RuntimeError("Please fit the model before calling predict().")

Model Application

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=0,
                                                    test_size=0.2)
beta_hat = estimate_coefficients(X_train, y_train)
y_pred_manual = calculate_predictions(X_test, beta_hat)

or:

manual_model = ManualLinearRegression()
manual_model.fit(X_train, y_train)
y_pred_manual = manual_model.predict(X_test)

Model Evaluation

\(R^2\)

Calculate the residual sum of squares (\(RSS\)):

General notation:

\( RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

Matrix notation:

\( RSS = (y - X\hat{\beta})^\top (y - X\hat{\beta}) \)

ss_residual = np.sum((y_train - manual_model.predict(X_train))**2)
ss_residual
965359.4331583097

Calculate the total sum of squares (\(TSS\)):

\( TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 \)

ss_total = np.sum((y_train - y_train.mean())**2)
ss_total
2164136.266288952

Calculate \(R^2\):

\( R^2 = 1 - \frac{RSS}{TSS} \)

r2_score_manual = 1 - ss_residual / ss_total
r2_score_manual
0.5539285357415583

Compare with sklearn’s built-in r2_score function:

r2_score(y_true=y_train, y_pred=manual_model.predict(X_train))
0.5539285357415583

Mean Squared Error
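
The quantity computed below is \( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{RSS}{n} \).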

Manual \(MSE\) calculation:

mse_manual = np.mean((y_train - manual_model.predict(X_train))**2)
mse_manual
2734.7292724031436

Or using sklearn:

mean_squared_error(y_train, manual_model.predict(X_train))
2734.7292724031436

Multiple Linear Regression using sklearn

Model Creation

multiple_model = LinearRegression()
_ = multiple_model.fit(X_train, y_train)

Model Application

y_pred_all = multiple_model.predict(X_test)

Model Evaluation

# Evaluate on the training set, to compare with the manual implementation above
mse_all = mean_squared_error(y_train, multiple_model.predict(X_train))
r2_score_all = r2_score(y_train, multiple_model.predict(X_train))
betas_all_repr = [beta.round(2) for beta in multiple_model.coef_.flatten()]

print(f"Coefficient: {betas_all_repr}")
print(f'Mean Squared Error: {mse_all:.2f}')
Coefficients: [-35.56, -243.17, 562.75, 305.47, -662.79, 324.28, 24.78, 170.33, 731.68, 43.03]
Mean Squared Error: 2734.73
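
Since sklearn’s LinearRegression solves the same ordinary least squares problem on the same training data, its fit should agree with the manual estimate. A quick sanity check (a sketch; beta_hat is the vector estimated by the manual implementation above, with the intercept as its first element):

# Compare the sklearn fit with the manual OLS estimate
print(np.allclose(beta_hat[0], multiple_model.intercept_))  # intercepts agree
print(np.allclose(beta_hat[1:], multiple_model.coef_))      # slope coefficients agree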