Exercise III: Linear Regression

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables) (Wikipedia).

Using linear regression with Python is as easy as running:

>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression()
>>> model.fit(X_train, y_train)
>>> predictions = model.predict(X_test)

Diabetes

First, let’s reproduce scikit-learn’s Linear Regression Example using the prepackaged diabetes dataset.

Loading the Dataset

import pandas as pd

from sklearn import datasets

# Read the dataset as a pandas DataFrame
dataset = datasets.load_diabetes(as_frame=True)

# Create observations matrix and target vector
X, y = dataset.data, dataset.target

# Create a unified DataFrame containing both
data = pd.concat([X, y], axis=1)

Basic Exploration

Dataset Description

print(dataset.DESCR)
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
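
The note above about feature scaling can be checked directly: every feature column should be mean centered and have a sum of squares of one. A quick sanity check (not part of the original example):

import numpy as np

# Each feature column is mean centered and scaled so that
# its sum of squares equals 1, as stated in the description.
print(np.allclose(X.mean(), 0))        # columns are mean centered
print(np.allclose((X ** 2).sum(), 1))  # sum of squares of each column is 1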

Raw Inspection

import numpy as np
import pandas as pd

# Select some sample indices
sample_indices = np.linspace(0, len(X) - 4, 4, dtype=int)
sample_indices = [index for i in sample_indices for index in range(i, i + 4)]

# Print data table (features and target)
data.iloc[sample_indices, :].style.set_properties(**{
    "text-align": "center",
}).set_properties(**{
    "border-left": "4px solid black"
}, subset=['target']).set_table_styles([
    dict(selector="th", props=[("font-size", "13px")]),
    dict(selector="td", props=[("font-size", "11px")]),
]).background_gradient().format(precision=3)
  age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038 0.051 0.062 0.022 -0.044 -0.035 -0.043 -0.003 0.020 -0.018 151.000
1 -0.002 -0.045 -0.051 -0.026 -0.008 -0.019 0.074 -0.039 -0.068 -0.092 75.000
2 0.085 0.051 0.044 -0.006 -0.046 -0.034 -0.032 -0.003 0.003 -0.026 141.000
3 -0.089 -0.045 -0.012 -0.037 0.012 0.025 -0.036 0.034 0.023 -0.009 206.000
146 -0.031 0.051 0.060 0.001 0.012 0.032 -0.043 0.034 0.015 0.007 178.000
147 -0.056 -0.045 0.093 -0.019 0.015 0.023 -0.029 0.025 0.026 0.040 128.000
148 -0.060 0.051 0.015 -0.019 0.037 0.048 0.019 -0.003 -0.031 -0.001 96.000
149 -0.049 0.051 -0.005 -0.047 -0.021 -0.020 -0.069 0.071 0.061 -0.038 126.000
292 0.009 -0.045 -0.022 -0.032 -0.050 -0.069 0.078 -0.071 -0.063 -0.038 84.000
293 -0.071 -0.045 0.093 0.013 0.020 0.043 0.001 0.000 -0.055 -0.001 200.000
294 0.024 0.051 -0.031 -0.006 -0.017 0.018 -0.032 -0.003 -0.074 -0.034 55.000
295 -0.053 0.051 0.039 -0.040 -0.006 -0.013 0.012 -0.039 0.016 0.003 85.000
438 -0.006 0.051 -0.016 -0.068 0.049 0.079 -0.029 0.034 -0.018 0.044 104.000
439 0.042 0.051 -0.016 0.017 -0.037 -0.014 -0.025 -0.011 -0.047 0.015 132.000
440 -0.045 -0.045 0.039 0.001 0.016 0.015 -0.029 0.027 0.045 -0.026 220.000
441 -0.045 -0.045 -0.073 -0.081 0.084 0.028 0.174 -0.039 -0.004 0.003 57.000
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB
pd.set_option('float_format', '{:g}'.format)
data.describe()
age sex bmi bp s1 s2 s3 s4 s5 s6 target
count 442 442 442 442 442 442 442 442 442 442 442
mean -3.63962e-16 1.30991e-16 -8.01395e-16 1.28982e-16 -9.04254e-17 1.30112e-16 -4.56397e-16 3.86317e-16 -3.8481e-16 -3.39849e-16 152.133
std 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 0.047619 77.093
min -0.107226 -0.0446416 -0.0902753 -0.1124 -0.126781 -0.115613 -0.102307 -0.0763945 -0.126097 -0.137767 25
25% -0.0372993 -0.0446416 -0.0342291 -0.0366564 -0.0342478 -0.0303584 -0.0351172 -0.0394934 -0.0332488 -0.033179 87
50% 0.00538306 -0.0446416 -0.00728377 -0.00567061 -0.00432087 -0.00381907 -0.00658447 -0.00259226 -0.00194763 -0.0010777 140.5
75% 0.0380759 0.0506801 0.031248 0.0356438 0.028358 0.0298444 0.0293115 0.0343089 0.0324332 0.0279171 211.5
max 0.110727 0.0506801 0.170555 0.132044 0.153914 0.198788 0.181179 0.185234 0.133599 0.135612 346

Scatter Matrix

import matplotlib.pyplot as plt

# Extract numeric data
numeric_data = data.drop(["sex"], axis=1)

# Plot scatter matrix
axes = pd.plotting.scatter_matrix(numeric_data, figsize=(15, 15))

# Fix y-axis label formatting
new_labels = [
    round(float(i.get_text()), 2) for i in axes[0, 0].get_yticklabels()
]
_ = axes[0, 0].set_yticklabels(new_labels)
[Figure: scatter matrix of the numeric features and the target]

Feature Correlation

We can use seaborn’s heatmap function to inspect feature correlations.

import seaborn as sns

# Calculate correlation matrix using NumPy
correlation_matrix = np.corrcoef(data.values.T)

# Plot correlation matrix using seaborn
fig, ax = plt.subplots(figsize=(8, 8))
tick_labels = list(X.columns) + ['diabetes']
hm = sns.heatmap(
    correlation_matrix,
    ax=ax,
    cbar=True,  # Show colorbar
    cmap="vlag",  # Specify colormap
    vmin=-1,  # Min. value for colormapping
    vmax=1,  # Max. value for colormapping
    annot=True,  # Show the value of each cell
    square=True,  # Square aspect ratio in cell sizing
    fmt='.2f',  # Float formatting
    annot_kws={'size':
               12},  # Font size of the values displayed within the cells
    xticklabels=tick_labels,  # x-axis labels
    yticklabels=tick_labels)  # y-axis labels
plt.tight_layout()
plt.show()
[Figure: correlation matrix heatmap of the features and the target]

Simple Linear Regression

Simple linear regression is a linear regression model with a single explanatory variable. bmi seems to show a discernible linear relationship with the target variable, so let’s go with that one.
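
In this setting the fitted model has the form \( y = \beta_0 + \beta_1 x + \varepsilon \), where \( x \) is the standardized BMI score, \( \beta_0 \) the intercept, \( \beta_1 \) the slope, and \( \varepsilon \) the error term.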

Train/Test Split

from sklearn.model_selection import train_test_split

# Create a vector of the single predictor values
simple_X = X.bmi.to_numpy().reshape(len(X), 1)

# Split for simple linear regression
simple_X_train, simple_X_test, y_train, y_test = train_test_split(simple_X,
                                                                  y,
                                                                  random_state=0,
                                                                  test_size=0.2)

We now have a training dataset consisting of 353 observations and a test dataset with 89 observations.
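
These counts can be confirmed directly from the split created above:

# 442 observations split 80/20 into training and test sets
print(simple_X_train.shape)  # (353, 1)
print(simple_X_test.shape)   # (89, 1)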

Model Creation

from sklearn.linear_model import LinearRegression

model = LinearRegression()
_ = model.fit(simple_X_train, y_train)

Model Application

simple_y_pred = model.predict(simple_X_test)

Model Evaluation

First, we’ll plot our predicted values alongside the observed values, as well as the regression line estimated by our model.

# Create figure
fig, ax = plt.subplots(figsize=(15, 7))

# Plot real values scatter plot
_ = plt.scatter(simple_X_test, y_test, color="black", label="Real Values")

# Plot predicted values scatter plot
_ = plt.scatter(simple_X_test,
                simple_y_pred,
                color="red",
                label="Predicted Values")

# Plot regression line
_ = plt.plot(simple_X_test,
             simple_y_pred,
             color="blue",
             label="Regression Line")

# Show legend
_ = plt.legend()

# Set title
title = "Disease Progression by BMI"
plt.title(title)

# Set axis labels
ax.set_xlabel("Standardized BMI score")
_ = ax.set_ylabel("Disease Progression")
[Figure: Disease Progression by BMI, showing real values, predicted values, and the regression line]

We can also use the sklearn.metrics module to calculate the mean squared error (MSE, here on the test set) and the coefficient of determination (\(R^2\), here on the training set).

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, simple_y_pred)
r_squared = r2_score(y_train, model.predict(simple_X_train))
print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f'Mean Squared Error: {mse:.2f}')
print(f'Coefficient of Determination: {r_squared:.2f}')
Coefficient: 981.66
Intercept: 152.29
Mean Squared Error: 4150.68
Coefficient of Determination: 0.38

Multiple Linear Regression

General notation:

\( y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i \)

Calculating the coefficient vector of the least-squares hyperplane:

\( \hat{\beta} = (X^\top X)^{-1} X^\top y \)

Manual Implementation

While it is always best to use the standard tools Python offers, let’s be educationally paranoid for a minute and create our own implementation of ordinary least squares (OLS) multiple linear regression.

A simple implementation might look something like:

import numpy as np
import pandas as pd


def append_intercept(X: pd.DataFrame) -> np.ndarray:
    """
    Append a column of ones to the beginning of the dataset.

    Parameters
    ----------
    X : pd.DataFrame
        Observed features data

    Returns
    -------
    np.ndarray
        Design matrix
    """
    intercept = np.ones((len(X), 1))
    return np.hstack([intercept, X])


def estimate_coefficients(X: pd.DataFrame, y: np.ndarray) -> np.ndarray:
    """
    Returns the estimated beta hat vector.

    Parameters
    ----------
    X : pd.DataFrame
        Observed features data
    y : np.ndarray
        Target vector

    Returns
    -------
    np.ndarray
        Estimated coefficients
    """
    X = append_intercept(X)
    return np.linalg.inv(X.T @ X) @ (X.T @ y)


def calculate_predictions(X: pd.DataFrame, beta_hat: np.ndarray) -> np.ndarray:
    """
    Returns the predicted values for a test set of observations (X) using a
    vector of the estimated coefficients.

    Parameters
    ----------
    X : pd.DataFrame
        Test features data
    beta_hat : np.ndarray
        Estimated coefficients vector

    Returns
    -------
    np.ndarray
        Predicted values
    """
    X = append_intercept(X)
    return X @ beta_hat
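
As an aside (not part of the original exercise), explicitly inverting \( X^\top X \) can be numerically unstable when the design matrix is ill conditioned; a least-squares solver is usually preferred. A minimal sketch of the same estimate using np.linalg.lstsq (the helper name is ours):

def estimate_coefficients_lstsq(X: pd.DataFrame, y: np.ndarray) -> np.ndarray:
    """Estimate beta hat with a least-squares solver instead of an explicit inverse."""
    X = append_intercept(X)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat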

Another version, which better conforms with standard implementations in machine learning models, might look something like:

class ManualLinearRegression:
    """
    Manual implementation of an Ordinary Least Squares (OLS) linear regression
    model.
    """

    def __init__(self) -> None:
        """
        Initializes a new instance of the ManualLinearRegression model.
        """
        self.beta_hat = None

    def append_intercept(self, X: pd.DataFrame) -> np.ndarray:
        """
        Append a column of ones to the beginning of the dataset.

        Parameters
        ----------
        X : pd.DataFrame
            Observed features data

        Returns
        -------
        np.ndarray
            Design matrix
        """
        intercept = np.ones((len(X), 1))
        return np.hstack([intercept, X])

    def estimate_coefficients(
        self, X: pd.DataFrame, y: np.ndarray
    ) -> np.ndarray:
        """
        Returns the estimated beta hat vector.

        Parameters
        ----------
        X : pd.DataFrame
            Observed features data
        y : np.ndarray
            Target vector

        Returns
        -------
        np.ndarray
            Estimated coefficients
        """
        X = self.append_intercept(X)
        return np.linalg.inv(X.T @ X) @ (X.T @ y)

    def fit(self, X: pd.DataFrame, y: np.ndarray) -> None:
        """
        Calculate the estimated coefficients vector (beta hat).

        Parameters
        ----------
        X : pd.DataFrame
            Observed features data
        y : np.ndarray
            Target vector
        """
        self.beta_hat = self.estimate_coefficients(X, y)

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        """
        Returns the predicted values for a test set of observations (X) using
        the coefficient vector estimated by fit().

        Parameters
        ----------
        X : pd.DataFrame
            Test features data

        Returns
        -------
        np.ndarray
            Predicted values
        """
        if self.beta_hat is not None:
            X = self.append_intercept(X)
            return X @ self.beta_hat
        raise RuntimeError("Please fit the model before calling predict().")

Model Application

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=0,
                                                    test_size=0.2)
beta_hat = estimate_coefficients(X_train, y_train)
y_pred_manual = calculate_predictions(X_test, beta_hat)

or:

manual_model = ManualLinearRegression()
manual_model.fit(X_train, y_train)
y_pred_manual = manual_model.predict(X_test)

Model Evaluation

\(R^2\)

Calculate the residual sum of squares (\(RSS\)):

General notation:

\( RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

Matrix notation:

\( RSS = (y - X\hat{\beta})^\top (y - X\hat{\beta}) \)

ss_residual = np.sum((y_train - manual_model.predict(X_train))**2)
ss_residual
965359.4331583097

Calculate the total sum of squares (\(TSS\)):

\( TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 \)

ss_total = np.sum((y_train - y_train.mean())**2)
ss_total
2164136.266288952

Calculate \(R^2\):

\( R^2 = 1 - \frac{RSS}{TSS} \)

r2_score_manual = 1 - ss_residual / ss_total
r2_score_manual
0.5539285357415583

Compare with sklearn’s built-in r2_score function:

r2_score(y_true=y_train, y_pred=manual_model.predict(X_train))
0.5539285357415583

Mean Squared Error
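
The quantity computed below is \( MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{RSS}{n} \).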

Manual \(MSE\) calculation:

mse_manual = np.mean((y_train - manual_model.predict(X_train))**2)
mse_manual
2734.7292724031436

Or using sklearn:

mean_squared_error(y_train, manual_model.predict(X_train))
2734.7292724031436

Multiple Linear Regression using sklearn

Model Creation

multiple_model = LinearRegression()
_ = multiple_model.fit(X_train, y_train)

Model Application

y_pred_all = multiple_model.predict(X_test)

Model Evaluation

# Evaluate on the training set, to compare with the manual implementation above
mse_all = mean_squared_error(y_train, multiple_model.predict(X_train))
r2_score_all = r2_score(y_train, multiple_model.predict(X_train))
betas_all_repr = [beta.round(2) for beta in multiple_model.coef_.flatten()]

print(f"Coefficient: {betas_all_repr}")
print(f'Mean Squared Error: {mse_all:.2f}')
Coefficients: [-35.56, -243.17, 562.75, 305.47, -662.79, 324.28, 24.78, 170.33, 731.68, 43.03]
Mean Squared Error: 2734.73
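
Since sklearn’s LinearRegression solves the same ordinary least squares problem on the same training data, its fit should agree with the manual estimate. A quick sanity check (a sketch; beta_hat is the vector estimated by the manual implementation above, with the intercept as its first element):

# Compare the sklearn fit with the manual OLS estimate
print(np.allclose(beta_hat[0], multiple_model.intercept_))  # intercepts agree
print(np.allclose(beta_hat[1:], multiple_model.coef_))      # slope coefficients agree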