Exercise V: Regularization

This exercise is an adaptation of Lab 2 from ISLR, in which the salaries of baseball players are predicted on the basis of various statistics associated with their performance in the previous year.

Setup

import pandas as pd

URL = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Hitters.csv"
data = pd.read_csv(URL, index_col=0)

General Inspection and Preprocessing

data
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
-Andy Allanson 293 66 1 30 29 14 1 293 66 1 30 29 14 A E 446 33 20 NaN A
-Alan Ashby 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
-Alvin Davis 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
-Andre Dawson 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
-Andres Galarraga 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
-Willie McGee 497 127 7 65 48 37 5 2703 806 32 379 311 138 N E 325 9 3 700.0 N
-Willie Randolph 492 136 5 76 50 94 12 5511 1511 39 897 451 875 A E 313 381 20 875.0 A
-Wayne Tolleson 475 126 3 61 43 52 6 1700 433 7 217 93 146 A W 37 113 7 385.0 A
-Willie Upshaw 573 144 9 85 60 78 8 3198 857 97 470 420 332 A E 1314 131 12 960.0 A
-Willie Wilson 631 170 9 77 44 31 11 4908 1457 30 775 357 249 A W 408 4 3 1000.0 A

322 rows × 20 columns

Well, it looks like a dataset.

Let’s probe a little deeper:

data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 322 entries, -Andy Allanson to -Willie Wilson
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      322 non-null    int64  
 1   Hits       322 non-null    int64  
 2   HmRun      322 non-null    int64  
 3   Runs       322 non-null    int64  
 4   RBI        322 non-null    int64  
 5   Walks      322 non-null    int64  
 6   Years      322 non-null    int64  
 7   CAtBat     322 non-null    int64  
 8   CHits      322 non-null    int64  
 9   CHmRun     322 non-null    int64  
 10  CRuns      322 non-null    int64  
 11  CRBI       322 non-null    int64  
 12  CWalks     322 non-null    int64  
 13  League     322 non-null    object 
 14  Division   322 non-null    object 
 15  PutOuts    322 non-null    int64  
 16  Assists    322 non-null    int64  
 17  Errors     322 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  322 non-null    object 
dtypes: float64(1), int64(16), object(3)
memory usage: 52.8+ KB

Calling info() reveals that most columns (16 of 20) contain integers; the exceptions are Salary, which is a float, and League, Division, and NewLeague, which are object (categorical). Even more importantly, it shows that Salary contains missing observations. Because Salary is our target variable, we have no use for these observations and are better off simply dropping them from our dataset:

data.dropna(subset=["Salary"], inplace=True)

Done.
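
As a quick sanity check (the expected counts follow from the summaries above and below), we can confirm that no missing salaries remain and see how many observations were kept:

data["Salary"].isna().sum()  # expected: 0
len(data)                    # 263 of the original 322 rows remain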

Next, we might give describe() a quick look:

pd.set_option('float_format', '{:.4g}'.format)
data.describe()
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors Salary
count 263 263 263 263 263 263 263 263 263 263 263 263 263 263 263 263 263
mean 403.6 107.8 11.62 54.75 51.49 41.11 7.312 2658 722.2 69.24 361.2 330.4 260.3 290.7 118.8 8.593 535.9
std 147.3 45.13 8.757 25.54 25.88 21.72 4.794 2287 648.2 82.2 331.2 323.4 264.1 279.9 145.1 6.607 451.1
min 19 1 0 0 0 0 1 19 4 0 2 3 1 0 0 0 67.5
25% 282.5 71.5 5 33.5 30 23 4 842.5 212 15 105.5 95 71 113.5 8 3 190
50% 413 103 9 52 47 37 6 1931 516 40 250 230 174 224 45 7 425
75% 526 141.5 18 73 71 57 10 3890 1054 92.5 497.5 424.5 328.5 322.5 192 13 750
max 687 238 40 130 121 105 24 1.405e+04 4256 548 2165 1659 1566 1377 492 32 2460

Scaling

It seems like there’s fairly large variability in scale between the different features, and in any case regularization requires some form of standardization.

First, let’s differentiate the features matrix (X) from the target vector (y):

target_name = "Salary"

X = data.drop(columns=[target_name])
y = data[target_name].copy()

Only numeric columns should be standardized, so let’s also be sure to distinguish those from the categoricals:

numeric_features = X.select_dtypes([int, float])

And we’re all set:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X.loc[:, numeric_features.columns] = scaler.fit_transform(
    X.loc[:, numeric_features.columns])
X
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors NewLeague
-Alan Ashby -0.6029 -0.5957 -0.5286 -1.206 -0.5221 -0.09753 1.398 0.3468 0.1744 -0.00292 -0.1217 0.259 0.4353 N W 1.221 -0.5232 0.2134 N
-Alvin Davis 0.5125 0.4923 0.73 0.4415 0.7941 1.609 -0.9012 -0.4529 -0.4099 -0.07605 -0.4151 -0.1996 0.01037 A W 2.109 -0.2539 0.82 A
-Andre Dawson 0.6282 0.7365 0.9588 0.4023 1.026 -0.1898 0.7709 1.302 1.318 1.899 1.412 1.573 0.3557 N E -0.3247 -0.7442 -0.8482 N
-Andres Galarraga -0.5621 -0.4625 -0.1853 -0.6177 -0.3672 -0.5127 -1.11 -0.9909 -0.9602 -0.6977 -0.9475 -0.8812 -0.8623 N E 1.841 -0.5439 -0.6966 N
-Alfredo Griffin 1.295 1.358 -0.8718 0.7553 -0.01884 -0.2821 0.7709 0.767 0.635 -0.6124 0.4228 0.01729 -0.2514 A W -0.03118 2.087 2.488 A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
-Willie McGee 0.635 0.4257 -0.5286 0.4023 -0.135 -0.1898 -0.4832 0.01992 0.1295 -0.4539 0.05378 -0.06016 -0.4639 N E 0.1227 -0.758 -0.8482 N
-Willie Randolph 0.601 0.6255 -0.7574 0.8338 -0.05755 2.44 0.9799 1.25 1.219 -0.3686 1.621 0.3736 2.332 A E 0.07977 1.811 1.73 A
-Wayne Tolleson 0.4853 0.4034 -0.9862 0.2454 -0.3285 0.5022 -0.2742 -0.4196 -0.447 -0.7586 -0.4363 -0.7356 -0.4336 A W -0.9081 -0.03978 -0.2416 A
-Willie Upshaw 1.152 0.8031 -0.2997 1.187 0.3295 1.702 0.1438 0.2368 0.2084 0.3384 0.3291 0.2776 0.2722 A E 3.662 0.08452 0.5167 A
-Willie Wilson 1.546 1.38 -0.2997 0.873 -0.2898 -0.4666 0.7709 0.9861 1.136 -0.4783 1.252 0.08236 -0.04275 A W 0.4198 -0.7925 -0.8482 A

263 rows × 19 columns

X.describe()
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors
count 263 263 263 263 263 263 263 263 263 263 263 263 263 263 263 263
mean 1.013e-17 5.741e-17 3.377e-17 -5.066e-17 1.216e-16 1.689e-18 -5.403e-17 6.079e-17 6.754e-17 5.403e-17 3.377e-17 4.053e-17 1.081e-16 7.43e-17 0 1.047e-16
std 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002
min -2.616 -2.372 -1.329 -2.148 -1.993 -1.897 -1.319 -1.156 -1.11 -0.844 -1.087 -1.014 -0.9837 -1.04 -0.8201 -1.303
25% -0.8239 -0.8066 -0.7574 -0.8334 -0.8317 -0.8356 -0.6922 -0.7953 -0.7886 -0.6611 -0.7736 -0.7294 -0.7181 -0.6343 -0.7649 -0.8482
50% 0.06364 -0.1072 -0.2997 -0.1077 -0.1737 -0.1898 -0.2742 -0.3183 -0.3187 -0.3564 -0.3365 -0.3111 -0.3273 -0.2388 -0.5094 -0.2416
75% 0.8322 0.7476 0.73 0.7161 0.7554 0.7329 0.5619 0.5402 0.5129 0.2835 0.4123 0.2915 0.2589 0.1138 0.5058 0.6683
max 1.927 2.89 3.247 2.952 2.691 2.947 3.488 4.993 5.462 5.836 5.457 4.116 4.954 3.888 2.578 3.55

Warning

Try to consider the way in which standardization was applied. Can you think of anything that could pose a problem in the following steps?
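
One issue worth keeping in mind (a hint rather than a full answer) is that the scaler was fit on the entire dataset before any train/test splitting, so information from observations we will later use for evaluation has leaked into the preprocessing step. A leakage-free alternative, sketched below assuming we start from the raw, unscaled features (the step names and the use of a Pipeline here are illustrative, not what this exercise does), would fit the scaler on the training folds only during cross-validation:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scaling and encoding happen inside the pipeline, so during cross-validation
# they are fit on the training folds only (a sketch, not the approach used here)
numeric_cols = X.select_dtypes([int, float]).columns
categorical_cols = X.select_dtypes(object).columns
preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),
    ("encode", OneHotEncoder(drop="first"), categorical_cols),
])
leakage_free_model = Pipeline([("preprocess", preprocessor),
                               ("model", LassoCV(max_iter=10000))])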

Pairwise Scatter-matrix

_ = pd.plotting.scatter_matrix(data, figsize=(15, 15))
[Figure: pairwise scatter matrix of all features in the dataset]

Dummy Encoding

To view the different values each categorical variable holds:

for feature in X.select_dtypes(object):
    print(feature.ljust(10), X[feature].unique())
League     ['N' 'A']
Division   ['W' 'E']
NewLeague  ['N' 'A']

And to “dummy encode” them is as simple as:

X = pd.get_dummies(X, drop_first=True)
X.sample(20).select_dtypes(include=["uint8"]).style.background_gradient()
  League_N Division_W NewLeague_N
-Ken Landreaux 1 1 1
-Ron Hassey 0 0 0
-Shawon Dunston 1 0 1
-Chris Bando 0 0 0
-Don Mattingly 0 0 0
-Ken Phelps 0 1 0
-Gary Carter 1 0 1
-Al Newman 1 0 0
-Rick Dempsey 0 0 0
-Wally Joyner 0 1 0
-Kevin Bass 1 1 1
-Leon Durham 1 0 1
-Jesse Barfield 0 0 0
-Brian Downing 0 1 0
-Tom Brunansky 0 1 0
-Terry Kennedy 1 1 0
-Jody Davis 1 0 1
-Dwight Evans 0 0 0
-Argenis Salazar 0 1 0
-Ted Simmons 1 1 1
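
Setting drop_first=True keeps a single indicator column per binary categorical (for example League_N rather than both League_A and League_N), avoiding a redundant, perfectly collinear column. To confirm which columns were added:

X.columns[-3:]  # League_N, Division_W, NewLeague_N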

Model Creation

In the previous linear regression exercise we used sklearn’s LinearRegression class along with the train_test_split() function to model the relationship between the features and the target variable and to evaluate its performance in predicting the target variable for an out-of-sample collection of observations. In this exercise, we will use LassoCV and RidgeCV to introduce the \(\ell_1\) and \(\ell_2\) penalties as part of our fitting process and regularize the coefficients of our predictors. In addition, cross-validation will be handled automatically.

For the sake of uniformity we’ll use the same list of regularization parameter values:

import numpy as np

alphas = np.linspace(0.01, 100, 200)

Lasso (\(\ell_1\))

from sklearn.linear_model import LassoCV

lasso_model = LassoCV(alphas=alphas, max_iter=10000)
_ = lasso_model.fit(X, y)

Ridge (\(\ell_2\))

from sklearn.linear_model import RidgeCV

ridge_model = RidgeCV(alphas=alphas, store_cv_values=True)
_ = ridge_model.fit(X, y)

Note

By default, LassoCV will use 5-fold cross-validation with \(R^2\) as the scoring method and RidgeCV will use Generalized Cross-Validation, which is a form of efficient Leave-One-Out cross-validation, with negative mean squared error.

To change the splitting strategy, set the cv keyword argument to the number of folds you’d like to use; to specify scoring and other search parameters, see the GridSearchCV class.
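
For instance, here is a sketch (the parameter values are illustrative, not the settings used above) of switching both estimators to 10-fold cross-validation, as well as the equivalent grid search over alpha with an explicit scorer:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# 10-fold cross-validation instead of the defaults (illustrative values)
lasso_10fold = LassoCV(alphas=alphas, max_iter=10000, cv=10)
ridge_10fold = RidgeCV(alphas=alphas, cv=10)

# The same alpha search expressed with GridSearchCV and explicit scoring
ridge_search = GridSearchCV(Ridge(),
                            param_grid={"alpha": alphas},
                            scoring="neg_mean_squared_error",
                            cv=10)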

Model Application

To calculate the predicted salary of each player using the model:

Lasso (\(\ell_1\))

lasso_predictions = lasso_model.predict(X)

Ridge (\(\ell_2\))

ridge_predictions = ridge_model.predict(X)

Model Evaluation

import matplotlib.pyplot as plt


def plot_predictions(y: pd.Series, predictions: pd.Series, ax: plt.Axes):
    ax.scatter(y, predictions, color="black")
    ax.plot(y, y, color="red")
    ax.set_xlabel("Real Values")
    ax.set_ylabel("Predicted Values")


fig, ax = plt.subplots(ncols=2, figsize=(12, 5))
plot_predictions(y, lasso_predictions, ax[0])
ax[0].set_title(r"Lasso ($\ell_1$)")
plot_predictions(y, ridge_predictions, ax[1])
_ = ax[1].set_title(r"Ridge ($\ell_2$)")
[Figure: actual vs. predicted salary values for the Lasso and Ridge models]

Alpha

Lasso (\(\ell_1\))

lasso_model.alpha_

2.522311557788944

Ridge (\(\ell_2\))

ridge_model.alpha_

3.527236180904522

Coefficients

coefficients_dict = {"Lasso": lasso_model.coef_, "Ridge": ridge_model.coef_}
coefficients = pd.DataFrame(coefficients_dict, index=X.columns)
coefficients.T
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks PutOuts Assists Errors League_N Division_W NewLeague_N
Lasso -237.2 261.6 0 -0 0 104.8 -46.64 -0 0 44.29 224.5 123.9 -144.8 76.81 26.48 -13.56 27.04 -114.1 0
Ridge -216.4 230.7 0.9447 0.7865 6.002 106.3 -51.34 -94.18 119.4 58.18 199.9 114.5 -144.5 77.68 39.18 -24.86 50.73 -118.6 -18.02
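
Note how the \(\ell_1\) penalty drives some coefficients exactly to zero (six of the nineteen predictors in the table above), effectively performing feature selection. A quick way to count them:

(coefficients["Lasso"] == 0).sum()

To see how each coefficient shrinks as the regularization strength grows, we can refit plain Lasso and Ridge models over the grid of alpha values and plot the resulting coefficient paths:
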
from sklearn.linear_model import Lasso, Ridge


coef_fig, coef_ax = plt.subplots(ncols=2, figsize=(14, 6))

#########
# Lasso #
#########

simple_lasso = Lasso(max_iter=10_000)
lasso_coefficient_values = []
lasso_alphas = lasso_model.alphas_[::-1]

for alpha in lasso_alphas:
    simple_lasso.set_params(alpha=alpha)
    simple_lasso.fit(X, y)
    lasso_coefficient_values.append(simple_lasso.coef_)

# Coefficient values by alpha
coef_ax[0].semilogx(lasso_alphas, lasso_coefficient_values, linestyle="dashed", alpha=0.5)

# Chosen alpha
coef_ax[0].axvline(lasso_model.alpha_,
                  linestyle='--',
                  color="black",
                  label=r"Chosen $\alpha$")
coef_ax[0].legend()
coef_ax[0].set_xlabel(r'$\alpha$')
coef_ax[0].set_ylabel("Coefficient Values")
coef_ax[0].set_title(r"Lasso: Coefficient Values by $\alpha$")

#########
# Ridge #
#########

simple_ridge = Ridge()
ridge_coefficient_values = []
ridge_alphas = ridge_model.alphas[::-1]

for alpha in ridge_alphas:
    simple_ridge.set_params(alpha=alpha)
    simple_ridge.fit(X, y)
    ridge_coefficient_values.append(simple_ridge.coef_)


# Coefficient values by alpha
coef_ax[1].semilogx(ridge_alphas,
                   ridge_coefficient_values,
                   linestyle="dashed",
                   alpha=0.5)
# Chosen alpha
coef_ax[1].axvline(ridge_model.alpha_,
                  linestyle='--',
                  color="black",
                  label=r"Chosen $\alpha$")
coef_ax[1].legend(X.columns, loc="upper left", bbox_to_anchor=(1.05, 0.95))
coef_ax[1].set_xlabel(r'$\alpha$')
coef_ax[1].set_ylabel("Coefficient Values")
_ = coef_ax[1].set_title(r"Ridge: Coefficient Values by $\alpha$")
[Figure: coefficient values as a function of alpha for the Lasso and Ridge models]

Mean Squared Error (MSE)

mse_fig, mse_ax = plt.subplots(ncols=2, figsize=(14, 6))

#########
# Lasso #
#########

lasso_alphas = lasso_model.alphas_
lasso_mse = lasso_model.mse_path_

# Cross-validation MSEs
mse_ax[0].semilogx(lasso_alphas,
                   lasso_mse,
                   linestyle="dashed",
                   color="blue",
                   alpha=0.2)

# Mean MSE
mse_ax[0].plot(lasso_alphas,
               lasso_mse.mean(axis=-1),
               color="black",
               label="Mean MSE Across Folds",
               linewidth=2)
# Chosen alpha
mse_ax[0].axvline(lasso_model.alpha_,
                  linestyle='--',
                  color="black",
                  label=r"Chosen $\alpha$")
mse_ax[0].legend()
mse_ax[0].set_xlabel(r'$\alpha$')
mse_ax[0].set_ylabel("MSE")
mse_ax[0].set_title("Lasso: Mean Squared Error (MSE) Across Folds")

#########
# Ridge #
#########

ridge_alphas = ridge_model.alphas
ridge_mse = ridge_model.cv_values_.T

# Cross-validation MSEs
mse_ax[1].semilogx(ridge_alphas,
                   ridge_mse,
                   linestyle="dashed",
                   color="blue",
                   alpha=0.2)
# Mean MSE
mse_ax[1].plot(ridge_alphas,
               ridge_mse.mean(axis=-1),
               color="black",
               label="Mean MSE Across Folds",
               linewidth=2)
# Chosen alpha
mse_ax[1].axvline(ridge_model.alpha_,
                  linestyle='--',
                  color="black",
                  label=r"Chosen $\alpha$")
mse_ax[1].legend()
mse_ax[1].set_xlabel(r'$\alpha$')
mse_ax[1].set_ylabel("MSE")
mse_ax[1].ticklabel_format(style="plain", axis="y")
mse_ax[1].set_ylim(mse_ax[0].get_ylim())
_ = mse_ax[1].set_title("Ridge: Mean Squared Error (MSE) Across Folds")
[Figure: cross-validation MSE as a function of alpha for the Lasso and Ridge models]

\(R^2\)

To calculate the in-sample \(R^2\) score for each model:

Lasso (\(\ell_1\))

lasso_model.score(X, y)

0.5342574616741457

Ridge (\(\ell_2\))

ridge_model.score(X, y)

0.5364706351171298

Note that calculating in-sample \(R^2\) scores will result in underestimating prediction error.
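
To obtain a less optimistic estimate, one could instead score a model with cross-validation; a sketch (the resulting scores are not reproduced in this exercise, and the scaling caveat from the warning above still applies):

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Out-of-sample R^2 estimate for the lasso at its chosen alpha
cv_scores = cross_val_score(Lasso(alpha=lasso_model.alpha_, max_iter=10000),
                            X, y, cv=5, scoring="r2")
cv_scores.mean()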

Note

For a more generalized solution for parameter tuning, see sklearn’s GridSearchCV.