Exercise VIII: K-Means Clustering and PCA

Iris Dataset

Let’s start things off with a simple application of our subject material on the Iris dataset to better understand our workflow and the relevant metrics.

Setup

import pandas as pd

from sklearn.datasets import load_iris


data = load_iris(as_frame=True)

X_iris, y_iris = data.data, data.target
y_iris = y_iris.replace(
    {index: name
     for index, name in enumerate(data["target_names"])})
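
As a quick sanity check (not part of the original exercise), we can confirm what we loaded: 150 observations, 4 features, and 50 samples per species:

# 150 observations x 4 features; 50 samples per species
print(X_iris.shape)
print(y_iris.value_counts())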

Dimensionality Reduction using Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

IRIS_N_COMPONENTS = X_iris.shape[1] - 1

iris_pca = PCA(n_components=IRIS_N_COMPONENTS)
_ = iris_pca.fit(X_iris)

Let’s evaluate how well PCA managed to explain the variance using only a small number of components:

import matplotlib.pyplot as plt
import numpy as np

from plots.explained_variance import create_explained_variance_df, plot_explained_variance_ratio


# Create explained variance ratio figure
fig_kwargs = {"figsize": (6, 6)}
iris_explained_variance_fig, iris_explained_variance_ax = plot_explained_variance_ratio(
    iris_pca, fig_kwargs=fig_kwargs)

# Create explained variance ratio dataframe
explained_variance_df = create_explained_variance_df(iris_pca)
              Individual Component  Cumulative
# Components
1                         0.924619    0.924619
2                         0.053066    0.977685
3                         0.017103    0.994788
[Figure: explained variance ratio by principal component for the Iris PCA]

We’ve managed to capture 97.77% of the variance using only 2 components, and 99.48% using 3; not bad at all!
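
These percentages come straight from the fitted estimator’s explained_variance_ratio_ attribute; a minimal sketch of reproducing them without the plotting helpers:

# Per-component and cumulative explained variance ratios
individual = iris_pca.explained_variance_ratio_
cumulative = np.cumsum(individual)
print(individual.round(4))
print(cumulative.round(4))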

To calculate the coordinates of the dataset along the computed principal components, we have to call transform():

X_iris_reduced = iris_pca.transform(X_iris)
X_iris_reduced.shape
(150, 3)

Note

As with many other classes that implement this general fit/transform procedure, sklearn also provides a fit_transform() method to do both steps more conveniently.
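
For example, the two steps above can be collapsed into a single call (an equivalent sketch; the workflow above is unchanged):

# Equivalent to calling fit() and then transform() separately
X_iris_reduced_alt = iris_pca.fit_transform(X_iris)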

Fortunately, 2 and 3 dimensions are also simple enough to visualize:

from plots.pca_scatter import plot_reduced


y_iris_labels = y_iris.replace({"setosa": 0, "versicolor": 1, "virginica": 2}).values

# First three principal components
plot_reduced(X_iris_reduced, y_iris_labels)
# First two principal components
plot_reduced(X_iris_reduced[:, :2], true_labels=y_iris_labels)
[Figure: Iris observations plotted in the 3D and 2D PCA-reduced spaces]

We can already guess that the setosa class is the clearly discernible cloud on the left of both dimensionality-reduced scatter plots.

K-Means Clustering

Next, we will try to identify the underlying classes, i.e. the Iris species, and compare our results against the actual labels. Essentially, we are checking how reducing the feature space with PCA impacts our ability to detect the different Iris species using K-means clustering.

from sklearn.cluster import KMeans


N_IRIS_CLASSES = 3

iris_kmeans = KMeans(n_clusters=N_IRIS_CLASSES, random_state=0)

Warning

Note that this is an educational example meant to improve our understanding of dimensionality reduction and clustering, not to represent a relevant use case for clustering. In most cases, clustering is an appealing choice when we don’t actually know the labels (which is why it is considered an unsupervised learning method).

# Cluster using all four original features, and using 3, 2, and 1 principal components
iris_labels_4 = iris_kmeans.fit_predict(X_iris)
iris_labels_3 = iris_kmeans.fit_predict(X_iris_reduced[:, :3])
iris_labels_2 = iris_kmeans.fit_predict(X_iris_reduced[:, :2])
iris_labels_1 = iris_kmeans.fit_predict(X_iris_reduced[:, :1])

Again, we’ll plot the observations in their reduced spaces, only this time we will include the cluster index (represented with the color of each observation), as well as the classification’s success relative to the true labels (represented with circles for correct versus “X” for incorrect classifications).

plot_reduced(X_iris_reduced, iris_labels_3, y_iris_labels)
plot_reduced(X_iris_reduced[:, :2], iris_labels_2, y_iris_labels)
plt.show()
[Figure: K-means cluster assignments in the 3D and 2D reduced spaces, with correct vs. incorrect assignments marked]

Clustering Evaluation

How do you think we did relative to the baseline (clustering using all 4 features) as we reduced the dimensionality of the dataset?

from plots.accuracy import plot_accuracy_by_pc


classification_labels = (iris_labels_1, iris_labels_2, iris_labels_3,
                         iris_labels_4)
_ = plot_accuracy_by_pc(classification_labels, y_iris_labels)
[Figure: clustering accuracy by number of principal components]

Surprisingly, reducing the iris dataset to a single principal component significantly improved the agreement of the K-means clusters with the true labels.
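
One wrinkle in this comparison is that K-means assigns arbitrary cluster indices, so clusters first have to be matched to classes before an accuracy can be computed. A minimal sketch of one simple matching scheme (majority vote per cluster; not necessarily what plot_accuracy_by_pc does internally):

def clustering_accuracy(cluster_labels, true_labels):
    # Map each cluster to the most frequent true label among its members,
    # then score the resulting labelling against the true labels
    mapped = np.empty_like(true_labels)
    for cluster_id in np.unique(cluster_labels):
        mask = cluster_labels == cluster_id
        mapped[mask] = np.bincount(true_labels[mask]).argmax()
    return (mapped == true_labels).mean()


clustering_accuracy(iris_labels_1, y_iris_labels)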

sklearn’s metrics module also provides us with the silhouette_score metric for evaluating the degree of separation between the clusters:

from plots.silhouette import plot_silhouette_scores_by_pc


_ = plot_silhouette_scores_by_pc(X_iris_reduced, classification_labels)
[Figure: silhouette scores by number of principal components]

It seems that using only a single principal component yielded not only the best results when compared to the true labels, but also the greatest degree of separation between the clusters (as estimated using silhouette_score).
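
For reference, the same metric can be computed directly with sklearn.metrics.silhouette_score; a minimal sketch, scoring each clustering in the reduced space it was fitted on:

from sklearn.metrics import silhouette_score

for n_pcs, labels in zip((1, 2, 3), (iris_labels_1, iris_labels_2, iris_labels_3)):
    # Score each clustering in the n_pcs-dimensional space it was computed in
    score = silhouette_score(X_iris_reduced[:, :n_pcs], labels)
    print(f"{n_pcs} PC(s): silhouette score = {score:.3f}")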

ASD Subtypes Classification

So, now we know how to implement both PCA and K-means clustering using sklearn. Next, we will explore the possibility of identifying ASD subtypes in the ABIDE II dataset, again based on Prof. Tal Yarkoni’s workshop at NeuroHackademy 2020 (available here). The procedure is similar to what we did with the iris dataset, in that it combines PCA and K-means to identify clusters in a dimensionality-reduced space; this time, however, the labels are not known to us, and we will have to judge for ourselves whether the analysis produces convincing results.

First, as always, we have to load the data and clean it up a little bit:

TSV_URL = "https://raw.githubusercontent.com/neurohackademy/nh2020-curriculum/master/tu-machine-learning-yarkoni/data/abide2.tsv"
data = pd.read_csv(TSV_URL, delimiter="\t")

# Clean up
IGNORED_COLUMNS = ["age_resid", "sex"]
REPLACE_DICT = {"group": {1: "ASD", 2: "Control"}}
data.drop(columns=IGNORED_COLUMNS, inplace=True)
data.replace(REPLACE_DICT, inplace=True)
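
A quick optional check that the group codes were mapped as intended:

# Expect only "ASD" and "Control" after the replacement
print(data["group"].value_counts())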

This time, however, we’re only interested in ASD-positive observations:

asd_positive = data["group"] == "ASD"

# Feature matrix
X = data[asd_positive].filter(regex="^fs").copy()
X
      fsArea_L_V1_ROI  fsArea_L_MST_ROI  fsArea_L_V6_ROI  ...  fsCT_R_a32pr_ROI  fsCT_R_p24_ROI
0              2750.0             306.0            354.0  ...             2.699           3.179
1              2836.0             186.0            354.0  ...             2.746           3.324
3              3382.0             266.0            422.0  ...             3.014           3.264
5              3020.0             390.0            408.0  ...             3.165           2.637
6              3522.0             140.0            471.0  ...             2.926           3.333
...               ...               ...              ...  ...               ...             ...
992            3231.0             202.0            398.0  ...             3.107           3.069
995            2679.0             218.0            309.0  ...             2.248           3.410
996            1739.0             229.0            454.0  ...             3.530           3.340
998            3393.0             319.0            583.0  ...             2.632           3.485
1003           2649.0             140.0            307.0  ...             3.263           2.981
463 rows × 1440 columns
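
To get a feel for what these 1,440 columns contain, we can count them by their FreeSurfer measure prefix (fsArea, fsCT, and so on); a quick exploratory sketch, not part of the original workflow:

# Count features by the measure encoded in the column prefix
# (e.g. fsArea_* for surface area, fsCT_* for cortical thickness)
print(pd.Series(X.columns).str.split("_").str[0].value_counts())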

Principal Component Analysis (PCA)

We will use PCA to try and “capture” as much of the variance of our 1,440 features as possible using a limited number of principal components.

from sklearn.decomposition import PCA

N_COMPONENTS = 20

pca = PCA(n_components=N_COMPONENTS)
X_reduced = pca.fit_transform(X)
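
A quick shape check (a sanity-check sketch): one row per ASD-positive participant, one column per retained component:

print(X_reduced.shape)  # expected: (463, 20)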

Let’s evaluate how well PCA managed to explain the variance using only a small number of components:

_ = plot_explained_variance_ratio(pca)
[Figure: explained variance ratio by principal component for the ABIDE II features]

It seems that the added value of using more than 2 or 3 components is relatively small, with 3 components explaining 49.43% of the total variance.
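
As an aside (not part of the original exercise), scikit-learn can also choose the number of components for us: passing a float between 0 and 1 as n_components keeps just enough components to explain that fraction of the variance.

# Keep just enough components to explain at least 80% of the variance
pca_80 = PCA(n_components=0.8)
X_reduced_80 = pca_80.fit_transform(X)
print(pca_80.n_components_,
      f"components explain {pca_80.explained_variance_ratio_.sum():.2%} of the variance")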

K-Means Clustering

Clustering Evaluation

Unlike the Iris example, we don’t know in advance how many clusters (ASD subtypes) to look for, so we’ll start by computing silhouette scores for a range of values of \(k\):

from plots.silhouette import plot_silhouette_scores_by_k


X_3 = X_reduced[:, :3]
_ = plot_silhouette_scores_by_k(X_3, max_k=10)
[Figure: silhouette scores by number of clusters (k) on the three-component representation]
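
Under the hood, this kind of plot amounts to fitting K-means for each candidate k and computing the corresponding silhouette score; a minimal sketch, assuming k runs from 2 to 10 and clustering is done on the three-component representation:

from sklearn.metrics import silhouette_score

silhouette_by_k = {}
for k in range(2, 11):
    # Fit K-means with k clusters and score the separation of the result
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_3)
    silhouette_by_k[k] = silhouette_score(X_3, labels)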

The best degree of separation was obtained for \(k=2\). Let’s visualize the two clusters and get a qualitative sense of how well (or poorly) our model did:

classification_labels = KMeans(n_clusters=2, random_state=0).fit(X_3).labels_
plot_reduced(X_3, classification_labels)

Indeed, it seems our model’s ability to cluster ASD subtypes using three principal components is rather poor.

Back to the Drawing Board