Clustering: K-means

Dimensionality reduction: PCA

K-means algorithm for clustering

K-means is an instance of an iterative algorithm for clustering that iterated between two steps:

  1. Assignment: for each data point assign the label of the closet prototype
  2. Refitting: move each cluster center (prototype) to its center of gravity.

Mathematical set-up

suppose we have N data points, $x_i$ for $1 \le i \le N$

Define an indicator variable

$$ r_{nk}=\begin{cases} 1\quad \text{if } x_n \text{is in cluster k}\\ 0 \quad otherwise \end{cases} $$

Distortion measure (the distance, the thing we want to minimize)

$$ J = \sum_{n=1}^N\sum_{k=1}^Kr_{nk}||x_n-\mu_k||^2 $$

K: total number of clusters, pre-determined

distance → distance between a point and its cluster representative

Procedure

Step 0: Initialization

Start with some random initialization of the prototype

Step 1: Assignment

For each data point, find the prototype that is closest to it

$$ r_{nk}=\begin{cases} 1\quad \text{if } k= \text{argmin}||x_n-\mu_k||^2\\ 0 \quad otherwise \end{cases} $$

Step2: Refitting

Suppose indicators($r_{nk}$) are given

Distortion is a quadratic function of prototypes, given fixed indicators

Take the derivative of that and solve for the minimum

$$ \mu_k = \frac{\sum r_{nk} \cdot x_n}{\sum r_{nk}} $$