Clustering | Notion

Clustering: K-means

Dimensionality reduction: PCA

K-means algorithm for clustering

K-means is an instance of an iterative algorithm for clustering that iterated between two steps:

suppose we have N data points, $x_i$ for $1 \le i \le N$

Define an indicator variable

$$ r_{nk}=\begin{cases} 1\quad \text{if } x_n \text{is in cluster k}\\ 0 \quad otherwise \end{cases} $$

Distortion measure (the distance, the thing we want to minimize)

$$ J = \sum_{n=1}^N\sum_{k=1}^Kr_{nk}||x_n-\mu_k||^2 $$

K: total number of clusters, pre-determined

distance → distance between a point and its cluster representative

Step 0: Initialization

Start with some random initialization of the prototype

Step 1: Assignment

For each data point, find the prototype that is closest to it

$$ r_{nk}=\begin{cases} 1\quad \text{if } k= \text{argmin}||x_n-\mu_k||^2\\ 0 \quad otherwise \end{cases} $$

Step2: Refitting

Suppose indicators($r_{nk}$) are given

Distortion is a quadratic function of prototypes, given fixed indicators

Take the derivative of that and solve for the minimum

$$ \mu_k = \frac{\sum r_{nk} \cdot x_n}{\sum r_{nk}} $$

One evaluation of step 1(assignment) and one evaluation of step 2(derivatives) constitute one iteration
Iterate until a stopping criterion is satisfied
- Stopping criterion: when the distortion stops decreasing