Clustering: K-means
Dimensionality reduction: PCA
K-means is an instance of an iterative algorithm for clustering that iterated between two steps:
suppose we have N data points, $x_i$ for $1 \le i \le N$
Define an indicator variable
$$ r_{nk}=\begin{cases} 1\quad \text{if } x_n \text{is in cluster k}\\ 0 \quad otherwise \end{cases} $$
Distortion measure (the distance, the thing we want to minimize)
$$ J = \sum_{n=1}^N\sum_{k=1}^Kr_{nk}||x_n-\mu_k||^2 $$
K: total number of clusters, pre-determined
distance → distance between a point and its cluster representative
Step 0: Initialization
Start with some random initialization of the prototype
Step 1: Assignment
For each data point, find the prototype that is closest to it
$$ r_{nk}=\begin{cases} 1\quad \text{if } k= \text{argmin}||x_n-\mu_k||^2\\ 0 \quad otherwise \end{cases} $$
Step2: Refitting
Suppose indicators($r_{nk}$) are given
Distortion is a quadratic function of prototypes, given fixed indicators
Take the derivative of that and solve for the minimum
$$ \mu_k = \frac{\sum r_{nk} \cdot x_n}{\sum r_{nk}} $$