Dimensionality Reduction

Project points onto a lower-dimensional space

Principal Component Analysis

Scale the features

Goal: put the features on a comparable scale so that no single dimension dominates

Standardization - values are shifted to zero mean and rescaled to unit standard deviation

$$ X_{new} = \frac{X_{old}-X_{mean}}{\text{STD}(\text{all xs})} $$
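A minimal sketch of standardizing a single feature, using only the Python standard library. The function name `standardize` and the sample values are illustrative, and the population standard deviation (dividing by N) is an assumption, since the notes do not say whether N or N-1 is intended.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return (x - mean) / std for every x in one feature column."""
    mean = fmean(values)
    # Assumption: population standard deviation (divides by N).
    std = pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mean) / std for x in values]

# Hypothetical feature column, e.g. heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights_cm)

# After standardizing, the feature has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(fmean(scaled), pstdev(scaled))
```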

Normalization - values are shifted and rescaled so that they end up ranging between 0 and 1

$$ X_{new} = \frac{X_{old}-X_{min}}{X_{max} - X_{min}} $$
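A short sketch of min-max normalization for one feature, where each value is shifted by the minimum and divided by the range (maximum minus minimum), so the smallest value maps to 0 and the largest to 1. The function name `min_max_normalize` and the sample values are illustrative.

```python
def min_max_normalize(values):
    """Rescale one feature column so its values lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant feature")
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical feature column.
incomes = [20.0, 35.0, 50.0, 80.0]
scaled = min_max_normalize(incomes)

print(scaled)  # the minimum maps to 0.0 and the maximum to 1.0
```

Because the range is set by the extreme values, a single very large outlier would squeeze all the other points toward 0.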

Standardization is good to use when the data follows a Gaussian-like distribution. Because it does not squeeze values into a fixed range, it is also less affected by outliers than normalization

Normalization is good to use when you know that the distribution does not follow a Gaussian distribution

Suppose the data is provided in some D-dimensional space, but it can be well explained in an M-dimensional subspace with M < D

We want to look at the projection with the highest sample variance, because this will be the most informative choice (least information will be lost)
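To make the "highest sample variance" criterion concrete, here is a small sketch that projects the same 2-D points onto two candidate directions and compares the variance of the projections. The points and the two directions are made up for illustration, and `project` is a hypothetical helper.

```python
from statistics import pvariance

def project(points, u):
    """Project each point onto direction u via the dot product."""
    return [sum(p_d * u_d for p_d, u_d in zip(p, u)) for p in points]

# Hypothetical data: spread out along the x-axis, nearly flat along the y-axis.
points = [(-2.0, 0.1), (-1.0, -0.1), (0.0, 0.0), (1.0, 0.1), (2.0, -0.1)]

var_along_x = pvariance(project(points, (1.0, 0.0)))
var_along_y = pvariance(project(points, (0.0, 1.0)))

# The x direction keeps far more of the spread, so it is the more
# informative one-dimensional projection for this data set.
print(var_along_x, var_along_y)
```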

Suppose we have N data points in a D-dimensional space. If we want to project the data onto the most informative direction, let $u_1$ be a vector (conventionally of unit length) pointing in that direction
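Recall the sample mean $\bar X = \frac{1}{N}\sum^N_{n=1} X_n$. By linearity, the sample mean of the projected data equals the projection of the sample mean: $\frac{1}{N}\sum^N_{n=1} u_1^T X_n = u_1^T \bar X$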
Sample covariance matrix (the variance of the data in D dimensions): $S = \frac{1}{N}\sum^N_{n=1}(X_n - \bar X)(X_n - \bar X)^T$
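The sketch below builds $S$ for a tiny data set and checks two facts numerically: the mean of the projected points equals the projection of the mean, and the variance of the projected points equals $u_1^T S u_1$. The data, the direction `u1` (assumed to have unit length), and all helper names are illustrative, and the code uses only the standard library.

```python
def mean_vector(points):
    """Component-wise sample mean of a list of D-dimensional points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sample_covariance(points):
    """S = (1/N) * sum_n (x_n - mean)(x_n - mean)^T as a D x D nested list."""
    n = len(points)
    mean = mean_vector(points)
    dim = len(mean)
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
            for i in range(dim)]

def dot(a, b):
    return sum(a_d * b_d for a_d, b_d in zip(a, b))

def quadratic_form(u, S):
    """Compute u^T S u."""
    size = len(u)
    return sum(u[i] * S[i][j] * u[j] for i in range(size) for j in range(size))

# Hypothetical 2-D data and an assumed unit-length direction (0.36 + 0.64 = 1).
points = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0), (7.0, 8.0)]
u1 = (0.6, 0.8)

S = sample_covariance(points)
projected = [dot(p, u1) for p in points]

projected_mean = sum(projected) / len(projected)
projected_variance = sum((z - projected_mean) ** 2 for z in projected) / len(projected)

# Both pairs agree up to floating-point rounding.
print(projected_mean, dot(mean_vector(points), u1))
print(projected_variance, quadratic_form(u1, S))
```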