Dimensionality Reduction

Project the data points onto a lower-dimensional space

Principal Component Analysis

Scale the features

Goal: put the features on a comparable scale so that no single dimension dominates the others

Two popular approaches:

  1. Standardization - values are centered around the mean with a unit standard deviation

$$ X_{new} = \frac{X_{old}-X_{mean}}{X_{std}} $$

  2. Normalization - values are shifted and rescaled so that they end up ranging between 0 and 1

$$ X_{new} = \frac{X_{old}-X_{min}}{X_{max} - X_{min}} $$

Standardization is good to use when the data follows a Gaussian-like distribution. Because it does not squeeze values into a fixed range, it is also less affected by outliers than normalization

Normalization is good to use when you know that the data does not follow a Gaussian distribution
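
The two scaling approaches above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `standardize` and `normalize` and the sample matrix `X` are hypothetical, each column is treated as one feature, and it is assumed that no feature is constant (otherwise the denominators would be zero).

```python
import numpy as np

def standardize(X):
    """Center each feature (column) on its mean and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    # Assumes every column has non-zero standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize(X):
    """Shift and rescale each feature (column) so its values lie between 0 and 1."""
    X = np.asarray(X, dtype=float)
    X_min = X.min(axis=0)
    X_max = X.max(axis=0)
    # Assumes every column has X_max > X_min.
    return (X - X_min) / (X_max - X_min)

# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 6000.0]])

X_standardized = standardize(X)
X_normalized = normalize(X)
```

After either transformation the second feature no longer dominates the first purely because of its larger raw values.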

Dimensionality Reduction

Suppose the data is given in some D-dimensional space, but it can be well explained in an M-dimensional subspace with M < D

Q: How to find the appropriate subspace of dimension M?

We want the projection with the highest sample variance, because it is the most informative choice: the least information is lost
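
One common way to find such a projection, not spelled out in these notes and therefore an assumption here, is to take the eigenvectors of the sample covariance matrix that have the largest eigenvalues. The sketch below follows that route; the function name `pca_project` and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def pca_project(X, M):
    """Project the rows of X (N samples, D features) onto the M directions
    of highest sample variance. Assumes D >= 2 and M <= D."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)

    # D x D sample covariance matrix (columns are features).
    cov = np.cov(X_centered, rowvar=False)

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Keep the M eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:M]
    W = eigvecs[:, order]          # D x M projection matrix
    Z = X_centered @ W             # N x M projected data
    return Z, W, eigvals[order]

# Hypothetical 2-D data that lies close to a line, so M = 1 explains it well.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

Z, W, variances = pca_project(X, M=1)
```

Each returned eigenvalue equals the sample variance of the data along the corresponding direction, so keeping the largest ones keeps the directions in which the data varies most.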