
Bootstrapping

Since we often don’t have that many data points to estimate a confidence interval directly, we can use the idea of bootstrapping. Bootstrapping is the practice of resampling from the observed data $(X, Y)$ to estimate statistical properties.

For instance, we can sample with replacement to create multiple datasets, and from those datasets we can train the model multiple times to obtain multiple fits.

Because each resample contains a different subset of the points, this reduces the effect that any individual outlier has on the final estimate.
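As a rough illustration (not from the original notes), here is a minimal Python sketch of bootstrapping a simple linear regression; the toy data-generating process and the number of resamples `B` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 2 + 3x + noise
n = 30
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 2, size=n)

B = 1000  # number of bootstrap resamples (arbitrary choice)
slopes, intercepts = [], []
for _ in range(B):
    # Sample indices with replacement to form one bootstrap dataset
    idx = rng.integers(0, n, size=n)
    xb, yb = x[idx], y[idx]
    # Fit an ordinary least squares line to the resampled data
    b1, b0 = np.polyfit(xb, yb, deg=1)
    slopes.append(b1)
    intercepts.append(b0)

# The spread of the bootstrap fits estimates the variability of the coefficients
print("slope:     mean %.3f, std %.3f" % (np.mean(slopes), np.std(slopes)))
print("intercept: mean %.3f, std %.3f" % (np.mean(intercepts), np.std(intercepts)))
```

The standard deviation of the bootstrap slopes is itself an estimate of $SE(\hat\beta_1)$, which connects to the formulas in the next section.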

Standard Error

Quantifying the Model

Once we have multiple regression fits, we can use the standard error to quantify how much the estimated coefficients vary.

Assume $y = \beta_0 + \beta_1x + \epsilon$ and model $\epsilon$ as a random variable with mean zero and variance $\sigma^2$.

$$
SE(\hat\beta_1) = \sqrt{\text{VAR}(\hat\beta_1)} \\
\text{VAR}(\hat\beta_1) = \text{VAR}\left(\frac{\sum_i(x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2}\right) \\
\vdots \\
\text{VAR}(\hat\beta_1) = \frac{\sigma^2}{\sum_i(x_i-\bar x)^2}
$$
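One way to fill in the skipped steps (a sketch, treating the $x_i$ as fixed and the $y_i$ as independent with variance $\sigma^2$): since $\sum_i(x_i-\bar x)\bar y = 0$, the estimator is a linear combination of the $y_i$,

$$
\hat\beta_1 = \frac{\sum_i(x_i-\bar x)\,y_i}{\sum_i(x_i-\bar x)^2}, \qquad
\text{VAR}(\hat\beta_1) = \frac{\sum_i(x_i-\bar x)^2\,\text{VAR}(y_i)}{\left(\sum_i(x_i-\bar x)^2\right)^2} = \frac{\sigma^2}{\sum_i(x_i-\bar x)^2}
$$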

so the standard error is

$$ SE(\hat\beta_1) = \frac{\sigma}{\sqrt{\sum_i(x_i-\bar x)^2}} $$

Similarly,

$$
\text{VAR}(\hat\beta_0) = \text{VAR}(\bar y - \hat\beta_1\bar x) \\
\vdots \\
SE(\hat\beta_0) = \sigma\sqrt{\frac{1}{n}+\frac{\bar x^2}{\sum_i(x_i-\bar x)^2}}
$$
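The skipped steps can be filled in similarly (a sketch, using $\hat\beta_0 = \bar y - \hat\beta_1\bar x$ and the fact that $\bar y$ and $\hat\beta_1$ are uncorrelated under this model):

$$
\text{VAR}(\hat\beta_0) = \text{VAR}(\bar y) + \bar x^2\,\text{VAR}(\hat\beta_1) = \frac{\sigma^2}{n} + \frac{\bar x^2\,\sigma^2}{\sum_i(x_i-\bar x)^2}
$$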

better data → $\sigma^2$ goes down → SE goes down

more data → $n$ goes up → SE goes down

larger coverage → the spread of the $x_i$ goes up, so $\sum_i(x_i-\bar x)^2$ goes up → SE goes down

Refinement for unknown noise variance

$$ \sigma \approx \sqrt{\frac{\sum_i(\hat f(x_i)-y_i)^2}{n-2}} $$

If the noise variance is indeed unknown (not uncommon), it too needs to be estimated from the data, as above; the $n-2$ in the denominator accounts for the two parameters $\hat\beta_0$ and $\hat\beta_1$ already estimated from the same data.
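Putting the pieces together, here is a minimal Python sketch (reusing the toy data assumed in the earlier bootstrap example) that fits the line, estimates $\sigma$ from the residuals with $n-2$ degrees of freedom, and plugs it into the standard-error formulas:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 2 + 3x + noise
n = 30
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 2, size=n)

# Ordinary least squares fit: y ≈ b0 + b1 * x
b1, b0 = np.polyfit(x, y, deg=1)

# Estimate the noise standard deviation from the residuals
# (n - 2 degrees of freedom, since two parameters were estimated)
residuals = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))

# Plug the estimate into the closed-form standard errors
sxx = np.sum((x - x.mean()) ** 2)
se_b1 = sigma_hat / np.sqrt(sxx)
se_b0 = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / sxx)

print("beta1_hat = %.3f  SE = %.3f" % (b1, se_b1))
print("beta0_hat = %.3f  SE = %.3f" % (b0, se_b0))
```

These closed-form standard errors should be in the same ballpark as the bootstrap standard deviations from the earlier sketch.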