Formula of the line: $\hat{Y} = \hat{f}(X) = \hat{\beta}_1 X + \hat{\beta}_0$
The goal is to find the regression coefficients. In fact, training amounts to finding these coefficients; the main approach is to find the coefficients that result in the smallest MSE.
We can replace $\hat y$ with the linear model and use the MSE as the loss function:
$$ L(\beta_0,\beta_1)= \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n\left[y_i-(\beta_1 x_i + \beta_0)\right]^2 $$
Then the optimal values for $\beta_0, \beta_1$ are $\underset{\beta_0,\beta_1}{\text{argmin}}\, L(\beta_0,\beta_1)$. There are several ways to find them.
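As a concrete reference point for the approaches listed next, here is a minimal Python sketch of this loss for a single predictor; the helper name `mse_loss` and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def mse_loss(beta0, beta1, x, y):
    """Mean squared error of the line y_hat = beta1 * x + beta0."""
    y_hat = beta1 * x + beta0
    return np.mean((y - y_hat) ** 2)

# hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# loss for one candidate pair of coefficients (beta0 = 0, beta1 = 2)
print(mse_loss(0.0, 2.0, x, y))
```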
Brute force - try many candidate combinations of coefficients and keep the best
Exact analytical solution - solve a system of equations
$\hat{\beta_1} = \frac{\sum_i(x_i-\bar x)(y_i-\bar y)}{\sum_i(x_i-\bar x)^2}$, $\hat\beta_0=\bar y - \hat\beta_1\bar x$
Gradient descent - use the gradient of the loss to iteratively step toward the minimum (both the analytical solution and gradient descent are sketched below)
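Below is a minimal Python sketch of the last two approaches on the same hypothetical toy data: the closed-form estimates from the formula above, and a plain gradient-descent loop on the MSE. The function names and hyperparameters (`lr`, `n_iter`) are illustrative choices, not prescribed by these notes.

```python
import numpy as np

def fit_closed_form(x, y):
    """Exact least-squares solution for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def fit_gradient_descent(x, y, lr=0.01, n_iter=5000):
    """Minimize the MSE by repeatedly stepping against the gradient."""
    beta0, beta1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        resid = y - (beta1 * x + beta0)
        # partial derivatives of (1/n) * sum(resid^2)
        grad0 = -2.0 / n * np.sum(resid)
        grad1 = -2.0 / n * np.sum(resid * x)
        beta0 -= lr * grad0
        beta1 -= lr * grad1
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(fit_closed_form(x, y))
print(fit_gradient_descent(x, y))  # should approach the closed-form values
```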
Often, we use multiple predictors
$Y = y_1, \dots, y_n, \; X = X_1, \dots, X_j$
The model takes a simple algebraic form: $\boldsymbol Y = \boldsymbol X\boldsymbol\beta + \boldsymbol\epsilon$, where $\boldsymbol Y$, $\boldsymbol\beta$, and $\boldsymbol\epsilon$ are vectors and $\boldsymbol X$ is a matrix
Thus, the MSE can be expressed in vector notation as $MSE(\boldsymbol\beta) = \frac{1}{n}\lVert\boldsymbol Y - \boldsymbol X\boldsymbol\beta \rVert^2$
Minimizing the MSE using vector calculus yields $\boldsymbol{\hat \beta} = \underset{\boldsymbol\beta}{\text{argmin}}\,MSE(\boldsymbol\beta) = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol Y$ (assuming $\boldsymbol{X}^T\boldsymbol{X}$ is invertible)
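A short numpy sketch of this multiple-regression solution on hypothetical simulated data; it solves the normal equations directly and also shows `np.linalg.lstsq`, which is the more numerically stable route in practice.

```python
import numpy as np

# hypothetical design matrix with an intercept column
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_beta = np.array([1.0, 2.0, -0.5])
Y = X @ true_beta + rng.normal(scale=0.1, size=100)

# normal equations: beta_hat = (X^T X)^{-1} X^T Y
# (solving the linear system avoids forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)

# equivalent least-squares solution, more stable numerically
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat_lstsq)
```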