Predicting a variable

Let’s consider a scenario where we seek to predict the value of some variable based on the values of other variable(s).

Predictor Variable and Response Variable

Predictor variable (features, covariates, $X$): variables whose values we use to make our prediction

Response variable (outcome, dependent variable, $Y$): the variable whose value we want to predict

True vs Statistical Model

We will assume that the response variable, $Y$, relates to the predictors, $X$, through some function $f$, expressed generally as $Y = f(X) + \epsilon$

Here, $f$ is the unknown function expressing the underlying rule relating $Y$ to $X$, and $\epsilon$ is the random amount by which $Y$ differs from the rule $f(X)$

A statistical model is any algorithm that estimates $f$; the resulting estimate is denoted $\hat{f}$
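As a small illustration, the sketch below simulates data from a known rule and then estimates it. The linear form of $f$ and the least-squares fit are only assumptions for demonstration, not part of the general setup:

```python
import numpy as np

# Hypothetical "true" rule f (normally unknown to us)
def f(x):
    return 3 + 0.5 * x

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)          # predictor values X
epsilon = rng.normal(0, 1, size=100)      # random deviation from the rule
y = f(x) + epsilon                        # Y = f(X) + epsilon

# A statistical model estimates f from the data;
# here the estimate f_hat is a least-squares line
slope, intercept = np.polyfit(x, y, deg=1)
f_hat = lambda x_new: intercept + slope * x_new

print(f(5.0), f_hat(5.0))                 # true rule vs. estimate at X = 5
```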

Prediction vs Estimation

Depending on the nature of the problem, we may or may not seek to estimate $f$ explicitly

In inference problems, we seek to build the function $\hat{f}$ explicitly, and then use it to compute the response

In prediction problems, we don’t care to build this function explicitly, but rather to make our predictions $\hat{y}$ as close to the observed values $y$ as possible
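To make the distinction concrete, here is a minimal sketch assuming scikit-learn is available; the linear model is just one possible $\hat{f}$. Inference looks at the fitted model itself, while prediction only asks how close $\hat{y}$ is to $y$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 + 0.5 * X[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)       # build f_hat from the data

# Inference: inspect the estimated f_hat itself (its fitted parameters)
print("intercept and slope:", model.intercept_, model.coef_)

# Prediction: only care how close y_hat is to the observed y
y_hat = model.predict(X)
print("mean squared error:", mean_squared_error(y, y_hat))
```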

Simple Prediction Models

K-Nearest Neighbor

The **K-Nearest Neighbors (KNN) model** is an intuitive way to predict a quantitative response variable: to predict a response for a set of observed predictor values, we use the responses of the other observations most similar to it

Most similar how? For numerical values, use a notion of distance (Euclidean, Manhattan, etc.)

For categorical values, use a majority rule
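A minimal from-scratch sketch of this idea for a quantitative response (the toy data, the choice of $k$, and the two distance options are illustrative assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, metric="euclidean"):
    """Predict a quantitative response for x_new by averaging the
    responses of its k nearest training observations."""
    diffs = X_train - x_new
    if metric == "euclidean":
        dists = np.sqrt((diffs ** 2).sum(axis=1))
    elif metric == "manhattan":
        dists = np.abs(diffs).sum(axis=1)
    else:
        raise ValueError("unknown metric")
    nearest = np.argsort(dists)[:k]      # indices of the k closest observations
    return y_train[nearest].mean()       # average their responses

# Toy data: y roughly follows 2x plus noise
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 1))
y_train = 2 * X_train[:, 0] + rng.normal(0, 1, size=50)

print(knn_predict(X_train, y_train, np.array([4.0]), k=5))
```

For a categorical response, the final averaging step would be replaced by a majority vote among the $k$ neighbors.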