One of the main goals of statistics is to make predictions from data. In a supervised setting we are given some data \(\tilde{x}\) and wish to make predictions \(\tilde{y}\) from it. For example, we could be given the number of hours students studied (\(\tilde{x}\)) and wish to predict their exam scores (\(\tilde{y}\)); the true scores are available in the data we use, and we want our model to predict them from \(\tilde{x}\) alone. One way to do this is to assume your data roughly follows a mathematical function with parameters (which we will denote \(\theta\)) that can be adjusted to make the function fit the data better. For example, when plotted your data might look like it follows a line \(f(x;\theta)=mx+c\), where \(\theta=(m,c)^{T}\). Try playing with the parameters of the function in the graph below to find values of \(m\) and \(c\) that bring the line as close to the data points as possible.
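The idea above can be sketched in a few lines of code. The model, the parameter values, and the study-hours data here are all made up for illustration:

```python
# A parametric model f(x; theta) = m*x + c, with theta = (m, c).

def f(x, theta):
    """Linear model: predict y from x using parameters theta = (m, c)."""
    m, c = theta
    return m * x + c

# Hypothetical data: hours studied -> exam score
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [52.0, 60.0, 71.0, 79.0, 92.0]

theta = (10.0, 40.0)  # try adjusting m and c to fit the data better
predictions = [f(x, theta) for x in x_data]
print(predictions)  # [50.0, 60.0, 70.0, 80.0, 90.0]
```

Adjusting `theta` by hand, as in the interactive graph, changes how close `predictions` sits to `y_data`.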
The next thing to consider is how well a given function "fits" the data. Are the values the function predicts close to or far away from the data we've seen? To measure this we use what is known as a loss function, denoted \(\ell(f(\tilde{x};\theta),\tilde{y})\). This function tells us how far our model \(f(\tilde{x};\theta)\) is from the data \(\tilde{y}\). Generally we also want this function to output a scalar value and to have a minimum of zero (attained when the prediction at every input equals the corresponding data point). The most commonly used loss function is the mean squared error:
\[\ell(f(\tilde{x};\theta),\tilde{y}) = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_{i};\theta)-y_{i}\right)^{2}\]
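The mean squared error is straightforward to implement. A minimal sketch, assuming predictions and targets are plain Python lists of equal length:

```python
def mse(predictions, targets):
    """Mean squared error: average of the squared prediction errors."""
    n = len(predictions)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n

# Perfect predictions give the minimum loss of zero:
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
# The loss grows quadratically as predictions move away from the data:
print(mse([2.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ~0.333
```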
The mean squared error is the most commonly used loss function because it has many nice properties. Firstly, it is related to the Euclidean distance (as seen in the equation above): as the distance between the predictions and the data increases, so does the Euclidean distance, so the loss is a direct measure of distance. It also has nice statistical properties, as it decomposes into the variance and squared bias of the estimator:
\[\mathbb{E}\left[(\hat{\theta}-\theta)^{2}\right]=\operatorname{Var}(\hat{\theta})+\operatorname{Bias}(\hat{\theta})^{2}\]
A proof of this fact can be found on Wikipedia here. This is an important property: we seek a model that accurately captures the patterns in the data (low bias) but also generalises well to new data the model wasn't trained on (low variance). From the relation above we can see that minimising the mean squared error minimises both the variance and the bias, which generally leads to a good model. Bias and variance are important to every model we train in machine learning, as we always want our models to generalise well to new data whilst predicting known data accurately (balancing the two is known as the bias-variance trade-off).
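As a quick numerical sanity check of the decomposition, the sketch below estimates a known mean with a deliberately shrunk (hence biased) estimator over many simulated samples; all the numbers here are made up for illustration:

```python
import random

random.seed(0)
true_mean = 5.0
n_trials, n_samples = 20000, 10

estimates = []
for _ in range(n_trials):
    sample = [random.gauss(true_mean, 2.0) for _ in range(n_samples)]
    # Shrunk sample mean: biased towards zero, but with lower variance.
    estimates.append(0.8 * sum(sample) / n_samples)

mean_est = sum(estimates) / n_trials
bias = mean_est - true_mean
variance = sum((e - mean_est) ** 2 for e in estimates) / n_trials
mse = sum((e - true_mean) ** 2 for e in estimates) / n_trials

# The decomposition holds exactly (up to floating-point rounding):
print(mse, variance + bias ** 2)
```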
The mean squared error is not the only loss function, as we shall see in later tutorials. Depending on the problem we are solving, we may wish to define other loss functions that measure the "loss" better than the mean squared error in certain cases.
As we seek the parameters that minimise the loss function, we can define regression as the following optimisation problem:
\[\hat{\theta}=\operatorname*{arg\,min}_{\theta}\;\ell(f(\tilde{x};\theta),\tilde{y})\]
From our knowledge of optimisation we know that this problem can be solved by taking the derivative (or gradient in the multivariable case) of \(\ell(f(\tilde{x};\theta),\tilde{y})\) with respect to \(\theta\), setting it to zero (or the zero vector in the multivariable case), and solving for the parameters. So the main steps in performing regression are:

1. Choose a model \(f(x;\theta)\) that you believe fits the data.
2. Choose a loss function \(\ell(f(\tilde{x};\theta),\tilde{y})\) that measures how far the model is from the data.
3. Find the parameters \(\theta\) that minimise the loss, for example by setting its gradient to zero and solving.
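For the line \(f(x;\theta)=mx+c\) with the mean squared error, setting the gradient with respect to \(m\) and \(c\) to zero can be solved in closed form, giving the standard least-squares formulas. A minimal sketch, using made-up study-hours data:

```python
def fit_line(xs, ys):
    """Return (m, c) minimising the mean squared error of y = m*x + c.

    These formulas come from setting the gradient of the MSE with
    respect to m and c to zero and solving.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    c = y_bar - m * x_bar
    return m, c

# Hypothetical data: hours studied -> exam score
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [52.0, 60.0, 71.0, 79.0, 92.0]
m, c = fit_line(x_data, y_data)
print(m, c)  # roughly 9.9 and 41.1
```

No iterative search is needed here because the loss is quadratic in the parameters; for more complicated models the same gradient condition usually has to be solved numerically.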