Chapter 1.1: Linear Regression

One of the simplest models for regression is the linear model. In this model we have \(p\) independent variables and one dependent variable \(y\). In the problem setup we are given \(n\) observations \(X_{i}=(X_{i1}, X_{i2},...,X_{ip})\) (\(i=1,...,n\)), each with a corresponding dependent variable \(Y_{i}\). The model describes a hyperplane given by the equation:

\(Y_{i}\approx f(X_{i};\beta)=\beta_0+\beta_1X_{i1}+...+\beta_{p}X_{ip}\)

Here \(\beta=(\beta_{0},...,\beta_{p})\) are constants that have to be chosen so that the model fits the data well. We can do this by applying our knowledge of regression. For this specific problem we use least squares as our loss function (the same as MSE except we don't divide by \(n\)). So we want to solve:

\(\hat{\beta}=\underset{\beta}{\mathrm{argmin}}\hspace{0.09cm}\sum_{i=1}^{n}\left(X_{i}\beta-Y_{i}\right)^{2}\)

Note that for this to match our model we augment the data matrix with a column of ones, so that \(\beta_{0}X_{i0}=\beta_{0}\cdot 1=\beta_{0}\), i.e. we get our constant (intercept) term. We can rewrite our loss function as follows (note \(X\) is the augmented data matrix and \(X_{i}\) its \(i\)-th row):

$$\begin{aligned} \sum_{i=1}^{n}\left(X_{i}\beta-Y_{i}\right)^{2} &= \lVert X\beta-Y\rVert^{2}\\ &= (X\beta-Y)^{T}(X\beta-Y) \\ &= Y^{T}Y-Y^{T}X\beta-\beta^{T}X^{T}Y+\beta^{T}X^{T}X\beta\end{aligned}$$
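For concreteness, with \(n=3\) observations and \(p=2\) features (a purely illustrative size), the augmented data matrix, coefficient vector and target vector appearing in this expression are:

$$X=\begin{pmatrix}1 & X_{11} & X_{12}\\ 1 & X_{21} & X_{22}\\ 1 & X_{31} & X_{32}\end{pmatrix},\qquad \beta=\begin{pmatrix}\beta_{0}\\ \beta_{1}\\ \beta_{2}\end{pmatrix},\qquad Y=\begin{pmatrix}Y_{1}\\ Y_{2}\\ Y_{3}\end{pmatrix}$$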

The next step in the optimisation is to take the gradient of this with respect to \(\beta\). Doing this we get:

\(\nabla_{\beta}\lVert X\beta-Y\rVert^{2}=-2X^{T}Y+2X^{T}X\beta\)

Setting this equal to the zero vector gives the normal equations \(X^{T}X\beta=X^{T}Y\), and solving for \(\beta\) we get:

\(\hat{\beta}=(X^{T}X)^{-1}X^{T}Y\)
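
As a quick sanity check on this derivation, the analytic gradient can be compared against a finite-difference approximation on a small random problem. This is a minimal sketch; the data, seed and step size below are arbitrary choices made purely for illustration:

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(20, 3))   # small random design matrix (illustrative only)
  Y = rng.normal(size=20)
  beta = rng.normal(size=3)

  def loss(b):
    # Least squares loss ||Xb - Y||^2
    return np.sum((X @ b - Y) ** 2)

  # Analytic gradient derived above: -2 X^T Y + 2 X^T X beta
  grad = -2 * X.T @ Y + 2 * X.T @ X @ beta

  # Central finite-difference approximation of the same gradient
  eps = 1e-6
  numeric = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                      for e in np.eye(3)])

  print(np.allclose(grad, numeric, atol=1e-4))  # expect True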

Note that our loss function is convex (its Hessian is \(2X^{T}X\), which is positive semi-definite), so this solution is a minimum; it is the unique minimum provided \(X^{T}X\) is invertible, i.e. \(X\) has full column rank. We can now move on to coding up our solution and training the model.


  import numpy as np
  from sklearn.datasets import load_diabetes

  def Augmentation(X):
    # Prepend a column of ones so that beta_0 acts as the intercept term
    return np.c_[np.ones(X.shape[0]), X]

  def Error(X, Y, beta):
    # Least squares loss: ||X beta - Y||^2 (sum of squared residuals)
    return np.linalg.norm(X @ beta - Y) ** 2

  def LinearRegression(X, Y):
    # Closed-form solution of the normal equations: beta = (X^T X)^{-1} X^T Y
    return np.linalg.inv(X.T @ X) @ X.T @ Y

  diabetes = load_diabetes()
  X = Augmentation(diabetes.data)
  Y = diabetes.target

  beta = LinearRegression(X, Y)
  error = Error(X, Y, beta)
  

In the above code we train the model on the diabetes dataset, a dataset with 10 independent variables and one dependent variable we are trying to predict. To do this we define a function for adding a column of ones onto the data matrix, an error function to calculate the value of our loss function, and a function implementing the closed-form solution we derived above.
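
As an aside, explicitly inverting \(X^{T}X\) can be numerically fragile when features are close to collinear; solving the normal equations directly, or using a least-squares solver, is generally preferred. A minimal sketch of both alternatives, reusing \(X\), \(Y\) and \(\beta\) from the code above, which should agree with the closed-form coefficients:

  # Solve X^T X beta = X^T Y without forming the explicit inverse
  beta_solve = np.linalg.solve(X.T @ X, X.T @ Y)

  # Or let NumPy's least-squares routine handle it directly
  beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

  print(np.allclose(beta, beta_solve), np.allclose(beta, beta_lstsq))  # expect True True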

Running the code we get a high error of approximately 1263985. This suggests linear regression probably wasn't the best model for making predictions on this dataset. Also in the Colab notebook is a code segment for viewing plots of the data and prediction lines. When we plot them we see that the data has a wide spread (high variance) around the fitted hyperplane, which a linear model is unable to capture.
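
Note that the error above is the raw sum of squared residuals over all observations, so its size partly reflects the number of samples. A small sketch, reusing the variables from the code above, converting it to a mean squared error and a root mean squared error in the units of the target:

  n = X.shape[0]         # number of observations
  mse = error / n        # mean squared error
  rmse = np.sqrt(mse)    # root mean squared error, in the units of the target
  print(mse, rmse)

Comparing the RMSE to the spread of the target values gives a better sense of how far off the predictions typically are.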