Chapter 1.3: Elastic Net

We can extend our linear model in a different way than we did in the previous tutorial, by adding L1 (lasso) and L2 (ridge) regularisation terms to the loss function. Our loss function then becomes:

\(\mathcal{L}(\beta)=\frac{1}{m}\left \lVert Y-X\beta \right \rVert^{2} + \lambda(\alpha\left \lVert \beta \right \rVert_{1} + (1-\alpha) \left \lVert \beta \right \rVert^{2})\)

This is known as the elastic net (note that we are now using the MSE as our loss, hence the \(\frac{1}{m}\) term, where \(m\) is the number of samples). Here \(\lambda>0\) is a hyperparameter that controls how strongly the regularisation influences our model, and \(\alpha\in[0,1]\) balances between L1 and L2 regularisation. When \(\alpha=1\) only the L1 term remains, so it is equivalent to lasso regression; similarly, when \(\alpha=0\) only the L2 term remains, so it is equivalent to ridge regression.
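Writing out the two extremes makes the connection explicit:

\(\alpha=1:\quad \mathcal{L}(\beta)=\frac{1}{m}\left \lVert Y-X\beta \right \rVert^{2} + \lambda\left \lVert \beta \right \rVert_{1}\) (lasso)

\(\alpha=0:\quad \mathcal{L}(\beta)=\frac{1}{m}\left \lVert Y-X\beta \right \rVert^{2} + \lambda\left \lVert \beta \right \rVert^{2}\) (ridge)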

In terms of an explicit solution, none exists except when \(\alpha=0\), because of the L1 regularisation term: the absolute value function is not differentiable at 0, and its subgradient is the sign function, so we cannot set the gradient to 0 and rearrange for \(\beta\). We can overcome this by using a subgradient for the L1 term and running gradient descent to find a solution. The loss function is still convex, so gradient descent converges to a global minimum (and whenever the L2 term is present, i.e. \(\lambda(1-\alpha)>0\), the loss is strictly convex and the minimiser is unique). A (sub)gradient of the loss function is:

\(\nabla_{\beta}\mathcal{L}(\beta)=-\frac{2}{m}X^{T}(Y-X\beta)+\lambda\alpha\operatorname{sgn}(\beta)+2\lambda(1-\alpha)\beta\)
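The sign function here is a choice of subgradient for the absolute value: for a single coefficient, \(\partial|\beta_{j}| = \{\operatorname{sgn}(\beta_{j})\}\) when \(\beta_{j}\neq 0\), and \(\partial|\beta_{j}| = [-1,1]\) when \(\beta_{j}=0\). In the code below we simply take \(\operatorname{sgn}(0)=0\), which is what NumPy's `np.sign` returns.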

When \(\alpha=0\) the explicit solution is (the factor of \(m\) multiplying \(\lambda\) comes from the \(\frac{1}{m}\) in the MSE term):

\(\hat{\beta}=(X^{T}X+m\lambda I)^{-1}X^{T}Y\)
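As a quick numerical sanity check (not a derivation), the sketch below plugs this \(\hat{\beta}\) into the gradient above with \(\alpha=0\) and confirms it vanishes; the shapes and the value of \(\lambda\) are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 5
    X = rng.normal(size=(m, n))
    Y = rng.normal(size=m)
    lam = 0.5  # arbitrary lambda for illustration

    # Closed-form ridge solution: beta_hat = (X^T X + m * lambda * I)^{-1} X^T Y
    beta_hat = np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ Y)

    # Gradient of the loss at beta_hat (alpha = 0) should be (numerically) zero
    grad = -2 / m * X.T @ (Y - X @ beta_hat) + 2 * lam * beta_hat
    print(np.allclose(grad, 0))  # True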

I won't derive the solution here; I'll leave that as an exercise, but it's very similar to the derivation for standard linear regression (if you drop the \(\frac{1}{m}\) and use the plain sum of squared errors instead of the MSE, you recover the more familiar \((X^{T}X+\lambda I)^{-1}X^{T}Y\)). With our gradient we can find the minimum through the usual update rule for gradient descent:

\(\beta^{(t+1)}=\beta^{(t)}-\eta\nabla_{\beta}\mathcal{L}(\beta^{(t)})\)

where \(\eta\) is the learning rate. Coding up our solution (making sure to use the explicit solution when \(\alpha=0\)) we have:


    import numpy as np

    ratio = 0.8     # mixing parameter (alpha in the text): 1 = pure L1, 0 = pure L2
    strength = 0.2  # regularisation strength (lambda in the text)

    def linear_model(X, beta):
        return X @ beta


    def elastic_net_loss(X, y, beta, l1_ratio=ratio, alpha=strength):
        """
        Compute the Elastic Net loss (alpha here is lambda in the text, l1_ratio is alpha):
        L(beta) = (1/m) * ||y - X beta||^2 + alpha * (l1_ratio * ||beta||_1 + (1 - l1_ratio) * ||beta||^2)

        Parameters:
            X : ndarray (n_samples, n_features)
            y : ndarray (n_samples,)
            beta : ndarray (n_features,)
            l1_ratio : float, between 0 and 1 (mixing ratio between L1 and L2)
            alpha : float, regularization strength

        Returns:
            loss : float
        """
        # Residual
        residual = y - X @ beta

        # Mean squared error term (the 1/m factor from the loss above)
        mse_term = np.mean(residual ** 2)

        # L1 and L2 penalties
        l1_term = np.sum(np.abs(beta))
        l2_term = np.sum(beta ** 2)

        # Elastic Net loss
        loss = mse_term + alpha * (l1_ratio * l1_term + (1 - l1_ratio) * l2_term)
        return loss


    def elastic_net(X, y, alpha=strength, l1_ratio=ratio, lr=0.01, epochs=1000):
        """
        Elastic Net Regression using NumPy.
        
        Parameters:
            X : ndarray of shape (n_samples, n_features)
            y : ndarray of shape (n_samples,)
            alpha : float, regularization strength (λ)
            l1_ratio : float, mixing between L1 and L2 regularization (α)
                    - l1_ratio = 0 => Ridge
                    - l1_ratio = 1 => Lasso
            lr : float, learning rate (used only if l1_ratio > 0)
            epochs : int, number of iterations for gradient descent (used only if l1_ratio > 0)

        Returns:
            beta : ndarray of shape (n_features,), estimated coefficients
        """

        m, n = X.shape

        # Special case: if l1_ratio = 0, use the closed-form Ridge solution.
        # The m * alpha factor matches the 1/m in the MSE term of the loss.
        if l1_ratio == 0:
            I = np.eye(n)
            beta = np.linalg.solve(X.T @ X + m * alpha * I, X.T @ y)
            return beta

        # Otherwise, use gradient descent
        beta = np.zeros(n)
        lambda1 = alpha * l1_ratio      # L1 penalty
        lambda2 = alpha * (1 - l1_ratio)  # L2 penalty

        for _ in range(epochs):
            y_pred = X @ beta
            grad_mse = -2 * X.T @ (y - y_pred) / m
            grad_l2 = 2 * lambda2 * beta
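            # np.sign(0) = 0, i.e. we take 0 as the subgradient of |beta_j| at 0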
            grad_l1 = lambda1 * np.sign(beta)
            grad = grad_mse + grad_l2 + grad_l1
            beta -= lr * grad

        return beta
  

In the linked Colab file above we train this model on the wine dataset to predict the alcohol content of the wines.
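As a quick standalone sanity check (separate from the wine example), here is a minimal usage sketch on synthetic data; the true coefficients, noise level, and hyperparameter values below are arbitrary choices for illustration, and it assumes `elastic_net` and `elastic_net_loss` from above are in scope:

    rng = np.random.default_rng(42)
    m, n = 200, 5
    X = rng.normal(size=(m, n))
    true_beta = np.array([2.0, 0.0, -1.5, 0.0, 0.5])  # arbitrary "true" coefficients
    y = X @ true_beta + 0.1 * rng.normal(size=m)      # linear signal plus a little noise

    # Fit with gradient descent (l1_ratio > 0) and inspect the result
    beta_hat = elastic_net(X, y, alpha=0.01, l1_ratio=0.8, lr=0.01, epochs=5000)
    print(beta_hat)  # should sit close to true_beta, pulled slightly towards zero by the penalty
    print(elastic_net_loss(X, y, beta_hat, l1_ratio=0.8, alpha=0.01))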

This is all well and good, but why have we introduced regularisation to our linear regression model? There are a few reasons. The first is to deal with multicollinearity between variables, which makes it difficult to determine the individual effect of each independent variable on the dependent variable and can lead to unreliable coefficient estimates. The other main reason is to introduce sparsity into our models (lots of coefficients that are exactly 0). Sparsity saves memory (we don't need to store the 0s) and reduces the number of operations needed to evaluate the model: in the dot product / matrix multiplication \(X\beta\), every 0 coefficient contributes nothing, so those multiplications can be skipped entirely (a small sketch of this follows the comparison table). Below is a table that compares the different forms of regularisation.

| Property | Ridge Regression | Lasso Regression | Elastic Net |
|---|---|---|---|
| Penalty Type | L2 (\( \lambda \lVert\beta\rVert^{2} \)) | L1 (\( \lambda \lVert\beta\rVert_{1} \)) | L1 + L2 (\( \lambda (\alpha \lVert\beta\rVert_{1} + (1 - \alpha) \lVert\beta\rVert^{2}) \)) |
| Sparsity | Does not produce sparse models | Encourages sparsity (feature selection) | Can produce sparsity depending on \( \alpha \) |
| Multicollinearity Handling | Handles well by shrinking coefficients | Poor handling, may arbitrarily select one feature | Handles better than Lasso due to L2 component |
| When to Use | Many small/medium effects; no need for feature selection | Only a few features matter; need variable selection | Mix of Lasso and Ridge behavior; flexible |
| Hyperparameters | \( \lambda \) | \( \lambda \) | \( \lambda \), \( \alpha \in [0, 1] \) |
| Closed-form Solution | ✅ Yes | ❌ No (requires optimization) | ❌ No (requires optimization) |
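To make the "fewer operations" point from above concrete, here is a small sketch with made-up numbers showing that, once we know which coefficients are zero, prediction only needs the columns of \(X\) matching the non-zero coefficients:

    import numpy as np

    X = np.random.normal(size=(1000, 10))
    beta = np.array([1.2, 0.0, 0.0, -0.7, 0.0, 0.0, 0.0, 3.1, 0.0, 0.0])  # sparse coefficients

    # Dense prediction: multiplies every column, including the ones weighted by 0
    dense_pred = X @ beta

    # Sparse prediction: only touch the columns with non-zero coefficients
    nonzero = np.nonzero(beta)[0]
    sparse_pred = X[:, nonzero] @ beta[nonzero]

    print(np.allclose(dense_pred, sparse_pred))  # True, with 70% fewer multiplications here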

One final thing to note is that these regularisation methods aren't exclusive to linear regression: we can add these penalty terms to the loss function of almost any other model we encounter.