Chapter 2.2: Softmax Regression

We now extend our logistic regression model to support multiple classes. In this model we can have as many classes as we want (we will denote the number of classes with \(K\)) and we model the probability that a given data point belongs to class \(i\) as follows:

\(\hat{y}_{i}=P(y=i|x)=\frac{e^{z_{i}}}{\sum_{k=1}^{K}e^{z_{k}}}\), where \(i\in\{1,\dots,K\}\), \(Z=WX+B\) and \(z_{k}=w_{k}^{T}x+b_{k}\) (with \(w_{k}\) the \(k\)-th row of \(W\))

This is the softmax function, and we'll see it a lot when we work on neural networks, as it is generally used as the final layer in classification problems (note that \(W\) is a weight matrix and \(B\) is a bias vector). At prediction time we take the \(\mathrm{argmax}\) (the class with the highest probability) as the model's classification.
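For example, for a single data point with logits \(z=(2,\,1,\,0.1)\) the model outputs

\(\hat{y}=\mathrm{softmax}(z)=\frac{1}{e^{2}+e^{1}+e^{0.1}}\left(e^{2},\,e^{1},\,e^{0.1}\right)\approx(0.659,\,0.242,\,0.099)\)

and the \(\mathrm{argmax}\) picks class 1.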

For this problem we will use the cross entropy loss function. We use it because we are modelling the probability distribution over classes conditioned on a given data point. Similarly to BCE, cross entropy measures the average number of bits needed to identify an event drawn from the true distribution when it is encoded using the estimated distribution rather than the true one. This is again related to entropy from information theory. Cross entropy is defined as:

\(\mathcal{L}(Z)=-\sum_{i=1}^{K}y_{i}\log(\hat{y}_{i})\)
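For example, if the true class is 3, so the one-hot label is \(y=(0,\,0,\,1)\), and the model predicts \(\hat{y}=(0.1,\,0.2,\,0.7)\), then \(\mathcal{L}=-\log(0.7)\approx 0.357\); the loss shrinks towards zero as the predicted probability of the true class approaches 1.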
Now we can do the usual and take the gradients to get:
\(\begin{aligned}\nabla_{Z}\mathcal{L}(Z)&=\hat{Y}-Y\\ \nabla_{W}\mathcal{L}(Z)&=(\hat{Y}-Y)X^{T}\\ \nabla_{B}\mathcal{L}(Z)&=\hat{Y}-Y\end{aligned}\)
A derivation of the gradient with respect to \(Z\) can be found here. Recall that \(Z=WX+B\), so the gradients with respect to \(W\) and \(B\) follow from the chain rule:
\(\begin{aligned}\frac{\partial\mathcal{L}}{\partial W}&=\frac{\partial\mathcal{L}}{\partial Z}\frac{\partial Z}{\partial W}=(\hat{Y}-Y)X^{T}\\ \frac{\partial\mathcal{L}}{\partial B}&=\frac{\partial\mathcal{L}}{\partial Z}\frac{\partial Z}{\partial B}=\hat{Y}-Y\end{aligned}\)
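As a quick sanity check (an illustrative sketch, not part of the model itself), we can compare the analytic gradient \(\nabla_{Z}\mathcal{L}=\hat{Y}-Y\) against a central finite-difference estimate for a single example:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 5
    z = rng.normal(size=K)   # random logits for one example
    y = np.eye(K)[2]         # one-hot label: true class is 2

    def loss(z):
        p = np.exp(z - z.max())        # softmax with the usual stability shift
        p = p / p.sum()
        return -np.sum(y * np.log(p))  # cross entropy for this example

    # analytic gradient: softmax(z) - y
    p = np.exp(z - z.max())
    analytic = p / p.sum() - y

    # central finite differences, one coordinate at a time
    eps = 1e-6
    numeric = np.array([
        (loss(z + eps * np.eye(K)[i]) - loss(z - eps * np.eye(K)[i])) / (2 * eps)
        for i in range(K)
    ])

    print(np.allclose(analytic, numeric, atol=1e-6))  # should print True
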
Now that the gradients are known we have to find the minimum. Setting them to zero yields no closed-form solution, so we again use the gradient descent algorithm:
\(\begin{aligned}W^{(t+1)}&=W^{(t)}-\eta\nabla_{W}\mathcal{L}(Z^{(t)})\\ B^{(t+1)}&=B^{(t)}-\eta\nabla_{B}\mathcal{L}(Z^{(t)})\end{aligned}\)

We can now implement this directly in NumPy:


    import numpy as np

    def softmax(z):
        # subtract the row-wise max for numerical stability; this does not change
        # the result and avoids mutating the caller's array in place
        z = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(z)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def cross_entropy_loss(y_true, y_pred):
        # y_true and y_pred shape: (m, K)
        m = y_true.shape[0]
        # pick out the predicted probability of the true class for each example
        log_likelihood = -np.log(y_pred[range(m), np.argmax(y_true, axis=1)] + 1e-15)
        return np.mean(log_likelihood)

    def one_hot(y, num_classes):
        return np.eye(num_classes)[y]

    def softmax_regression(X, y, num_classes, lr=0.1, epochs=1000):
        m, n = X.shape
        y_encoded = one_hot(y, num_classes)

        # Initialize weights and bias
        W = np.zeros((num_classes, n))
        b = np.zeros((num_classes, 1))

        for epoch in range(epochs):
            # Forward pass
            z = X @ W.T + b.T      # shape: (m, K); here rows of X are examples
            y_pred = softmax(z)    # shape: (m, K)

            # Loss
            if epoch % 100 == 0:
                loss = cross_entropy_loss(y_encoded, y_pred)
                print(f"Epoch {epoch}: Loss = {loss:.4f}")

            # Gradients
            dz = y_pred - y_encoded                # (m, K)
            dW = (dz.T @ X) / m                    # (K, n)
            db = np.mean(dz, axis=0, keepdims=True).T  # (K, 1)

            # Update
            W -= lr * dW
            b -= lr * db

        return W, b

    def predict(X, W, b):
        logits = X @ W.T + b.T
        probs = softmax(logits)
        return np.argmax(probs, axis=1)
  

In the Colab notebook we then train our model on the Iris dataset to predict the flower species from the lengths and widths of the sepals and petals. The accuracy on the test set was 100% when a 70-30 train-test split was used.
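A rough sketch of that experiment, using scikit-learn only to load and split the data (the notebook's exact seed, learning rate and epoch count may differ), looks like this:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=0)

    W, b = softmax_regression(X_train, y_train, num_classes=3, lr=0.1, epochs=1000)
    preds = predict(X_test, W, b)
    print(f"Test accuracy: {np.mean(preds == y_test):.2%}")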

One really interesting real-world use of this kind of model was by Bill Benter, who used it to predict the outcomes of horse races. These predictions allowed him to make a lot of money betting on them!