We now extend our logistic regression model to support multiple classes. In this model we can have as many classes as we want (we will denote the number of classes with \(K\)) and we model the probability that a given data point \(\mathbf{x}\) belongs to class \(i\) as follows:

\[ P(y = i \mid \mathbf{x}) = \mathrm{softmax}(W\mathbf{x} + B)_i = \frac{e^{(W\mathbf{x} + B)_i}}{\sum_{k=1}^{K} e^{(W\mathbf{x} + B)_k}} \]
This is the softmax function, and we'll see it a lot when we work on neural networks, as it is generally used as the final layer of networks in classification problems (note that \(W\) is a weight matrix and \(B\) is a bias vector). We then take the \(\mathrm{argmax}\) (the class with the highest probability) as the classification that the model makes.
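To make this concrete, here is a quick illustrative sketch (the logit values are made up) of softmax turning a vector of raw scores into probabilities, with \(\mathrm{argmax}\) picking the predicted class:

import numpy as np

z = np.array([2.0, 1.0, 0.1])          # made-up logits for one data point, K = 3
probs = np.exp(z) / np.sum(np.exp(z))  # softmax
print(probs)                           # ~[0.659, 0.242, 0.099], sums to 1
print(np.argmax(probs))                # 0 -> predicted class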
For this problem we will use the cross entropy loss function. We will use this as we are modelling the probability distribution of the classes conditioned on a given data point. Similarly to BCE, cross entropy measures the average number of bits needed to identify an event drawn from the true distribution when we encode it using the estimated distribution rather than the true one. This is again related to entropy from information theory. Cross entropy is defined as:

\[ H(p, q) = -\sum_{i=1}^{K} p_i \log(q_i) \]

where \(p\) is the true distribution (our one-hot label) and \(q\) is the distribution predicted by the model. Averaging this over all \(m\) training points gives the loss we minimise.
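For example, a quick illustrative calculation (the numbers are made up): if the true class is the second of three and the model predicts probabilities 0.2, 0.7 and 0.1, the cross entropy is \(-\log(0.7) \approx 0.357\).

import numpy as np

y_true = np.array([0.0, 1.0, 0.0])        # one-hot: the true class is class 1
y_pred = np.array([0.2, 0.7, 0.1])        # predicted probabilities
print(-np.sum(y_true * np.log(y_pred)))   # 0.3566...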
We can now implement all of this in NumPy:
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability (doesn't change the result)
    z = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy_loss(y_true, y_pred):
    # y_true and y_pred shape: (m, K)
    m = y_true.shape[0]
    # pick out the predicted probability of the true class for each example;
    # the small constant guards against log(0)
    log_likelihood = -np.log(y_pred[range(m), np.argmax(y_true, axis=1)] + 1e-15)
    return np.mean(log_likelihood)

def one_hot(y, num_classes):
    # turn integer labels of shape (m,) into a one-hot matrix of shape (m, K)
    return np.eye(num_classes)[y]

def softmax_regression(X, y, num_classes, lr=0.1, epochs=1000):
    m, n = X.shape
    y_encoded = one_hot(y, num_classes)

    # Initialize weights and bias
    W = np.zeros((num_classes, n))
    b = np.zeros((num_classes, 1))

    for epoch in range(epochs):
        # Forward pass
        z = X @ W.T + b.T     # shape: (m, K)
        y_pred = softmax(z)   # shape: (m, K)

        # Loss
        if epoch % 100 == 0:
            loss = cross_entropy_loss(y_encoded, y_pred)
            print(f"Epoch {epoch}: Loss = {loss:.4f}")

        # Gradients
        dz = y_pred - y_encoded                    # (m, K)
        dW = (dz.T @ X) / m                        # (K, n)
        db = np.mean(dz, axis=0, keepdims=True).T  # (K, 1)

        # Update
        W -= lr * dW
        b -= lr * db

    return W, b

def predict(X, W, b):
    # return the most probable class for each row of X
    logits = X @ W.T + b.T
    probs = softmax(logits)
    return np.argmax(probs, axis=1)
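A short aside on the gradient step above: when softmax is combined with cross entropy, the derivative of the loss with respect to each logit simplifies to

\[ \frac{\partial L}{\partial z_i} = \hat{y}_i - y_i \]

which is why dz is simply y_pred - y_encoded, and dW and db then follow from the chain rule, averaged over the \(m\) data points.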
In the Colab notebook we then train our model on the iris dataset to predict the flower species from the lengths and widths of the sepals and petals. With a 70-30 train-test split, the accuracy on the test set was 100%.
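For reference, a minimal sketch of that workflow using scikit-learn's built-in copy of the dataset might look like the following (the exact notebook code may differ, and the feature standardisation is an extra step I've assumed to keep gradient descent well behaved):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# standardise the features using statistics from the training set only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

W, b = softmax_regression(X_train, y_train, num_classes=3)
accuracy = np.mean(predict(X_test, W, b) == y_test)
print(f"Test accuracy: {accuracy:.2%}")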
One really interesting real-world use case of this kind of model was by Bill Benter, who used it to predict the outcomes of horse races. These predictions allowed him to make a lot of money betting on those races!