We are now going to cover a once very popular machine learning algorithm, the support vector machine (SVM), which used to be competitive with neural networks but has since been replaced by them thanks to increases in computing power and better neural network models. It is a binary linear classifier, but it can be extended to multiclass problems: either train an SVM on each pairwise combination of classes and choose the class predicted most often (one-vs-one), or train one SVM per class against all the others and choose the class with the highest score (one-vs-rest, which is what the code below does). Our model with weights and biases is:
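In the usual notation, with \(x\) an input feature vector, \(w\) the weight vector and \(b\) the bias, this is the linear function
\[ f(x)=w^{T}x+b. \]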
As it is a binary classifier we will have two outputs:
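The two class labels are taken to be \(y\in\{-1,+1\}\), and the set of points where the model output is zero,
\[ w^{T}x+b=0, \]
forms a hyperplane in the input space.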
This hyperplane describes a decision boundary between the classes. In SVM we attempt to find the "maximum margin hyperplane", i.e. the hyperplane whose distance to the closest point from either class is maximised. We call the gap between these closest points the margin, and its width is \(\frac{2}{\Vert w\Vert}\) (so each closest point lies a distance \(\frac{1}{\Vert w\Vert}\) from the hyperplane). So our classification rule is:
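In its standard form, we require every training point to lie on the correct side of the margin:
\[ w^{T}x_{i}+b\geq +1 \;\text{ if } y_{i}=+1, \qquad w^{T}x_{i}+b\leq -1 \;\text{ if } y_{i}=-1. \]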
We can rewrite this decision rule as:
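Multiplying through by the label combines the two cases into a single constraint:
\[ y_{i}\left(w^{T}x_{i}+b\right)\geq 1 \quad \text{for all } i. \]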
The above covers the case where the data can be separated by a hyperplane (linearly separable data). In practice this is unlikely to occur, so we extend the model to allow violations of the margin, with an error term measuring how far each offending point lies on the wrong side of it.
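Introducing a slack variable \(\zeta_{i}\geq 0\) for each training point relaxes the constraint to
\[ y_{i}\left(w^{T}x_{i}+b\right)\geq 1-\zeta_{i}, \qquad \zeta_{i}\geq 0. \]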
Here \(\zeta_{i}\) is our error term (the slack): if \(\zeta_{i}=0\) the point is correctly classified and outside the margin, if \(0<\zeta_{i}<1\) it is correctly classified but inside the margin, and if \(\zeta_{i}>1\) it is incorrectly classified. From this we can find the parameters of the model by solving the following optimisation problem:
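In its standard soft-margin form this is
\[ \min_{w,\,b,\,\zeta}\;\frac{1}{2}\Vert w\Vert^{2}+C\sum_{i=1}^{n}\zeta_{i} \quad \text{subject to} \quad y_{i}\left(w^{T}x_{i}+b\right)\geq 1-\zeta_{i},\;\; \zeta_{i}\geq 0. \]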
This is known as the primal formulation of the optimisation problem. The sum penalises violations, and \(C\) is a regularisation parameter that controls the "hardness" of the margin (large \(C\) tolerates fewer margin violations, small \(C\) tolerates more). Another trick we can use to help with data that isn't linearly separable is the kernel trick: we map our data to a higher dimension via a function chosen so that the data becomes linearly separable in that higher dimension. In general our kernel function is defined as a dot product of a function \(\phi\) applied to the data:
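That is, for any two points \(x\) and \(x'\),
\[ K(x,x')=\phi(x)^{T}\phi(x'), \]
so the mapping \(\phi\) never has to be computed explicitly as long as we can evaluate \(K\).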
Some common kernels are defined below:
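These are the three kernels used in the code further down (linear, polynomial and RBF/Gaussian):
\[ K(x,x')=x^{T}x', \qquad K(x,x')=\left(x^{T}x'+c\right)^{d}, \qquad K(x,x')=\exp\!\left(-\gamma\Vert x-x'\Vert^{2}\right), \]
where \(c\) (coef0), \(d\) (degree) and \(\gamma\) (gamma) are kernel hyperparameters.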
In order to apply the kernel trick we apply the kernel to the dual optimisation problem (an equivalent problem derived from the Lagrangian, where the \(\alpha_{i}\) are Lagrange multipliers). Also note that this kernel trick can be applied to any linear classifier that depends on the data only through dot products.
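In its standard form the kernelised dual is
\[ \max_{\alpha}\;\sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}K(x_{i},x_{j}) \quad \text{subject to} \quad 0\leq\alpha_{i}\leq C,\;\; \sum_{i=1}^{n}\alpha_{i}y_{i}=0. \]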
We also now see what the support vectors are: they are the points that touch the margin, i.e. \(y_{i}(w^{T}x_{i}+b)=1\), and they correspond to the non-zero Lagrange multipliers (\(\alpha_{i}\neq0\)). From this we can see that our weights and bias are found through:
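Writing \(S\) for the set of support vector indices, the standard recovery is
\[ w=\sum_{i\in S}\alpha_{i}y_{i}x_{i}, \qquad b=y_{k}-w^{T}x_{k} \;\text{ for any support vector } x_{k} \text{ with } 0<\alpha_{k}<C. \]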
The decision function also becomes:
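Replacing the dot products with kernel evaluations, the prediction for a point \(z\) is
\[ f(z)=\operatorname{sign}\!\left(\sum_{i\in S}\alpha_{i}y_{i}K(x_{i},z)+b\right). \]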
Note here that \(x_{i}\) are the support vectors and \(z\) is the new point we want to classify. There are numerous methods for solving either the primal or dual form of SVM (most notably quadratic programming methods); however, we will focus specifically on PEGASOS, a sub-gradient method for solving the primal form. We will first look at the linear case, which in PEGASOS is defined slightly differently from what we saw above:
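PEGASOS minimises the regularised hinge loss (note that the explicit bias term is dropped):
\[ \min_{w}\;\frac{\lambda}{2}\Vert w\Vert^{2}+\frac{1}{n}\sum_{i=1}^{n}\max\!\left(0,\;1-y_{i}\,w^{T}x_{i}\right). \]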
Here \(\lambda\) is the regularisation parameter (the inverse of \(C\) in the typical formulation) and the second term is known as the hinge loss (the loss function used in this formulation of SVM). The algorithm runs for \(T\) iterations (\(t=1,2,\ldots,T\)); at each iteration we randomly sample an index \(i_{t}\) (or a small mini-batch) and use it for a single step of PEGASOS. As this is a form of stochastic gradient descent we need a learning rate, which is set to \(\eta_{t}=\frac{1}{\lambda t}\). Our subgradient is:
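Approximating the loss with the sampled point, the subgradient at step \(t\) is
\[ \nabla_{t}=\lambda w_{t}-\mathbb{1}\!\left[y_{i_{t}}\,w_{t}^{T}x_{i_{t}}<1\right]\,y_{i_{t}}x_{i_{t}}, \]
where the indicator \(\mathbb{1}[\cdot]\) equals 1 when the sampled point violates the margin and 0 otherwise.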
We then update \(w\) with the usual (sub)gradient descent step:
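Substituting the subgradient and \(\eta_{t}=\frac{1}{\lambda t}\), the step is
\[ w_{t+1}=w_{t}-\eta_{t}\nabla_{t}=\left(1-\tfrac{1}{t}\right)w_{t}+\eta_{t}\,\mathbb{1}\!\left[y_{i_{t}}\,w_{t}^{T}x_{i_{t}}<1\right]\,y_{i_{t}}x_{i_{t}}. \]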
There is also an optional step that prevents the norm of \(w\) from blowing up (it constrains \(w\) so that we can ensure convergence):
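This projection step, as given in the original PEGASOS paper, pulls \(w\) back onto the ball of radius \(1/\sqrt{\lambda}\):
\[ w_{t+1}\leftarrow\min\!\left(1,\;\frac{1/\sqrt{\lambda}}{\Vert w_{t+1}\Vert}\right)w_{t+1}. \]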
As we want our algorithm to handle kernel functions, we extend it somewhat towards the dual formulation (which is able to work with the kernel function directly). Through the chain rule our update becomes:
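In the kernelised version \(w\) is never formed explicitly. Instead, as in the code below, we keep a set \(S\) of support vectors with multipliers \(\alpha_{i}\) and evaluate the decision function
\[ f(x)=\sum_{i\in S}\alpha_{i}y_{i}K(x_{i},x), \]
so the update modifies \(S\) and the \(\alpha_{i}\) rather than \(w\).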
Now at each step we evaluate the decision function and check the hinge-loss condition \(y_{t}\,f(x_{t})<1\); if it holds we add \(x_{t}\) to the support vector set (the point lies within the margin or is misclassified). It can also be shown that in this case the Lagrange multiplier is \(\alpha_{t}=\frac{1}{\lambda t}\). So, since we are looking at the dual problem, instead of updating our weights we update our support vector set and our Lagrange multipliers at each step. If we look back at our decision function, we see that is all we have to find. We now code this up:
import numpy as np

# ---------------------------
# Kernel functions
# ---------------------------
def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3, coef0=1):
    return (np.dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.05):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

# ---------------------------
# PEGASOS Kernel SVM (Binary)
# ---------------------------
class PegasosKernelSVM:
    def __init__(self, kernel='rbf', lambda_param=0.01, max_iter=1000, batch_size=1, **kernel_params):
        self.lambda_param = lambda_param
        self.max_iter = max_iter
        self.batch_size = batch_size
        self.kernel_name = kernel
        self.kernel_params = kernel_params
        self.kernel = self._get_kernel(kernel)
        self.support_vectors = []
        self.support_labels = []
        self.alphas = []

    def _get_kernel(self, name):
        if name == 'linear':
            return linear_kernel
        elif name == 'polynomial':
            return lambda x, y: polynomial_kernel(x, y, **self.kernel_params)
        elif name == 'rbf':
            return lambda x, y: rbf_kernel(x, y, **self.kernel_params)
        else:
            raise ValueError(f"Unknown kernel: {name}")

    def fit(self, X, y):
        n_samples = X.shape[0]
        for t in range(1, self.max_iter + 1):
            # Sample a random mini-batch for this iteration
            idx = np.random.choice(n_samples, self.batch_size, replace=False)
            X_batch = X[idx]
            y_batch = y[idx]
            for xi, yi in zip(X_batch, y_batch):
                # Compute decision function over the current support vector set
                if len(self.support_vectors) == 0:
                    decision = 0
                else:
                    decision = sum(
                        alpha * y_sv * self.kernel(sv, xi)
                        for alpha, sv, y_sv in zip(self.alphas, self.support_vectors, self.support_labels)
                    )
                # Margin check: add the point as a support vector if it violates the margin
                if yi * decision < 1:
                    alpha_t = 1 / (self.lambda_param * t)
                    self.support_vectors.append(xi)
                    self.support_labels.append(yi)
                    self.alphas.append(alpha_t)

    def project(self, X):
        # Decision function values for a batch of points
        result = np.zeros(X.shape[0])
        for i, x in enumerate(X):
            result[i] = sum(
                alpha * y_sv * self.kernel(sv, x)
                for alpha, sv, y_sv in zip(self.alphas, self.support_vectors, self.support_labels)
            )
        return result

    def predict(self, X):
        return np.sign(self.project(X))

# ---------------------------
# One-vs-rest multiclass wrapper
# ---------------------------
class MultiClassPegasosSVM:
    def __init__(self, kernel='rbf', lambda_param=0.01, max_iter=1000, batch_size=1, **kernel_params):
        self.kernel = kernel
        self.lambda_param = lambda_param
        self.max_iter = max_iter
        self.batch_size = batch_size
        self.kernel_params = kernel_params
        self.classifiers = {}

    def fit(self, X, y):
        self.classes = np.unique(y)
        for cls in self.classes:
            # Convert to binary problem: +1 vs -1
            y_binary = np.where(y == cls, 1, -1)
            clf = PegasosKernelSVM(
                kernel=self.kernel,
                lambda_param=self.lambda_param,
                max_iter=self.max_iter,
                batch_size=self.batch_size,
                **self.kernel_params
            )
            clf.fit(X, y_binary)
            self.classifiers[cls] = clf

    def predict(self, X):
        scores = {}
        for cls, clf in self.classifiers.items():
            scores[cls] = clf.project(X)
        # Pick class with highest score
        score_matrix = np.vstack([scores[c] for c in self.classes])
        predictions = self.classes[np.argmax(score_matrix, axis=0)]
        return predictions
Training this on the iris dataset with a Gaussian (RBF) kernel we get 100% accuracy on our test set.
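For reference, here is a minimal sketch of how such a run might look, assuming scikit-learn is available for loading iris and making a train/test split; the split, seed and kernel hyperparameter below are illustrative rather than the exact values used for the result above.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load iris and hold out a test set (split and seed are illustrative)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Standardise features so a single gamma suits all dimensions
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# One-vs-rest PEGASOS SVM with the Gaussian (RBF) kernel (gamma chosen for illustration)
model = MultiClassPegasosSVM(kernel='rbf', lambda_param=0.01, max_iter=1000, gamma=0.5)
model.fit(X_train, y_train)

# Report test accuracy
accuracy = np.mean(model.predict(X_test) == y_test)
print(f"Test accuracy: {accuracy:.2%}")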