Chapter 2.4: Quadratic Discriminant Analysis

QDA is very similar to LDA, except we now drop the assumption that all classes share the same covariance matrix: each class \(k\) gets its own covariance matrix \(\Sigma_{k}\). As in LDA, we are given a set of samples \(x_{1},...,x_{n}\) and class labels \(y_{1},...,y_{n}\) drawn from K different classes. We again assume that every class-conditional distribution \(p(x|y=k)\) is a multivariate Gaussian (as seen below, where \(|\cdot|\) denotes the determinant and \(d\) is the number of features), and that each class has a prior probability \(\pi_{k}=\frac{n_{k}}{n}\), where \(n_{k}\) is the number of samples from class k.

\(p(x|y=k)=\frac{1}{(2\pi)^{d/2}|\Sigma_{k}|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})\right)\)
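To make the density concrete, here is a minimal sketch that evaluates \(p(x|y=k)\) with NumPy for a hypothetical mean and covariance (the helper gaussian_density and all numbers are purely illustrative, not part of the QDA code later in this chapter):

import numpy as np

def gaussian_density(x, mean, cov):
    """Evaluate the multivariate Gaussian density for a single sample x."""
    d = len(mean)
    diff = x - mean
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(cov) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Illustrative parameters for a single class k
mean_k = np.array([0.0, 1.0])
cov_k = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gaussian_density(x, mean_k, cov_k))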

We can now find the probability of a sample belonging to a class using Bayes' rule:

\(p(y=k|x)=\frac{p(x|y=k)\pi_{k}}{\sum_{j=1}^{K}p(x|y=j)\pi_{j}}\)
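To see the rule in action, here is a tiny sketch with made-up likelihoods and priors for three classes (the numbers are only illustrative):

import numpy as np

# Hypothetical class-conditional densities p(x|y=k) for one sample x
likelihoods = np.array([0.02, 0.10, 0.01])
# Hypothetical class priors pi_k (must sum to 1)
priors = np.array([0.5, 0.3, 0.2])

numerators = likelihoods * priors
posteriors = numerators / numerators.sum()  # p(y=k|x), sums to 1
print(posteriors)             # [0.238..., 0.714..., 0.047...]
print(np.argmax(posteriors))  # class index 1 has the highest posterior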

Since the denominator is the same for all classes, the class with the highest numerator will be chosen (we are maximising the posterior probability). Taking the log of the numerator (as it's easier to work with) we get our discriminant function:

\(\begin{align}\delta_{k}(x)&=\log p(x|y=k)+\log\pi_{k}\\ &=-\frac{1}{2}\log|\Sigma_{k}|-\frac{d}{2}\log(2\pi)-\frac{1}{2}(x-\mu_{k})^T\Sigma_{k}^{-1}(x-\mu_{k})+\log\pi_{k}\end{align}\)

We see that the term \(\frac{d}{2}\log(2\pi)\) is constant, so it is the same for all classes; hence we can drop it from our discriminant function, giving us:

\(\delta_{k}(x)=-\frac{1}{2}\log|\Sigma_{k}|-\frac{1}{2}(x-\mu_{k})^T\Sigma_{k}^{-1}(x-\mu_{k})+\log\pi_{k}\)

This is our discriminant function, and we now see why it's called quadratic discriminant analysis: unlike in LDA, we retain the quadratic term in \(x\). For classification we simply take the class that gives the highest value of the discriminant function.

\(\hat{y}=\arg\max_{k}\delta_{k}(x)\)
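For contrast, if we kept the LDA assumption of a single shared covariance \(\Sigma_{k}=\Sigma\), the terms \(-\frac{1}{2}\log|\Sigma|\) and \(-\frac{1}{2}x^{T}\Sigma^{-1}x\) would be identical for every class and could also be dropped, leaving the familiar linear discriminant:

\(\delta_{k}(x)=x^{T}\Sigma^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}+\log\pi_{k}\)

With class-specific \(\Sigma_{k}\) this cancellation is not possible, which is exactly where the quadratic decision boundaries of QDA come from.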

Now to the coding section of this tutorial. In this code we have a function that fits our model by estimating the class means, covariance matrices and prior probabilities, and a prediction function which plugs these values into the discriminant function and returns the class with the highest score as the prediction.


import numpy as np

def qda_fit(X, y):
    """
    Estimate the parameters for QDA.
    
    Parameters:
        X : numpy.ndarray of shape (n_samples, n_features)
            Feature matrix.
        y : numpy.ndarray of shape (n_samples,)
            Class labels.

    Returns:
        means : dict
            Dictionary mapping class label to class mean vector.
        covariances : dict
            Dictionary mapping class label to class covariance matrix.
        priors : dict
            Dictionary mapping class label to class prior probability.
    """
    classes = np.unique(y)
    means = {}
    covariances = {}
    priors = {}

    for cls in classes:
        X_k = X[y == cls]
        means[cls] = np.mean(X_k, axis=0)
        covariances[cls] = np.cov(X_k, rowvar=False)
        priors[cls] = X_k.shape[0] / X.shape[0]

    return means, covariances, priors


def qda_predict(X, means, covariances, priors):
    """
    Predict class labels using QDA.
    
    Parameters:
        X : numpy.ndarray of shape (n_samples, n_features)
            Feature matrix.
        means : dict
            Dictionary of class mean vectors.
        covariances : dict
            Dictionary of class covariance matrices.
        priors : dict
            Dictionary of class prior probabilities.

    Returns:
        y_pred : numpy.ndarray of shape (n_samples,)
            Predicted class labels.
    """
    n_samples = X.shape[0]
    classes = list(means.keys())
    y_pred = []

    for i in range(n_samples):
        x = X[i]
        scores = {}
        for cls in classes:
            mean = means[cls]
            cov = covariances[cls]
            prior = priors[cls]

            # Compute the quadratic discriminant for this class
            try:
                inv_cov = np.linalg.inv(cov)
            except np.linalg.LinAlgError:
                raise ValueError(f"Covariance matrix for class {cls} is singular.")

            # slogdet is more stable than log(det(cov)) when the determinant is tiny
            _, log_det_cov = np.linalg.slogdet(cov)

            diff = x - mean
            log_likelihood = -0.5 * log_det_cov - 0.5 * diff @ inv_cov @ diff
            log_prior = np.log(prior)
            scores[cls] = log_likelihood + log_prior

        # Predict class with highest score
        y_pred.append(max(scores, key=scores.get))


    return np.array(y_pred)
  

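As a usage sketch, we can fit and evaluate these functions on the iris dataset (this assumes scikit-learn is available for loading and splitting the data; the exact accuracy will depend on the train/test split):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit QDA parameters on the training split
means, covariances, priors = qda_fit(X_train, y_train)

# Predict on the held-out split and compute accuracy
y_pred = qda_predict(X_test, means, covariances, priors)
accuracy = np.mean(y_pred == y_test)
print(f"Test accuracy: {accuracy:.2%}")
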
Training this on the iris dataset we get an improved accuracy over LDA (98% compared to 84.44%). This is because the decision boundaries in QDA are quadratic, so non-linearities in the data are captured.