Chapter 2.6: Naive Bayes

The naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem. Since we are solving a classification problem, we use Bayes' theorem in the following form:

\(P(C|X)=\frac{P(X|C)P(C)}{P(X)}\)

Here \(X\) is the data we observe and \(C\) is the class. \(P(C)\) is the prior probability of class \(C\), i.e. the number of training instances of class \(C\) divided by the total number of training instances (the same as \(\pi\) in discriminant analysis). We also make the naive assumption (hence the name of the classifier) that the features are conditionally independent given the class:

\(P(X_{1},X_{2},...,X_{n}|C)=P(X_{1}|C)P(X_{2}|C)...P(X_{n}|C)\)

And hence the classifier becomes:

\(P(C|X)\propto P(C)\prod_{i=1}^{n}P(X_{i}|C)\)

Note that \(P(X)\) is the same for every class, so we can ignore it, and we predict the class that produces the highest value of the expression above. Depending on the type of data we are classifying, \(P(X_{i}|C)\) can be modelled in different ways. Three common choices are given below: Bernoulli, multinomial and Gaussian respectively.

\(\begin{align}\text{Bernoulli:}\quad P(X_{i}=1|C)&=\frac{\text{number of samples in class }C\text{ where }X_{i}=1}{\text{total number of samples in class }C}\\ \text{Multinomial:}\quad P(X_{i}|C)&=\frac{\text{count}(X_{i}\text{ in class }C)}{\sum_{j=1}^{n}\text{count}(X_{j}\text{ in class }C)}\\ \text{Gaussian:}\quad P(X_{i}|C)&=\frac{1}{\sqrt{2\pi\sigma_{C,i}^{2}}}\exp\left(-\frac{(X_{i}-\mu_{C,i})^{2}}{2\sigma_{C,i}^{2}}\right)\end{align}\)

Note here that \(\mu_{C,i}\) and \(\sigma_{C,i}^{2}\) are the mean and variance of feature \(i\) within class \(C\). Also note that for the Bernoulli model \(P(X_{i}=0|C)=1-P(X_{i}=1|C)\). In the implementations below we additionally apply Laplace (add-one) smoothing to the Bernoulli and multinomial estimates so that a feature never seen in a class does not force the whole product to zero (for example, a feature present in 3 of 4 class-\(C\) samples gets probability \((3+1)/(4+2)=2/3\) rather than \(3/4\)).
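In practice, multiplying many small per-feature probabilities can underflow floating-point arithmetic, so the implementations below work with log probabilities. Taking the logarithm of the expression above (a monotonic transformation, so the arg max is unchanged) gives the decision rule

\(\hat{C}=\arg\max_{C}\left[\log P(C)+\sum_{i=1}^{n}\log P(X_{i}|C)\right]\)

We now code up each of the classifiers below: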


import numpy as np

# NaiveBayesBase is the common base class assumed to be defined earlier in the chapter;
# each subclass implements fit() and predict().

# ======== GAUSSIAN NAIVE BAYES =========
class GaussianNB_Scratch(NaiveBayesBase):
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {}
        self.mean = {}
        self.var = {}
        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = X_c.shape[0] / X.shape[0]
            self.mean[c] = np.mean(X_c, axis=0)
            self.var[c] = np.var(X_c, axis=0)

    def gaussian_prob(self, x, mean, var):
        # Per-feature Gaussian density; eps guards against zero variance
        eps = 1e-9
        coeff = 1.0 / np.sqrt(2.0 * np.pi * var + eps)
        exponent = -((x - mean)**2) / (2 * var + eps)
        return coeff * np.exp(exponent)

    def predict(self, X):
        preds = []
        for x in X:
            posteriors = {}
            for c in self.classes:
                prior = np.log(self.priors[c])
                likelihoods = np.log(self.gaussian_prob(x, self.mean[c], self.var[c]))
                posteriors[c] = prior + np.sum(likelihoods)
            preds.append(max(posteriors, key=posteriors.get))
        return np.array(preds)

# ======== MULTINOMIAL NAIVE BAYES =========
class MultinomialNB_Scratch(NaiveBayesBase):
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {}
        self.likelihood = {}

        self.vocab_size = X.shape[1]

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = X_c.shape[0] / X.shape[0]
            word_counts = np.sum(X_c, axis=0)
            self.likelihood[c] = (word_counts + 1) / (np.sum(word_counts) + self.vocab_size)  # Laplace smoothing

    def predict(self, X):
        preds = []
        for x in X:
            posteriors = {}
            for c in self.classes:
                prior = np.log(self.priors[c])
                likelihoods = np.sum(x * np.log(self.likelihood[c]))
                posteriors[c] = prior + likelihoods
            preds.append(max(posteriors, key=posteriors.get))
        return np.array(preds)

# ======== BERNOULLI NAIVE BAYES =========
class BernoulliNB_Scratch(NaiveBayesBase):
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors = {}
        self.feature_probs = {}

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = X_c.shape[0] / X.shape[0]
            feature_presence = np.sum(X_c, axis=0)
            self.feature_probs[c] = (feature_presence + 1) / (X_c.shape[0] + 2)  # Laplace smoothing

    def predict(self, X):
        preds = []
        for x in X:
            posteriors = {}
            for c in self.classes:
                prior = np.log(self.priors[c])
                likelihoods = x * np.log(self.feature_probs[c]) + (1 - x) * np.log(1 - self.feature_probs[c])
                posteriors[c] = prior + np.sum(likelihoods)
            preds.append(max(posteriors, key=posteriors.get))
        return np.array(preds)
  

We then train these models on the Iris dataset (Gaussian), a tiny word-count text dataset (multinomial) and a tiny binary text dataset (Bernoulli), obtaining accuracies of 97.7%, 100% and 100% respectively.
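As a minimal sketch of how the Gaussian variant might be trained and evaluated (assuming scikit-learn is available for loading Iris and splitting the data; the exact accuracy will depend on the split):

# Usage sketch: Gaussian naive Bayes from scratch on Iris.
# Assumes the classes above (and NaiveBayesBase) are already defined;
# scikit-learn is used only for the dataset and the train/test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gnb = GaussianNB_Scratch()
gnb.fit(X_train, y_train)
accuracy = np.mean(gnb.predict(X_test) == y_test)
print(f"Gaussian NB accuracy: {accuracy:.3f}")

# The multinomial and Bernoulli variants are used the same way, e.g. on a
# (hypothetical) document-term count matrix X_counts or binary matrix X_binary:
#   MultinomialNB_Scratch().fit(X_counts, y_docs)
#   BernoulliNB_Scratch().fit(X_binary, y_docs)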