Naive Bayes

by Justin Skycak on January 08, 2022

A simple classification algorithm grounded in Bayesian probability.

This post is part of the book Introduction to Algorithms and Machine Learning: from Sorting to Strategic Agents. Suggested citation: Skycak, J. (2022). Naive Bayes. In Introduction to Algorithms and Machine Learning: from Sorting to Strategic Agents. https://justinmath.com/naive-bayes/

Naive Bayes is a classification algorithm that is grounded in Bayesian probability. It involves choosing the class with the highest conditional probability given the corresponding data point, assuming that all the variables in the data set are independent from each other.

Deriving the Formula

The derivation of the main formula is shown below: we apply Bayes’ formula, discard the denominator $P(\text{point})$ since it doesn’t depend on the class, and then express $P(\text{point} \, \vert \, \text{class})$ as a product since we’re assuming that the variables in the point are independent.

$\begin{align*} \text{class} &= \underset{\text{class}}{\arg\max} \, \left[ P(\text{class} \, | \, \text{point}) \right] \\[5pt] &= \underset{\text{class}}{\arg\max} \, \left[ \dfrac{P(\text{point} \, | \, \text{class}) P(\text{class})}{P(\text{point})} \right] \\[5pt] &= \underset{\text{class}}{\arg\max} \, \left[ P(\text{point} \, | \, \text{class}) P(\text{class}) \right] \\[5pt] &= \underset{\text{class}}{\arg\max} \, \left[ \prod\limits_{\text{variables}} P(\text{variable} \, | \, \text{class}) P(\text{class}) \right] \\[5pt] &= \underset{\text{class}}{\arg\max} \, \left[ P(\text{class}) \prod\limits_{\text{variables}} P(\text{variable} \, | \, \text{class}) \right] \end{align*}$

The quantities in the final expression can be computed directly from our data set:

$P(\text{class})$ is the number of records in the class, divided by the total number of records.
$P(\text{variable} \, \vert \, \text{class})$ is the number of records in the class that have a matching variable value, divided by the total number of records in the class.

Example: Spam Detection

Let’s walk through a simple concrete example in which we apply the naive Bayes algorithm to the task of spam detection. Spam detection is the canonical example for naive Bayes because it was one of the first commercial successes of naive Bayes.

Suppose that you go through $10$ emails in your inbox and keep track of whether each email was a scam, along with whether it contained grammatical errors or links to other websites.

$\begin{align*} \begin{matrix} \textrm{Scam} & \textrm{Errors} & \textrm{Links} \\ \hline \textrm{No} & \textrm{No} & \textrm{No} \\ \textrm{Yes} & \textrm{Yes} & \textrm{Yes} \\ \textrm{Yes} & \textrm{Yes} & \textrm{Yes} \\ \textrm{No} & \textrm{No} & \textrm{No} \\ \textrm{No} & \textrm{No} & \textrm{Yes} \\ \textrm{Yes} & \textrm{Yes} & \textrm{Yes} \\ \textrm{No} & \textrm{Yes} & \textrm{No} \\ \textrm{No} & \textrm{No} & \textrm{Yes} \\ \textrm{Yes} & \textrm{Yes} & \textrm{No} \\ \textrm{No} & \textrm{No} & \textrm{Yes} \end{matrix} \end{align*}$

Now, you look at $4$ new emails. We can use the naive Bayes algorithm to predict whether each of these emails is a scam, based on whether it contains errors or links.

$\begin{align*} \begin{matrix} \textrm{Scam} & \textrm{Errors} & \textrm{Links} \\ \hline \textrm{?} & \textrm{No} & \textrm{No} \\ \textrm{?} & \textrm{Yes} & \textrm{Yes} \\ \textrm{?} & \textrm{Yes} & \textrm{No} \\ \textrm{?} & \textrm{No} & \textrm{Yes} \end{matrix} \end{align*}$

First, let’s consider the email with no errors and no links. We’ll start by writing down the naive Bayes classification formula for this specific situation:

$\begin{align*} \text{class} &= \underset{\text{class}}{\arg\max} \, \left[ P(\text{class}) \prod\limits_{\text{variables}} P(\text{variable} \, | \, \text{class}) \right] \\[5pt] &= \underset{\text{class}}{\arg\max} \, \left[ P(\text{class}) P(\text{no errors} \, | \, \text{class}) P(\text{no links} \, | \, \text{class}) \right] \\[5pt] \end{align*}$

So, we have two quantities to compare:

$\begin{align*} & P(\text{scam}) P(\text{no errors} \, | \, \text{scam}) P(\text{no links} \, | \, \text{scam}) \\[5pt] & \textrm{vs} \\[5pt] & P(\text{no scam}) P(\text{no errors} \, | \, \text{no scam}) P(\text{no links} \, | \, \text{no scam}) \end{align*}$

Let’s compute each of the probabilities in the above quantities:

Of the $10$ emails in the original data set, $4$ are scams and $6$ are not scams. Therefore, we have $\begin{align*} P(\text{scam}) = \dfrac{4}{10}, \quad P(\text{no scam}) = \dfrac{6}{10}. \end{align*}$
Of the $4$ emails in the original data set that are scams, $0$ have no errors and $1$ has no links. Therefore, we have $\begin{align*} P(\text{no errors} \, | \, \text{scam}) = \dfrac{0}{4}, \quad P(\text{no links} \, | \, \text{scam}) = \dfrac{1}{4}. \end{align*}$
Of the $6$ emails in the original data set that are not scams, $5$ have no errors and $3$ have no links. Therefore, we have $\begin{align*} P(\text{no errors} \, | \, \text{no scam}) = \dfrac{5}{6}, \quad P(\text{no links} \, | \, \text{no scam}) = \dfrac{3}{6}. \end{align*}$

Substituting these probabilities into the $2$ quantities we wish to evaluate, we get

$\begin{align*} & P(\text{scam}) P(\text{no errors} \, | \, \text{scam}) P(\text{no links} \, | \, \text{scam}) \\[5pt] &= \dfrac{4}{10} \cdot \dfrac{0}{4} \cdot \dfrac{1}{4} \\[5pt] &= 0 \\ \\ & \textrm{vs} \\ \\ & P(\text{no scam}) P(\text{no errors} \, | \, \text{no scam}) P(\text{no links} \, | \, \text{no scam}) \\[5pt] &= \dfrac{6}{10} \cdot \dfrac{5}{6} \cdot \dfrac{3}{6} \\[5pt] &= \dfrac{1}{4}. \end{align*}$

The “no scam” quantity gave us a greater value, so we predict that the email with no errors and no links is not a scam.

$\begin{align*} \begin{matrix} \underset{\textrm{Scam vs No Scam}}{\textrm{Quantity}} & \textrm{Scam} & \textrm{Errors} & \textrm{Links} \\ \hline 0 \,\, \textrm{vs} \,\, \frac{1}{4} & \textrm{No} & \textrm{No} & \textrm{No} \\ & \textrm{?} & \textrm{Yes} & \textrm{Yes} \\ & \textrm{?} & \textrm{Yes} & \textrm{No} \\ & \textrm{?} & \textrm{No} & \textrm{Yes} \end{matrix} \end{align*}$

We can predict the next row in the same way:

$\begin{align*} & P(\text{scam}) P(\text{errors} \, | \, \text{scam}) P(\text{links} \, | \, \text{scam}) \\[5pt] &= \dfrac{4}{10} \cdot \dfrac{4}{4} \cdot \dfrac{3}{4} \\[5pt] &= \dfrac{3}{10} \\[5pt] & \textrm{vs} \\ \\ & P(\text{no scam}) P(\text{errors} \, | \, \text{no scam}) P(\text{links} \, | \, \text{no scam}) \\[5pt] &= \dfrac{6}{10} \cdot \dfrac{1}{6} \cdot \dfrac{3}{6} \\[5pt] &= \dfrac{1}{20}. \end{align*}$

This time, the “scam” quantity gave us a greater value, so we predict that the email with errors and links is a scam.

$\begin{align*} \begin{matrix} \underset{\textrm{Scam vs No Scam}}{\textrm{Quantity}} & \textrm{Scam} & \textrm{Errors} & \textrm{Links} \\ \hline 0 \,\, \textrm{vs} \,\, \frac{1}{4} & \textrm{No} & \textrm{No} & \textrm{No} \\ \frac{3}{10} \,\, \textrm{vs} \,\, \frac{1}{20} & \textrm{Yes} & \textrm{Yes} & \textrm{Yes} \\ & \textrm{?} & \textrm{Yes} & \textrm{No} \\ & \textrm{?} & \textrm{No} & \textrm{Yes} \end{matrix} \end{align*}$

If we apply the naive Bayes algortithm to the remaining rows, we get the following results:

$\begin{align*} \begin{matrix} \underset{\textrm{Scam vs No Scam}}{\textrm{Quantity}} & \textrm{Scam} & \textrm{Errors} & \textrm{Links} \\ \hline 0 \,\, \textrm{vs} \,\, \frac{1}{4} & \textrm{No} & \textrm{No} & \textrm{No} \\ \frac{3}{10} \,\, \textrm{vs} \,\, \frac{1}{20} & \textrm{Yes} & \textrm{Yes} & \textrm{Yes} \\ \frac{1}{10} \,\, \textrm{vs} \,\, \frac{1}{20} & \textrm{Yes} & \textrm{Yes} & \textrm{No} \\ 0 \,\, \textrm{vs} \,\, \frac{1}{4} & \textrm{No} & \textrm{No} & \textrm{Yes} \end{matrix} \end{align*}$

Finally, note that in the event of a tie (i.e. both quantities give the same value), it is common to choose the class that occurred most frequently in the data set.

Exercise

Implement the example that was worked out above.