Breaking Down Bayes' Theorem

Dylan | Jun 17, 2019

Bayes’ theorem is perhaps the most important theorem in all of Probability Theory. It is named after Reverend Thomas Bayes, who used conditional probability to make inferences about the probability of uncertain events taking place.

Today, Bayes’ theorem underpins the myriad Bayesian Algorithms routinely used in Data Science. It also provides the basis for lightweight yet powerful Bayesian Inference. Without further delay, here’s the theorem.

Bayes’ Theorem

P(A|B) = P(B|A) × P(A) / P(B)

If the formula above makes your head spin, take a deep breath and sit back. We’ll break it down as we go. Let’s start with the first part, P(A|B).

When spoken, this is read as "the probability of (event) A given (event) B (has occurred)". In other words, if we know or assume event B has taken place, what is the probability that event A has also occurred? Formally, this expression is referred to as the posterior.

For a simple illustration, let’s pretend that event A = has long hair and event B = is a female, which gives us P(has long hair | is a female). In this case, we’re interested in finding the probability of an individual having long hair provided that the individual is a female.


Next comes the numerator, P(B|A) × P(A). Its first expression, P(B|A), is simply the reverse of the posterior previously discussed. This expression is formally referred to as the likelihood and is read as "the probability of (event) B given (event) A".

Continuing with our example of females with long hair, this expression examines the probability of an individual being a female provided that we know or assume that the individual in question has long hair: P(is a female | has long hair).

P(B|A) is then multiplied by P(A), read as "the probability of (event) A" and formally referred to as the prior. In our example, A = has long hair, so P(A) can be found by dividing the number of long-haired individuals by the total number of individuals in our sample.


Finally, the denominator P(B) is, as you may have already guessed, simply "the probability of (event) B", sometimes called the evidence. In our case, it is the probability that a randomly selected individual is a female. We can obtain this by dividing the number of females by the total number of individuals in our sample.
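To tie the three pieces together, here is a minimal Python sketch of the formula. The function name `bayes_posterior` and its parameter names (`likelihood` for P(B|A), `prior` for P(A), `evidence` for P(B)) are my own choices for illustration, and the numbers in the usage line are arbitrary rather than taken from any dataset.

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    # Each input has to be a valid probability.
    for name, p in (("likelihood", likelihood), ("prior", prior), ("evidence", evidence)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {p}")
    # If event B can never happen, "given B" has no meaning.
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return likelihood * prior / evidence


# Arbitrary illustrative numbers: P(B|A) = 0.5, P(A) = 0.2, P(B) = 0.25.
p_a_given_b = bayes_posterior(likelihood=0.5, prior=0.2, evidence=0.25)
print(f"P(A|B) = {p_a_given_b:.2f}")
```

The function does not check that the three inputs are consistent with each other, so it is only as trustworthy as the probabilities you feed it.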

Whew, that was a bit tricky. Give yourself a pat on the back before we dive in and apply the theorem to the hair length and gender example.

Application: Hair Length and Gender

Problem One: P(has long hair | is female)

Before we can begin calculating, we need a dataset to draw on. Let’s pretend we have a dataset of 100 individuals, 50 of whom are female. Among the females, 40 have long hair, while only 5 of the males have long hair. Here is the full break-down:

  • 50 females: 40 with long hair, 10 with short hair
  • 50 males: 5 with long hair, 45 with short hair
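If you’d like to follow along in code, the dataset can be written down as a small table of counts. This is just a sketch: the dictionary layout and the helper `count_where` are hypothetical names of mine, and the male counts (50 males, 45 of them with short hair) are derived from the numbers stated above rather than given directly.

```python
from fractions import Fraction

# Counts for the 100 individuals, keyed by (gender, hair length).
counts = {
    ("female", "long"): 40,
    ("female", "short"): 10,
    ("male", "long"): 5,    # stated in the post
    ("male", "short"): 45,  # derived: 50 males minus 5 with long hair
}
total = sum(counts.values())


def count_where(gender=None, hair=None):
    """Number of individuals matching the given gender and/or hair length."""
    return sum(
        n
        for (g, h), n in counts.items()
        if (gender is None or g == gender) and (hair is None or h == hair)
    )


# Simple (unconditional) probabilities: matching count divided by the total.
p_female = Fraction(count_where(gender="female"), total)
p_long = Fraction(count_where(hair="long"), total)
p_short = Fraction(count_where(hair="short"), total)

print(f"P(F) = {float(p_female)}")
print(f"P(L) = {float(p_long)}")
print(f"P(S) = {float(p_short)}")
```

Using `Fraction` keeps the arithmetic exact, so no rounding creeps in before the final step.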

Now we can get to work on all the pieces, with A = has long hair (L) and B = is female (F):

  • P(A) : P(L) = 45 long-haired individuals/100 individuals = 0.45
  • P(B) : P(F) = 50 females/100 individuals = 0.5
  • P(B|A) : P(F|L) = 40 long-haired females/45 total long-haired individuals = 0.8889

Plug in everything calculated above and…

P(L|F) = P(F|L) × P(L) / P(F) = 0.8889 × 0.45 / 0.5 = 0.8

According to our model, the probability of an individual having long hair provided that person is a female is 0.8 or 80%! That agrees with a direct count: 40 of the 50 females (80%) in our dataset have long hair. That’s a massive contrast compared to only 10% of males.
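As a sanity check, here is a short Python sketch that works Problem One from the raw counts in the dataset (100 individuals, 50 females, 40 long-haired females, 5 long-haired males). The variable names are my own. It computes P(L|F) once through Bayes’ theorem and once by counting long-haired females among females directly, and the two routes should agree.

```python
from fractions import Fraction

# Counts stated in the example.
total = 100
females = 50
long_haired_females = 40
long_haired_males = 5
long_haired = long_haired_females + long_haired_males

p_long = Fraction(long_haired, total)                             # P(A) = P(L)
p_female = Fraction(females, total)                               # P(B) = P(F)
p_female_given_long = Fraction(long_haired_females, long_haired)  # P(B|A) = P(F|L)

# Bayes' theorem: P(L|F) = P(F|L) * P(L) / P(F)
p_long_given_female = p_female_given_long * p_long / p_female

# Direct route: long-haired females divided by all females.
direct = Fraction(long_haired_females, females)

print(f"P(F|L)            = {float(p_female_given_long):.4f}")
print(f"P(L|F) via Bayes  = {float(p_long_given_female):.4f}")
print(f"P(L|F) via counts = {float(direct):.4f}")
```

Both routes give 4/5, or 80%.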

Problem Two: P(is female | has short hair)

Let’s use the same dataset but instead solve for the probability of an individual being female given the person has short hair: P(F|S), with S representing "short hair".

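Solve for all the different components, this time with A = is female (F) and B = has short hair (S):

  • P(A) : P(F) = 50 females/100 individuals = 0.5
  • P(B) : P(S) = 55 short-haired individuals/100 individuals = 0.55
  • P(B|A) : P(S|F) = 10 short-haired females/50 females = 0.2

P(F|S) = P(S|F) × P(F) / P(S) = 0.2 × 0.5 / 0.55 = 0.1818

According to our model, the probability of a short-haired person being female is 0.1818 or 18.18%. That matches a direct count of 10 females among the 55 short-haired individuals.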
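Problem Two can be sketched in Python as well. This version goes one small step beyond the post: instead of counting short-haired individuals for P(S), it builds P(S) from the two groups with the law of total probability, P(S) = P(S|F)P(F) + P(S|M)P(M). The male figures (50 males, 45 with short hair) are derived from the dataset description, and all variable names are mine.

```python
from fractions import Fraction

p_female = Fraction(50, 100)             # P(F)
p_male = 1 - p_female                    # P(M)
p_short_given_female = Fraction(10, 50)  # P(S|F): 10 of 50 females
p_short_given_male = Fraction(45, 50)    # P(S|M): 45 of 50 males (derived)

# Law of total probability: P(S) = P(S|F) P(F) + P(S|M) P(M)
p_short = p_short_given_female * p_female + p_short_given_male * p_male

# Bayes' theorem for each gender, given short hair.
p_female_given_short = p_short_given_female * p_female / p_short
p_male_given_short = p_short_given_male * p_male / p_short

print(f"P(S)   = {float(p_short):.2f}")
print(f"P(F|S) = {float(p_female_given_short):.4f}")
print(f"P(M|S) = {float(p_male_given_short):.4f}")
```

Because every short-haired person in this dataset is either female or male, the two posteriors add up to exactly 1, which is a handy check that nothing went wrong along the way.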

I hope the explanations and break-downs in this post were easy to follow and insightful. If there are any details I left out or anything that remains unclear, please let me know in the comments below! If you’re interested in more concrete real-world applications, take a look at my Naive Bayes Classification project that can predict the author of a Tweet!