
Monty-Hall from Bayes’ Theorem from Scratch

Ethan L Pierson   ·   September 2024

The Game

You are a contestant on the gameshow Monty-Hall Weekly and you have chosen a door. The host has just explained the rules:

There are three doors before you, labeled 1, 2, and 3. Behind one of these doors we’ve parked your dream car. Brand new. The keys are in the ignition. The other two doors, however, conceal nothing more than common barnyard goats.

You, dear contestant, have the privilege of selecting one door. At the end of the game, you’ll walk—or drive—away with whatever your selected door reveals. But here’s where it gets interesting: After you make your initial choice, I will open one of the two doors you have not selected. Then I will offer you the chance to switch to the other unopened door, or stick with the door you’ve already selected. Once you have your decision locked in, you will open the door and leave this stage with whatever lies behind it. Ready? Let’s play. Choose your door!

The host now opens one of the other doors and reveals a goat. As the familiar reader knows, we can improve our odds of winning the car from \(1/3\) to \(2/3\) if we now choose the opposite door.

Ah yes, of course. Wait. Uhmm. Why??? How???

The key to understanding why this strategy of switching our initial choice actually works lies in the following observation: Now that the host has opened one of the three doors, you have more information than you had a moment ago, when all three doors were closed. This idea is the heart of the jewel of probability known as Bayes’ Theorem.

Bayes’ Theorem from Scratch

Let us start by tossing aside any assumption of prior familiarity with Bayes’ rule. How about a derivation from first principles? It’ll take some work, but don’t fear!

Unit

Here’s the very first thing to know about probability: In any context where we speak of probable events, the probabilities of all outcomes, together, must sum to exactly one.

Say we have some situation represented by a sample space \(\Omega\), which contains all possible outcomes. To get a handle on the generality this abstraction commands, let’s simplify a little: Consider two specific events, \(A\) and \(B\), which are subsets of \(\Omega\). In symbols:

\[ \Omega = \{A, B\} \]

The probability of \(A\) plus the probability of \(B\) must be equal to one. In symbols:

\[ P(\Omega) = P(A) + P(B) = 1 \]

Immediately, this may seem quite abstract once again. Indeed, it is abstract by design. But abstractions are powerful things, at least so long as we ground them in clear examples. So, let us consider an archetypical situation our formalism has begun to describe. Hint: It may be the first thing that comes to mind when the word “probability” graces our ears: A coin toss.

Intuitively this makes sense. The situation of flipping a coin results in either the event that the coin lands on heads or the event that the coin lands on tails.

\[ P(\Omega) = P(heads) + P(tails) = 1 \]

Since the coin can only land on heads or tails (and not both), these events are mutually exclusive. Furthermore, one of these outcomes must occur, meaning the events are exhaustive. Hence, their probabilities sum to \(1\). This gives us a mathematical foothold into using probability as a tool. Let’s expand it!

Let us note that \(P(A)\) and \(P(B)\) can take any values between zero and one, so long as they sum to one. It’s simple enough to see that \(P(A)\) and \(P(B)\) will each be \(1/2\) in the case of a fair coin toss. The task now is to articulate this relationship in general. We can express this formally for \(n\) outcomes:

\[ P(\Omega) = \sum_{i=1}^{n} P(E_i) = 1 \]

This summation notation (denoted with \(\Sigma\)) is merely a concise way of writing the sum of \(n\) individual somethings, \(x_i\).

\[ x_1 + x_2 + \dots + x_n \;\; \equiv \;\; \sum_{i=1}^{n} x_i \]

In our case, the “something” is the probability of an event \(P(E)\) in whatever situation is under our consideration. We are using the notation to sum the probabilities of individual events, as in:

\[ P(\Omega) = P(E_1) + P(E_2) + \dots + P(E_n) = \sum_{i=1}^{n} P(E_i) = 1 \]

We add the subscript \(i\) as a way of indexing events, i.e., as a way of counting or telling them apart. The final event has the subscript \(i=n\), that is, \(P(E_n)\). Simply, it is the \(n^{th}\) event of \(n\) events, just as each earlier one is the \(i^{th}\) event of \(n\) events. Restated, we have a collection of \(n\) events where \(P(E_i)\) is the probability of the event \(E_i\). This summation notation is an expressive and concise way of generalizing that the total probability of the events in our space sums to one.
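To make the unit rule concrete, here is a minimal simulation sketch in Python (the variable names are our own): flip a fair coin many times and confirm that the empirical frequencies of heads and tails sum to one.

```python
import random

random.seed(0)

# Monte Carlo check of the unit rule: every flip lands on either
# heads or tails, so the two empirical frequencies must sum to one,
# and for a fair coin each should hover near 1/2.
n = 100_000
flips = [random.choice(["heads", "tails"]) for _ in range(n)]
p_heads = flips.count("heads") / n
p_tails = flips.count("tails") / n

print(p_heads + p_tails)  # sums to 1 (up to floating-point rounding)
print(p_heads)            # close to 0.5
```

The exact counts will vary with the seed, but the sum of the frequencies is fixed by construction: every outcome falls in exactly one of the two events.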

Complement

Now let’s think about how to express the relationship between the probability of an event \(E\) happening and the probability of the same event not happening. We use \(\neg E\) (read as not E) to denote the outcome that \(E\) does not happen. Just as before, it is necessarily the case that the probabilities sum to one: \[ P(E) + P(\neg E) = 1 \] Imagine cutting a slice from a pie and then asking a friend to place a candle somewhere on the pie. If the slice is \(E\) and the rest of the pie is \(\neg E\), we can interpret the above as expressing the fact that the candle is necessarily either on the slice or on the rest of the pie that we’ve cut it from. This particular illustration, we should note, is gestural rather than formal. Nevertheless, it should render unsurprising that we name \(\neg E\) the complement of \(E\). From here, one step of algebra reveals the complement identity: \[ P(\neg E) = 1 - P(E) \]

This is a useful way of representing the relationship between probabilities of complementary events. If we know one, we know the other. Thus far we have the tools to assess the probability of one-off events such as a coin landing on heads, or rolling a four with a six-sided die. And from this we observed that, whatever the number of events, one of them will definitely happen. What else would make for valuable additions to our toolset?
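The complement identity is easy to verify with exact arithmetic. A small sketch in Python, using the six-sided-die example (the variable names are ours):

```python
from fractions import Fraction

# The complement identity P(not E) = 1 - P(E) for a fair six-sided
# die, where E is the event "we roll a four".
p_four = Fraction(1, 6)
p_not_four = 1 - p_four

print(p_not_four)                # 5/6
print(p_four + p_not_four == 1)  # True: the pair sums to one
```

Using `Fraction` keeps the arithmetic exact, mirroring the algebra rather than approximating it with floats.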

Conditional

For instance: How would we measure the probability of an event occurring given that another event has already occurred? Let’s first imagine such a scenario. Consider a bag containing four colored marbles. Two of the marbles are red and the other two are blue. We plan to carry out the following procedure: We blindly retrieve a marble from the bag, note its color, and then set the marble aside. Say we intend to do this procedure twice, in a row. We draw our first marble and see that it is red. Now we ask ourselves: what is the probability that the next marble we retrieve is red?

We have come upon the notion of conditional probability. If \(A\) and \(B\) are events, then the conditional probability, denoted \(P(B|A)\), expresses the probability of \(B\) occurring given that \(A\) has already occurred. In general, we can think of this as a means of answering the question “Given that \(A\) happens, what is the probability of \(B\) happening?” Again, we can make sense of this intuitively. If we know that \(A\) has happened, we focus on just the outcomes where \(A\) occurs. We then evaluate how often \(B\) also happens in those cases. However, before we evaluate the above question, we must first consider what sort of relationship the events under consideration have to one another. We say that \(A\) and \(B\) are independent events when the occurrence of \(A\) does not influence the occurrence of \(B\). On the other hand, we say that \(A\) and \(B\) are dependent events when the occurrence of one does influence the occurrence of the other. Dependence and independence are two distinct types of relationships that events can have. In kind, there are two distinct ways to go about evaluating \(P(B|A)\).

Let’s first consider the case that \(A\) and \(B\) are dependent events (again, one does influence the other). What do we notice when we look to our marble scenario? We draw one marble, we see its color, then we draw another marble without replacing the first. Here, the act of retrieving a marble influences the subsequent act of retrieving another marble. Namely, it changes the number of marbles of a given color that remain inside the bag. So the events of our scenario are dependent. Let’s evaluate \(P(B|A)\) in this dependent case.

Let’s first phrase this as the sort of question that conditional probability answers: “Given that the first marble is red, what is the probability that the second marble we retrieve is red?” Since \(A\) denotes the event that has just occurred, we let \(P(A)\) denote the probability that our first draw was red. Before the first draw there were two red and two blue marbles in the bag, thus:

\[ P(A) = \frac{2}{4} = \frac{1}{2} \]

Naturally, \(B\) will denote the event of our second draw and \(P(B)\) the probability that the second marble is red. This particular moment highlights what conditional probability expresses in the dependent case: Had we asked the same question before drawing the first marble, we would not know if our second draw would be from a bag of one red and two blue marbles or from a bag of two red marbles and one blue. We start to see an inkling of what ultimately underpins Bayes’ theorem: additional information. But for the time being, we ask this question after our first draw. That draw was red. Hence, one red and two blue marbles remain in the bag. Thus:

\[ P(B|A) = \frac{1}{3} \]

Now, let’s consider the case that \(A\) and \(B\) are independent events (one does not influence the other). Then \(P(B|A) = P(B)\). The conditional probability reduces to the plain-old probability of a single event occurring. If we make a small tweak to our marble retrieval procedure, the scenario also reduces to this independent case. What’s the tweak? Notice: If we replace the marble after each draw, the likelihood of retrieving any given color will be independent of all other draws, as there would always be a constant number of marbles of a given color in the bag. In this independent case, the event that the first marble is red has no influence on the event that the second marble is red. In this case it’s easy to see that \(B\) is functionally identical to \(A\). We can plug in our numbers to confirm:

\[ P(B|A) = P(B) = \frac{2}{4} = \frac{1}{2} \]
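Both cases of the marble procedure can be checked empirically. Here is a simulation sketch in Python (the function name and its internals are our own construction) that estimates \(P(B|A)\) with and without replacement:

```python
import random

random.seed(1)

def second_red_given_first_red(replace, trials=100_000):
    """Estimate P(second draw red | first draw red) by simulation."""
    hits = total = 0
    for _ in range(trials):
        bag = ["red", "red", "blue", "blue"]
        random.shuffle(bag)
        first = bag.pop()      # draw a marble and set it aside
        if replace:
            bag.append(first)  # put it back: the draws become independent
        if first == "red":
            total += 1
            if random.choice(bag) == "red":
                hits += 1
    return hits / total

print(second_red_given_first_red(replace=False))  # close to 1/3
print(second_red_given_first_red(replace=True))   # close to 1/2
```

The `replace` flag is exactly the “small tweak” described above: putting the marble back restores the bag, so the first draw stops carrying information about the second.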

Another alteration to the marble scenario gives us joint probability. This describes the likelihood of two events happening together. The joint probability of \(A\) and \(B\) is written as \(P(A \cap B)\). We introduce the set intersection symbol \(\cap\) (read and), writing \(A \cap B\) to denote the co-occurrence of the events \(A\) and \(B\) (i.e., \(A\) occurs and \(B\) occurs). By using our established rule of conditional probability, we can express the joint probability of two events as:

\[ P(A \cap B) = P(A) \times P(B|A) \]

It’s nothing more than the product of the two: the probability of \(A\), times the probability of \(B\) given \(A\).
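We can confirm the joint probability for our no-replacement marble scenario with exact fractions (a small sketch; the variable names are ours):

```python
from fractions import Fraction

# Joint probability for the no-replacement marble scenario:
# P(A) is the chance the first draw is red, P(B|A) the chance the
# second is red given the first was, so P(A and B) = P(A) * P(B|A).
p_a = Fraction(2, 4)          # two of the four marbles are red
p_b_given_a = Fraction(1, 3)  # one red among the three remaining
p_joint = p_a * p_b_given_a

print(p_joint)  # 1/6
```

So drawing red twice in a row, without replacement, happens one time in six.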

Total

Finally, we must introduce the law of total probability, which allows us to express the probability of an event in terms of all the possible scenarios (or partitions of \(\Omega\)) that could lead to that event. If \(\{A_1, A_2, \dots, A_n\}\) is a collection of mutually exclusive and exhaustive events (meaning exactly one of them must occur), then the probability of any event \(B\) can be written as:

\[ P(B) = \sum_{i=1}^{n} P(B|A_i) \times P(A_i) \]

To see what total probability captures, we’ll have to consider a new marble scenario. Let’s start similarly: Imagine a bag containing two colors of marbles—red marbles and blue marbles—but there’s a catch: We actually have two such bags. We’ll name the bags Bag X and Bag Y. Bag X contains three red marbles and one blue marble. Bag Y contains one red marble and three blue marbles.

Our procedure starts by letting a fair coin toss assign which bag we will later retrieve a marble from. Say heads assigns Bag X and tails assigns Bag Y. We’ll let \(X\) denote the event that we are assigned Bag X, and likewise let \(Y\) denote the event we’re assigned Bag Y. Thus, \(P(X)=1/2\) and \(P(Y)=1/2\). After we’ve been assigned a bag, we blindly retrieve one marble from within. The question we want to answer is: What is the overall probability that the marble we draw is red?

Well, there are two sets of conditions under which we end up drawing a red marble, namely: either \(X\) and then we draw a red marble, or \(Y\) and then we draw a red marble. We’ll use \(R\) to denote the event that we draw a red marble in the end.

From here, the total probability formula tells us that we ought to add up the probabilities of drawing a red marble in each case, weighted by the probability of that case occurring. Consider \(P(R|X)\): the probability that we draw a red marble given that we are assigned Bag X. We find:

\[ P(R|X) = \frac{3}{4} \]

So, the probability of \(X\) and drawing a red marble is:

\[ P(X) \times P(R|X) = \Bigl(\frac{1}{2}\Bigr) \Bigl(\frac{3}{4}\Bigr) = \frac{3}{8} \]

Now let’s consider the case of \(Y\) then \(R\):

\[ P(R|Y) = \frac{1}{4} \]

So, the probability of \(Y\) and \(R\) is:

\[ P(Y) \times P(R|Y) = \Bigl(\frac{1}{2}\Bigr) \Bigl(\frac{1}{4}\Bigr) = \frac{1}{8} \]

The total probability of drawing a red marble is the sum of the probabilities from both scenarios:

\[ P(R) = \frac{3}{8} + \frac{1}{8} = \frac{4}{8} = \frac{1}{2} \]

We were able to break the scenario down into different paths (in our scenario, drawing red from Bag X or Bag Y), and then weigh the probabilities of each path by how likely it is to occur (being assigned Bag X vs Bag Y).
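The two-bag computation lends itself to a quick check, both exactly and by simulation (a sketch in Python; the bag names follow the text, everything else is our own):

```python
import random
from fractions import Fraction

# Law of total probability for the two-bag scenario: weight each
# bag's chance of yielding red by the chance of being assigned it.
p_red_exact = Fraction(1, 2) * Fraction(3, 4) + Fraction(1, 2) * Fraction(1, 4)
print(p_red_exact)  # 1/2

# Cross-check by simulation: flip for a bag, then draw one marble.
random.seed(2)
bags = {"X": ["red"] * 3 + ["blue"], "Y": ["red"] + ["blue"] * 3}
trials = 100_000
reds = sum(random.choice(bags[random.choice("XY")]) == "red"
           for _ in range(trials))
print(reds / trials)  # close to 0.5
```

The simulation walks the same two paths the formula sums over: bag assignment first, then the draw.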

Congratulations! That’s it. We’ve now established the foundational rules of probability (actually, this guy named Kolmogorov did it. But that’s another story). What we truly have managed to do is establish all the fundamentals we need to move forward. It’s about time we assemble this belief-updating tool. Let’s derive it.

Let’s Derive Bayes’

Consider two probable events \(A\) and \(B\). As we just looked at, the joint probability of \(A\) and \(B\) occurring together is the probability of \(A\) times the probability of \(B\) given that \(A\) occurs:

\[ P(A \cap B) = P(A) \times P(B|A) \]

Since the intersection \(A \cap B\) is the same event as \(B \cap A\), the joint probability can be equivalently expressed as the probability of \(B\) times the probability of \(A\) given that \(B\) occurs:

\[ P(A \cap B) = P(B) \times P(A|B) \]

Given the equivalence of the two previous expressions, do you notice an identity we can exploit?

\[ P(A) \times P(B|A) \;\; \equiv \;\; P(A \cap B) \;\; \equiv \;\; P(B) \times P(A|B) \]

Thus we can set these two expressions equal to each other:

\[ P(B) \times P(A|B) = P(A) \times P(B|A) \]

And solve for the probability of \(A\) given \(B\) to get:

\[ \displaystyle P(A|B) = P(A) \times \frac{P(B|A)}{P(B)} \]

This is the equation known as Bayes’ Theorem. We should observe that the right hand side can also be written as a single fraction (by folding the \(P(A)\) term into the numerator):

\[ \displaystyle P(A|B) = \frac{P(A) \, P(B|A)}{P(B)} \]

Though it is irrelevant to our imminent application, and we are free to forgive ourselves any forgetfulness of the algebraic manipulation, it is still worth our time to note that both arrangements of symbols are exactly equivalent. The next time we encounter Bayes’ Theorem written down in the wild, we will be sure to spot it. But first, and finally, it’s time to win…probably.

Monty-Hall from Bayes’

The canonical narrative specifies one fancy car and two goats hidden behind three doors. To ever so slightly simplify discussion of this problem, we will rename the doors to boxes: one box contains the prize and the other two boxes remain empty (in lieu of the car and goats). This may also appease the ever-vocal demographic who wish to walk out of the scenario with a pet goat in hand. It should be easy to see that nothing about the problem itself is changed.

So how do we apply our new mathematical toolset? Let’s begin by thinking about events: Where might the prize be? We might consider the event that the prize is in some specific box. Okay, let’s try to describe the sample space of the prize being in some box. We already know that the prize is in one of the boxes. This much doesn’t help us. So let’s introduce something else that can pick out a particular box. Any ideas? Right: The box we choose.

We have our first event, \(A\): The prize is inside the initially chosen box. Note that we’ve also identified \(\neg A\), because either the prize is there or it isn’t. What other event can we identify? How about the very next thing that happens after we’ve made our initial choice: The host opens a box to reveal that it is empty.

Here, it’s worth a brief aside to clarify what some of the original audience of the Monty-Hall problem claimed to be an ambiguity in its statement: What happens if the host opens one of the two unchosen boxes and reveals the prize?! This never happens. The rules of the game, whether inferred or explicitly stated, ensure this: The host always opens an empty box during the show. The host knows which box the prize is in, and so knows which boxes are empty. No matter which box we initially choose, the host has at least one box (of the two unchosen) to open and reveal to us its emptiness. At least one empty box is an important thing to observe here. And this observation lies behind many classical explanations of the apparently paradoxical nature of the odds. For now, let’s make a simple numerical observation: In the event that our initial choice of box contains the prize, the host has the freedom to open either of the two empty boxes. And it makes no difference whatsoever which one is opened in this case. In the event that our initial choice of box is empty, the host is left with only one box that can be opened to reveal that it is empty.

We have our events. \(A\): The event that the prize is inside the initially chosen box. And \(B\): The event that the host opens a specific box revealing that it’s empty (let’s arbitrarily say that this is a particular box \(\texttt{No}\:\textbf{3}\), just to ease our discussion).

We know from the exposition of the game what some of our probabilities for these events will be. First, \(P(A) = 1/3\) (the prior probability that the prize is inside the initially chosen box). We also know \(P(B|A) = 1/2\) (if the prize is inside the chosen box, the host has two boxes to choose from to reveal they are empty).

Now how do we get \(P(B)\)? We can start by noting that \(P(B|\neg A) = 1\). This is a recollection of our earlier observation: If the prize is not inside the chosen box, then the host has only one choice of box to open. In this case we have a partition of two events: \(A\) (the prize is inside the chosen box) and \(\neg A\) (the prize is not inside the chosen box). These are mutually exclusive and exhaustive: the prize must be either inside the chosen box or not. It can’t be both. Thus we can apply our rule of total probability!

\[ \displaystyle P(B) = P(B|A) \times P(A) + P(B|\neg A) \times P(\neg A) \; = \; \Bigl(\frac{1}{2}\Bigr)\Bigl(\frac{1}{3}\Bigr) + \bigl(1\bigr)\Bigl(\frac{2}{3}\Bigr) \; = \; \frac{1}{6} + \frac{2}{3} \; = \; \frac{5}{6} \]

Finally, we have everything we need to plug and play! \(P(A) = 1/3\), \(P(B|A) = 1/2\), and \(P(B) = 5/6\). Let’s remind ourselves one final time: We are going to find \(P(A|B)\), which is the probability that the prize is inside the initially chosen box, given that the host has opened box \(\texttt{No}\:\textbf{3}\) to reveal that it is empty. Let’s set up Bayes’ Theorem and evaluate.

\[ \displaystyle P(A|B) = \frac{P(A)P(B|A)}{P(B)} = \frac{(1/3) (1/2)}{(5/6)} = \frac{1}{3} \]

Since \(P(A|B)\) is \(1/3\), we know from the complement identity that the probability of the event that the prize is inside the other unopened box (the one we’d switch to) must be \(2/3\). This is why switching improves our odds!
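Finally, nothing builds confidence like playing the game many times. A Monte Carlo sketch in Python (the function `play` and its internals are our own construction) that compares the stick and switch strategies:

```python
import random

random.seed(3)

def play(switch, trials=100_000):
    """Simulate the Monty-Hall game; return the fraction of wins."""
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # Host opens an empty box that is not our chosen box.
        opened = random.choice(
            [b for b in range(3) if b != choice and b != prize])
        if switch:
            # Switch to the one remaining unopened box.
            choice = next(b for b in range(3)
                          if b != choice and b != opened)
        wins += (choice == prize)
    return wins / trials

print(play(switch=False))  # stays near 1/3
print(play(switch=True))   # climbs to about 2/3
```

The simulation encodes the host’s rule directly: the opened box is never the chosen box and never the prize, which is precisely the information Bayes’ Theorem cashes in.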

Let’s take a moment to reflect on the key insight of this whole story: The host’s action of opening a specific box is the source of information that allows us to update our probabilities. Bayes’ Theorem quantifies this update to our knowledge, converting our priors into posteriors. The host has given us all the information we need.

References

  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. (2014). Bayesian Data Analysis. 3rd ed. Chapman & Hall/CRC.
    A great formal resource in print.
  • Kolmogorov, A.N. (1950) Foundations of the theory of probability. 1st Eng ed. Translated by N. Morrison. Chelsea Publishing Company. Available at: https://archive.org/details/kolmogorov_202112/mode/1up [Accessed 16 Aug. 2024].
    There have been numerous English printings. At the time of this writing, the text is reprinted by Dover: ISBN13 978-0486821597.
  • Pearl, J. & Mackenzie, D. (2018) The Book of Why. Basic Books.
    I highly recommend this read to anyone and everyone.
