Big Data, Probability and Birthdays: Part 2 of 2

In Part One of this blog post, I discussed how to state an experiment in the form of probability spaces. Determining the sample space and the event space is necessary to be able to talk intelligently about probability measures, which is the topic of this post.

Approach 1: Counting

We’ve figured out the sample space and event space for the birthday problem already. So how about the probability measure? Let’s assume that people are equally likely to be born on any day of the year (this turns out not to be true, but let’s go with it). Then we should assign a probability of 1/365 to each outcome in the simple sample space underlying our vector sample space. We’ll further assume that people’s birthdays are independent of each other so that we can assign a probability to each vector as the product of the probabilities of each of its components. We see that the probability of each outcome vector is:

P(\omega)=\dfrac{1}{365^N}

where omega is our notation for a single outcome vector. Furthermore, it is easy to see that the size of the sample space is given as

Sample \space space \space size = 365^N

Now, how many outcomes are there in the sample space where no two components of the birthday vector are identical?

We need to answer two sub-questions to answer this question:

  1. How many ways can we select N distinct days from 365?
  2. Given a selection of N distinct days, how many ways can we assign those days to N people?

The answer to the first question is a combination, while the answer to the second question is a permutation. The product of the two is the number of vectors in the sample space with distinct components:

Number \space of \space vectors \space with \space distinct \space components = \dbinom{365}{N}N!

Now, since the probability of each outcome vector is the same, summing up the probabilities assigned to every outcome in our event yields the following product:

P(All \space vector \space components \space are \space distinct)=\dfrac{1}{365^N}\dbinom{365}{N}N! \newline \newline =\dfrac{365 \cdot 364 \cdot 363 \cdots (365-N+1)}{365^N}

Note that this answer corresponds to our intuition about probabilities: the numerator is the total number of vectors satisfying the distinctness criteria, and the denominator is the total number of outcome vectors.

There is another way to count the number of vectors in the numerator, and it may or may not feel more intuitive. It is a recursive argument that starts with N = 2 and goes like this: There are 365 ways to choose the first birthday; for each one of those 365 ways, there are 364 ways of choosing the second birthday such that it is distinct from the first. So there are 365^364 total ways for the first and second birthdays to be unique. In the case of N = 3, for each of the 365^364 ways of choosing the first two birthdays to be distinct, there are 363 ways to choose the third birthday to be different from both of the first two. So there are 365^364^363 ways in which all 3 birthdays can be different. Continuing the argument in this manner we see that for the general case of N distinct birthdays we get the numerator in the above expression.

Approach 2: Conditional Probability

In this section, I assume you’ve been exposed to the concepts of joint probability, conditional probability and independent events. (I did not discuss these in Part One of the blog post.) We seek the probability of the event that no two components of the birthday vector are equal. We can work our way up to that by starting with the probability that Birthday 2 does not equal Birthday 1. Since the events are independent, this probability is given as follows

P(B_{2} \ne B_{1})=\dfrac{364}{365}

Here B denotes birthday, and the subscript indicates the birthday of the person in question. Now, what is the probability that the third person’s birthday is distinct from both the first and second person’s? In a notation I hope is obvious (because I don’t want to define it!) we have

P(B_{3} \ne B_{2,1},B_{2} \ne B_{1}) \newline =P(B_{3} \ne B_{2,1} |B_{2} \ne B_{1})P(B_{2} \ne B_{1}) \newline =\dfrac{363}{365}\cdot\dfrac{364}{365}

Here we expressed the desired joint probability as the product of a conditional probability and a probability we already know. Then we used independence in order to “drop” the conditioning term. (Intuitively this is possible because the conditioning term depends on the birthdays of Persons 1 and 2, and the birthday of Person 3 is independent of those birthdays. So the probability that Person 3 is born on any given day is not changed by the fact that we know the birthdays of Persons 1 and 2.) Continuing the argument recursively the next step becomes:

P(B_{4} \ne B_{3,2,1},B_{3} \ne B_{2,1},B_{2} \ne B_{1}) \newline =P(B_{4} \ne B_{3,2,1}|B_{3} \ne B_{2,1}, B_{2} \ne B_{1})P(B_{3} \ne B_{2,1},B_{2} \ne B_{1}) \newline =P(B_{4} \ne B_{3,2,1}P(B_{3} \ne B{2,1},B_{2} \ne B_{1}) \newline =\dfrac{362}{365} \cdot \dfrac{363}{365} \cdot \dfrac{364}{365}

Continuing the argument recursively in this manner we obtain the following expression for the joint probability we desire

P(All \space birthdays \space distinct)=\dfrac{364}{365} \cdot \dfrac{363}{365} \cdot \dfrac{362}{365} \cdots \dfrac{365-N+1}{365}

Multiplying this equation by 1 in the form of 365 over 365 yields the previously derived equation from Approach 1.

Cool Fact

When N = 23, the probability that all the birthdays are distinct is 0.493. This means that the probability of the complementary set — the event where at least two birthdays are identical — is given by 1 – 0.493 = 0.507. So when just 23 people assemble, there is already a 50% chance that at least two of them have the same birthday. Clearly, if there are 366 people in the same room, at least two people must have the same birthday. Does it match your intuition that by the time 23 people have assembled, the probability that at least two of them share a birthday is already ½?