Recently we tweeted an interesting article on big data, from the Financial Times.

The author’s key point is that sampling bias and sampling error are possible even with large data sets. As an illustration, the author discusses a classic case where the Literary Digest incorrectly predicted that Alf Landon would beat FDR in the 1936 election. The prediction was wrong despite the fact that a very large polling sample was used.

A more recent example is the news that Strava — a social media website for competitive cyclists and runners — is selling its data to cities to help them plan better cycling infrastructure. The story is compelling because it feels like a clever, low-cost way to get insight into where new bicycle lanes might be most helpful. But I question whether Strava’s user base is the same as the larger population of casual urban cyclists.

Turning back to the election example, I thought a few of you might be interested in a more mathematical treatment. The rest of this post presents a simplified mathematical model of election polling which shows, first, that sample size matters, and second, that random sampling is critical.

Let’s assume for starters that the pool of eligible voters has size *V*, and every eligible voter has already made up his mind. Let’s further assume that there are only two candidates and that every eligible voter will in fact vote — hey, it’s just a model! Under these circumstances, some fraction *p* of the electorate will vote for candidate *A* come Election Day. Our task as polling wizards is to estimate *p*.

If we could ask everyone in *V*, we could compute *p* exactly (this is “N=all” in the article’s parlance). However, if *V* is big, the cost of asking everyone is just too great. In practice, we will sample only a subset of size *N* from the electorate and try to estimate *p* from this subset.

Of course, once we ask one person, there are only *V-1* people left to sample. We will assume that *V* is big enough that the fraction of remaining people who favor candidate *A* (i.e., *p*) changes only trivially as our sampling removes people one by one from the pool.

Now we need some math to get any further. Let’s define the random variable *X_i* to be 1 if the *i*th person in our sample favors candidate *A*, and 0 if he favors candidate *B*. *X_i* is a Bernoulli distributed random variable (RV) with parameter *p*. The following sum is a natural choice for estimating *p* and would no doubt occur to most of you (here we use the Greek letter rho, which looks like a *p*, to denote our estimate):

\rho_N=\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}X_i
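To make the estimator concrete, here is a minimal Python sketch that simulates a poll and computes the sample fraction defined above. The function name `poll_estimate`, the seed, and the value 0.55 for the true support are illustrative assumptions, not figures from the article.

```python
import random

def poll_estimate(p, n, rng):
    """Return rho_N: the fraction of n simulated respondents who favor candidate A."""
    # Each respondent is an independent Bernoulli(p) draw: 1 for A, 0 for B.
    votes = [1 if rng.random() < p else 0 for _ in range(n)]
    return sum(votes) / n

rng = random.Random(42)   # fixed seed so the run is repeatable
true_p = 0.55             # hypothetical true support for candidate A
for n in (10, 100, 1000, 10000):
    print(n, poll_estimate(true_p, n, rng))
```

Running it for a few sample sizes gives a feel for how the estimate behaves as the sample grows.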

We want to know what happens to this estimate as *N* increases. Intuitively, it will get better. But how fast and under what conditions?

To answer this, we need to know the mean (expected value) and variance of the *X_i* RVs. The mean of each *X_i* is just *p*; this follows from the definition of the expected value (applying the definition yields 1 times *p* plus zero times 1 minus *p*). It is similarly easy to show from definitions that the variance of each *X_i* is *p(1-p)*.
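For readers who want both steps written out, the definitions give the following (the second line uses the fact that a variable taking only the values 0 and 1 equals its own square):

E[X_i]=1\cdot p+0\cdot(1-p)=p

E[X_i^2]=1^2\cdot p+0^2\cdot(1-p)=p

VAR(X_i)=E[X_i^2]-(E[X_i])^2=p-p^2=p(1-p)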

Our next task is to derive the mean and variance of rho. It is critical to keep in mind that our estimator is also an RV! For the mean we have (using *E* as the expectation operator):

E[\rho_N]=\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}E[X_i]=\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}p=p

This is a nice result. It shows that the expected value of our estimator random variable is exactly the right value, *p*; in other words, the estimator is unbiased. We chose well. However, having the right expected value does not mean you will get exactly the value *p* from any one polling sample. How close a single estimate is likely to be to *p* depends on the variance of the estimator, so let’s compute it. The first step is to compute the expected value of the estimator squared. Once we know this, we can subtract the square of the estimator’s mean from it, and we will have the variance of the estimator (this is a classic result from probability theory). We have

E[\rho_{N}^2]=\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\displaystyle\sum_{j=1}^{N}E[X_{i}X_{j}]

Here we encounter the first stumbling block. To simplify this further we need to assume that the *X_i* and *X_j* RVs are *independent* of each other. In this simplified model, this amounts to assuming that there is no “sampling bias,” i.e., no systematic bias in our grabbing of people from the pool of eligible voters. If such a bias is present, then the variables are *not* independent, and we *can’t* simplify the expression. However, if we are truly plucking people from the pool at random, then for *i* not equal to *j*, the expected value of the product of the two RVs is just the product of their expected values. For *i* equal to *j*, note that *X_i* squared equals *X_i* (it is either 0 or 1), so its expected value is just *p*. This allows us to write

E[\rho_{N}^2]=\dfrac{1}{N^2}\displaystyle\sum_{i=1}^{N}\displaystyle\sum_{j=1}^{N}E[X_{i}X_{j}]=\dfrac{1}{N^2}\Bigg(\displaystyle\sum_{i=1}^{N}E[X_{i}^2]+\displaystyle\sum_{i\neq j}E[X_{i}]E[X_{j}]\Bigg)=\dfrac{1}{N^2}(Np+(N^{2}-N)p^2)

If we subtract the mean of our estimator squared from the right-hand side of the above (i.e., subtract *p* squared), we get the variance of our estimator. With a little bit of algebra this becomes:

VAR(\rho_{N})=\dfrac{p(1-p)}{N}=\dfrac{\sigma^2}{N}
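For anyone who wants the algebra spelled out, one way to get there is to expand the expression for the expected value of the squared estimator and subtract the squared mean:

VAR(\rho_{N})=E[\rho_{N}^2]-(E[\rho_{N}])^2=\dfrac{Np+(N^{2}-N)p^2}{N^2}-p^2=\dfrac{p}{N}+p^2-\dfrac{p^2}{N}-p^2=\dfrac{p(1-p)}{N}

Here sigma squared is simply shorthand for *p(1-p)*, the variance of a single *X_i*, so the variance of the estimator is the variance of one response divided by *N*.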

This shows why the sample size matters. As the size of our sample increases (as *N* gets bigger), the variance gets smaller, meaning that the distribution of the estimator RV becomes more and more tightly centered around the mean value *p*. In probability terms, this means that the probability that our estimate is off by a significant amount drops as *N* increases. Because our estimator is a scaled sum of independent RVs, the Central Limit Theorem says that its distribution becomes well approximated by a Gaussian as *N* increases. We can therefore easily estimate the probability that our estimator will be off by, say, more than three percentage points from the true value (i.e., that our estimate is more than 0.03 away from the mean value *p*). This leads to the concept of “confidence” in polling, and the ability to determine an appropriate sample size *N* to get a desired confidence level. Differences between the estimator’s value and the true value *p* (which is not known) are called sampling error. The best we can do with sampling error is to model it probabilistically.
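As a rough sketch of how a sample size might be chosen from a desired confidence level, the Gaussian approximation can be inverted in a few lines of Python. The function name `required_sample_size` and the default of *p* = 1/2 (which maximizes *p(1-p)* and so gives the most conservative answer) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, confidence, p=0.5):
    """Smallest N for which the normal approximation puts rho_N within
    `margin` of p with probability `confidence`."""
    # Two-sided critical value: P(|Z| <= z) = confidence for a standard normal Z.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Solve z * sqrt(p(1-p)/N) <= margin for N.
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(required_sample_size(0.03, 0.95))    # three points, 95% confidence
print(required_sample_size(0.03, 0.999))   # three points, 99.9% confidence
```

Note that the required *N* depends on the margin and the confidence level but not on the size of the electorate, which is consistent with the large-*V* assumption made earlier.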

As a concrete example, let’s consider a sample size of 3000, as mentioned in the article. We don’t know *p*, but we know that the variance of a Bernoulli distributed RV is largest when *p* = 1/2, in which case its variance is 1/4. Therefore, if we use a sample size of 3000, the variance of our estimator is upper bounded by 1/12,000 regardless of the true value of *p*. A variance of 1/12,000 corresponds to a standard deviation of about 0.0091, so an error of 0.03 is roughly 3.3 standard deviations. Looking this up in a standard normal distribution table gives the probability that our estimate will differ from the true value of *p* by more than 0.03: about 1 in 1000. In other words, roughly 999 out of 1000 times our estimate will be within 0.03 of the true value of *p*, as long as our samples are picked at random.
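The table lookup can be reproduced in Python, and the same script can illustrate the caveat about random sampling. This is only a sketch: `prob_within` and `skewed_poll` are hypothetical helper names, and the bias mechanism (supporters of *A* being some fixed multiple more likely to be reached than supporters of *B*) is a made-up stand-in for whatever skews a real poll.

```python
import math
import random
from statistics import NormalDist

def prob_within(margin, n, p=0.5):
    """Normal-approximation probability that rho_N lands within `margin` of p."""
    sd = math.sqrt(p * (1 - p) / n)   # standard deviation of the estimator
    z = margin / sd                   # margin measured in standard deviations
    return 2 * NormalDist().cdf(z) - 1

def skewed_poll(p, n, reach_ratio, rng):
    """Poll in which supporters of A are `reach_ratio` times as likely to be
    reached as supporters of B (an assumed, illustrative bias mechanism)."""
    # Probability that a reached respondent favors A under this bias.
    p_reached = p * reach_ratio / (p * reach_ratio + (1 - p))
    return sum(rng.random() < p_reached for _ in range(n)) / n

print(prob_within(0.03, 3000))               # worst case p = 1/2
rng = random.Random(0)                       # fixed seed for repeatability
print(skewed_poll(0.5, 3000, 1.0, rng))      # unbiased sample of 3000
print(skewed_poll(0.5, 300000, 1.5, rng))    # biased sample, 100 times larger
```

With a reach ratio of 1.5 and a true *p* of 0.5, the biased poll is centered on 0.6 rather than 0.5, and increasing *N* only tightens the estimate around that wrong value.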