A poll published today by Survation on behalf of the Liberal Democrats looks at voting intentions in the constituency of Cambridge, a key target seat for Jo Swinson’s party. The headline figure of a 9% lead for the Liberal Democrats seems encouraging for them, but is that what the poll actually tells us? Welcome to the wonderful world of sample sizes, margins of error and Monte Carlo simulations as we explore what this poll is really saying.

Survation are clear in their data tables that with a sample size of 417, there is a large margin of error on the poll. What does that actually mean? When a polling firm conducts an opinion poll they try and find a sample that represents the wider population whose views they are looking to measure. For example, most of the national polls have a sample of somewhere between 1,000 and 2,000 people (though at the last election YouGov used a rolling sample of around 30,000). Can 1,000 people accurately measure the views of the entire population? Well, if you try and find 1,000 people who represent the electorate to a fair degree, the answer is “yes”.

The way that polling firms do this is to make sure that the different sub-groups of the people they ask questions of broadly fit the different sub-groups of the wider population. For example, if 25% of the voting population is in the 18 to 40 age group, the firm will try and make sure that 25% of their sample is in this group as well. Half the population male? Try and make sure that 50% of your sample is male. 9% live in Scotland? Make sure that 9% come from north of the border and so on.

Very often polling firms get close to these numbers but not exactly, so they will then weight each individual respondent to make sure that the overall sample fits the wider population as well as possible.
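As a rough sketch of how that weighting works (with invented numbers, not Survation’s actual targets), each respondent in a group simply gets a weight of the group’s population share divided by its share of the raw sample:

```python
# Illustrative post-stratification weighting (all figures invented for the example).
# Each respondent in a group gets weight = population share / sample share,
# so the weighted sample matches the wider population.

population_share = {"18-40": 0.25, "41+": 0.75}   # assumed population profile
sample_share     = {"18-40": 0.20, "41+": 0.80}   # what the raw sample looked like

weights = {group: population_share[group] / sample_share[group]
           for group in population_share}

print(weights)  # 18-40 respondents are weighted up, 41+ weighted down
```

Real pollsters weight on several dimensions at once (age, sex, region, past vote and so on), but the principle is the same.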

It’s worth also mentioning “over-sampling” here. This is where a firm might want to look specifically at a particular group in order to find out particular things about them. For example, a polling firm might be looking to explore national voting trends but also be interested in seeing what the trends are in London. They do this by taking a larger than normal sample for the area of interest, which allows them to have a much more accurate picture of what’s going on in that particular group. They would then weight that group down for the overall poll. In our example of London, a normal poll with 2,000 respondents would expect to sample in the region of 300 people in London to represent the population there. An oversample might try and find 750 people in London in order to have more accurate figures there, but then when it came to reporting the overall UK figures it would give each of those London respondents a much lower weight so that London was reflected accurately in the overall numbers.
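The down-weighting in that London example is simple arithmetic: each of the 750 oversampled respondents gets a weight of 300/750, so that London still counts for roughly its 300 “effective” respondents in the national figures:

```python
# Down-weighting an oversample (figures from the London example above).
expected_in_sample = 300   # London's proportional share of a 2,000-person poll
oversample_size    = 750   # respondents actually interviewed in London

weight = expected_in_sample / oversample_size
print(weight)                        # each London respondent counts for 0.4
print(weight * oversample_size)      # back to 300 effective respondents nationally
```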

Back to Cambridge then. The Survation sample is 417, and as they tell us themselves this gives a margin of error of 4.8% (two standard deviations). This means that when the poll reports the Liberal Democrats on 39%, there’s actually a 95% probability that the *true* level of support for the Liberal Democrats is somewhere between 34.2% and 43.8%. That range of probabilities isn’t uniform but rather is **normally distributed**, which in layman’s terms means it looks like the bell curve below.
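If you want to check that 4.8% for yourself, the standard formula for the margin of error on a sampled proportion is z × √(p(1 − p)/n), evaluated at the most conservative value p = 0.5. A quick sketch in Python (my own, assuming this is the convention Survation used):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(417)
print(f"{moe:.1%}")  # ~4.8%
```

Note how the margin shrinks only with the square root of the sample size: to halve it you need four times as many respondents.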

If the margin of error is 4.8%, a Labour poll rating of 30% corresponds to a 95% probability of a vote between 25.2% and 34.8%. Now, that 34.8% is higher than the lower end of the probability range of the Liberal Democrat vote, so suddenly a 9% poll lead has turned into a narrow possibility that actually Labour are ahead in the constituency. How much of a probability? Well, we don’t actually know yet because there’s another subtlety at play here.

The 39% to 30% **isn’t** based on the full 417 people interviewed, because not all of those people gave an answer to the question “who will you vote for” and even if they did, that answer might have been “I don’t know”. The number of people who actually answered the question with a positive response was 309, a little under 75% of the overall sample. As you might expect, the margin of error on a smaller sample is higher still, in this case 5.6%. The unrounded figures of support for the Liberal Democrats and Labour are 39.0% and 30.3% respectively, so the 95% confidence interval of the actual vote for each candidate is 33.4% to 44.6% for the Liberal Democrats and 24.7% to 35.9% for Labour.
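The same formula as before, fed the smaller sample of 309, reproduces the wider margin and both confidence intervals (again a sketch of the standard calculation, not Survation’s own code):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(309)          # ~5.6% on the smaller sample
for party, share in [("Lib Dem", 0.390), ("Labour", 0.303)]:
    print(f"{party}: {share - moe:.1%} to {share + moe:.1%}")
```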

If we represent that graphically we get the chart above. It’s clear from this that there’s a considerable overlap between the two candidates and that 9% poll lead is not a guarantee of a Liberal Democrat win. But what is the probability that on these opinion poll figures Labour might actually be in the lead?

We can calculate this mathematically very quickly with a simple formula that I won’t bore you with, and it turns out that there’s around a 1.5% chance that the **true** position is actually a Labour lead. Diving in a bit deeper, we can calculate that the probability of the Liberal Democrat lead being less than 2% is around 5%, of it being less than 4% is 12%, less than 6% is around 25% and less than 8% is around 42.5%.
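For the curious, that “simple formula” is just the normal distribution again: if the two vote shares are independent, their difference is normally distributed with a standard deviation √2 times each party’s own, and the chance of a Labour lead is a one-line CDF evaluation. A sketch of the calculation (my reconstruction, not the author’s exact code):

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

sd_party = 0.5 / math.sqrt(309)     # one standard deviation per party (~2.8%)
lead     = 0.390 - 0.303            # the 8.7% lead in the unrounded figures
sd_lead  = sd_party * math.sqrt(2)  # independent votes: variances add

p_labour_ahead = normal_cdf(-lead / sd_lead)
print(f"{p_labour_ahead:.1%}")  # ~1.5%
```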

Interesting isn’t it? A clear Liberal Democrat lead of 9% has actually turned into a 12% chance that the lead is really less than 4% and even a probability of 1.5% that Labour is ahead in the constituency.

It actually gets more complex though. In the calculations above we’ve assumed that the Liberal Democrat and Labour votes are **independent**. This is a technical statistical term, but for our opinion poll example what it means is that so far we have assumed that the level of support for the Liberal Democrats and for Labour are entirely unconnected. In reality, if the support for one party goes up, it’s very likely that support for the other goes down. To put it another way, if the actual level of support for the Liberal Democrats is below 39%, there’s a good chance that the actual level of support for Labour will be above 30%.

This concept is called **covariance** and it makes our calculations a little bit more complicated. We can’t tell from the opinion poll what the covariance is between the two leading parties in Cambridge, let alone amongst all the candidates standing. What we can do is make some assumptions and see how that affects our figures.

Let’s assume complete negative covariance between the two parties (i.e. if the true vote for one is above the poll figure then the vote for the other will be correspondingly below the figure). Now we have an almost 6% chance of Labour being in the lead, a 12% chance that the Lib Dem lead is actually 2% or less, 20% that it is 4% or less, around 31.5% that it is 6% or less and 45% that it is 8% or less. To put it another way, by assuming complete negative covariance the chances that Labour are actually in the lead based on this poll are four times as high as when we treated the votes as independent.
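This is where the Monte Carlo simulations promised in the introduction come in. A sketch (my own, under the same assumptions as above): draw many plausible “true” vote pairs and count how often Labour comes out ahead, once with independent draws and once with perfect negative covariance:

```python
import math
import random

random.seed(42)

SD = 0.5 / math.sqrt(309)        # one standard deviation per party (~2.8%)
LD, LAB = 0.390, 0.303           # the unrounded poll shares
N = 200_000                      # number of simulated "true" outcomes

def simulate(negative_covariance):
    """Monte Carlo estimate of the probability that Labour is really ahead."""
    labour_ahead = 0
    for _ in range(N):
        z1 = random.gauss(0, 1)
        # Perfect negative covariance: a single shock moves the two parties
        # in opposite directions. Independence: two unrelated shocks.
        z2 = -z1 if negative_covariance else random.gauss(0, 1)
        if LAB + SD * z2 > LD + SD * z1:
            labour_ahead += 1
    return labour_ahead / N

p_independent = simulate(False)
p_negative    = simulate(True)
print(f"independent:         {p_independent:.1%}")
print(f"negative covariance: {p_negative:.1%}")
```

With these assumptions the independent case lands close to the 1.5% calculated earlier and the negatively correlated case a shade over the “almost 6%” in the text; the precise figures shift slightly depending on how the margin of error is rounded.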

The true level of covariance is probably somewhere between these two positions. One thing’s clear though: if you were to look at this poll and say “the Liberal Democrats have a 9% lead in the constituency” you would be wrong. What the poll actually tells you is that there is between a 94% and a 98.5% probability that the Liberal Democrats are in the lead in the constituency, that the probability of that lead being 6% or greater is somewhere between 68.5% and 75%, and so on for any number of other statistical observations.

Is the lead exactly 9%? Well, the probability of that is only around 1.3%, so chances are the Liberal Democrats are NOT leading Labour by exactly 9%.

The next time you see an opinion poll headline, remember that the true picture is actually more complicated. A sample of a wider population can only really give you a view of the probabilities of the wider picture, not a definitive statement of what is and isn’t true.