Hey, Nate Silver. I got you: Understanding Polling and Forecasts
Anytime a new individual joins FleishmanHillard’s social and creative team in San Francisco, she/he is subjected to a fast and furious round of 20 Questions. The group tends to ask your standard cocktail party questions like, “Name your top 5 musicians of all time,” “Star Wars or Star Trek?” or “Are you a cat person or a dog person?” Occasionally, someone will toss out, “If you could have dinner with one person dead or alive, who would it be?”
If I had to answer this question today, I’d have my answer ready.
Give me a table at The Big 4 on San Francisco’s Nob Hill, and put Nate Silver and me in the back, at one of those mahogany booths with the deep green, leather seating. Nate has been a career idol. He is the rock star of the burgeoning data scene and is the most famous statistician to predict the outcome of presidential elections based on data analysis.
Wolf Blitzer began to make the country nervous as the presidential election results started flowing in Tuesday night. Why? As the Magic Screen started to turn Trump-red, America began to scratch its head. How can the outcome of this election be so very different from what FiveThirtyEight editor Nate Silver (and many of my peers) predicted?
According to Factiva, Nate Silver has been mentioned in the English global press almost 3,000 times in the past month – and almost half of those press mentions occurred in the past week. To be clear, these press mentions have not exactly been extolling the brilliance of Silver’s forecasting prowess. Headlines like, “Pollsters wonder: How did calculations go so wrong,” and “Wrong again: Why did most opinion polls predict a Clinton victory?” are the rule, not the exception.
Hey, Nate Silver. I got you.
This is the moment: November 11th, 2016. Today is that perfect storm, granting me the ability to explain (and hopefully help you understand) the science behind polls and forecasts in a real way. I’ll answer some general questions and eventually explain why the press needs to leave my man Nate Silver alone. (No worries, Nate. I got you).
Debunking the myth of perfect data
First and foremost, there is no such thing as perfect data. All of us see numbers, statistics, and figures every day. We take them as authoritative and absolute – even though we should not. The hidden truth is that we as humans have created a way to measure and observe the world around us – and, well, we are not perfect. We make mistakes. Therefore, all data are fundamentally inexact and flawed.
Are all polls created equal?
No. They are not. Due to limited time, money, and resources, it is impossible to collect results on every person in a very large population – like the entire U.S. voting-age population. So we take samples. Ultimately, we care about the representativeness of the sample data and our ability to make inferences about a SPECIFIC population. A representative sample is one that accurately reflects the larger population.
What impacts sample representativeness?
Honestly, just about everything. The wording of the questions. Sample size. How the sample is chosen. How the survey is fielded (via phone, in person or online). The biases of the individual who writes the survey. The biases of the person who analyzes the results. The hidden biases of the individuals who take the survey. Specifically, a concept known as herd behavior comes to mind. Herd behavior is “The tendency for individuals to mimic the actions (rational or irrational) of a larger group.” However, when separated from the herd, most people would not necessarily make the same decision as the larger group. While you might believe that a decision or idea is irrational or incorrect, you might still follow the herd, thinking that they have additional information you are missing. Herd behavior is very common in unfamiliar circumstances or situations in which people have very little knowledge or experience – such as in voting for a president.
Data freshness also impacts whether or not a sample is representative. Researchers, statisticians and scientists who conduct opinion polls like their data to be fresh – and, just like a new can of tennis balls, data can become stale and lose its representative bounce. Most importantly, as the size of the original audience/population gets smaller, it becomes increasingly difficult to ensure that the sample includes the correct mix of individuals with different races, ages, education, and economic status to be representative. This is where the underlying assumptions of the math start to flat-out break.
The Redemption of Nate Silver
Nate and the folks over at FiveThirtyEight are beyond supportive of the data science community and have posted the polling data used to produce their 2016 Election Forecast over at Kaggle (Kaggle is a platform on which companies/researchers post data so folks like me can compete to produce the best models).
So I downloaded Nate’s polling data, scraped CNN’s website, overlaid actual election results, blended in the 52 tables of the U.S. Census Bureau’s Electorate Profiles, and did a little Mr. Wizard magic math to glean a better understanding of just how accurate Nate Silver’s election forecast is. I’m awesome like that.
Let’s talk about the election polling data Nate used to forecast the election results. There were 189 unique vendors out there that conducted 2016 presidential election polls across 51 states/territories. When you consider that the U.S. Census Bureau estimates the number of Americans in the voting-age population to be 227,019,486 (+/-0.1%), a sample size of 40,816 (the largest sample in all of the nationwide polls) seems small. While that’s less than 0.018% of the total U.S. voting-age population, mathematical and probabilistic theory tells us that as long as we don’t violate specific conditions like representativeness, this is a sufficient sample size to estimate the preferences of the total U.S. voting population.
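To put rough numbers on that claim, here is a minimal sketch using the textbook margin-of-error formula for a proportion. It assumes a simple random sample and a 95% confidence level, which real polls only approximate; the function name `margin_of_error` is mine, not anything from FiveThirtyEight.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample; p=0.5 is the worst case and
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)

VOTING_AGE_POPULATION = 227_019_486  # Census estimate quoted above
LARGEST_SAMPLE = 40_816              # largest national poll sample quoted above

sample_share = LARGEST_SAMPLE / VOTING_AGE_POPULATION
moe = margin_of_error(LARGEST_SAMPLE)

print(f"Share of the population sampled: {sample_share:.4%}")
print(f"Approximate margin of error: +/- {moe:.2%}")
```

Under those assumptions the margin of error comes out to roughly half a percentage point, even though the sample is a sliver of the population. The catch is the assumption itself: the formula says nothing about a sample that is not representative.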
Each vendor has its own reputation for accuracy. So how do we incorporate that information into our forecast model to make it more accurate? Just like my favorite professor taught me, Nate and the FiveThirtyEight team gave a reputation grade to each vendor. They assigned a grade for sample size: smaller samples scored lower, and larger samples received higher marks because they are more likely to be representative of the population. Also, state polls were graded higher than national polls, because they are more representative of each state’s specific voter population. Lastly, each poll was allotted a data freshness grade, with the most recent polls earning the highest rankings.
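Here is one way grades like these could be folded into a single weight per poll. To be loud about it: this is a hypothetical sketch, not FiveThirtyEight's actual formula. The square-root size score, the 14-day half-life, the 1.5x state-poll bonus, and the two example polls are all made-up assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Poll:
    clinton_pct: float   # Clinton's share in the poll, in percent
    sample_size: int     # number of respondents
    days_old: int        # days between the poll and the forecast
    is_state_poll: bool  # True for a state poll, False for a national poll

def poll_weight(poll):
    """Hypothetical weight; NOT FiveThirtyEight's real formula."""
    size_score = poll.sample_size ** 0.5              # bigger samples score higher
    freshness_score = 0.5 ** (poll.days_old / 14)     # assumed 14-day half-life
    scope_score = 1.5 if poll.is_state_poll else 1.0  # assumed state-poll bonus
    return size_score * freshness_score * scope_score

def weighted_average(polls):
    """Average Clinton's share across polls, weighted by poll_weight."""
    weights = [poll_weight(p) for p in polls]
    weighted_sum = sum(w * p.clinton_pct for w, p in zip(weights, polls))
    return weighted_sum / sum(weights)

# Two invented polls: a big, fresh state poll and a small, stale national one.
polls = [
    Poll(clinton_pct=48.0, sample_size=1600, days_old=0, is_state_poll=True),
    Poll(clinton_pct=44.0, sample_size=400, days_old=28, is_state_poll=False),
]

simple_average = sum(p.clinton_pct for p in polls) / len(polls)
print(f"Simple average:   {simple_average:.2f}")
print(f"Weighted average: {weighted_average(polls):.2f}")
```

The weighted average lands much closer to the big, fresh state poll than the simple average does, which is the whole point of grading the polls.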
Much like your high school GPA, these scores were combined, and then used to weight the polling results. This means the best poll data – that is, the data from the biggest sample sizes, from state polls and from the most recent polls – was given more weight in the analysis than the “low-grade” data. Various other mathematical adjustments were made; the final step was to simulate the election to reduce uncertainty in the forecast. Put simply, simulation randomly draws samples from a distribution so we can better estimate the behavior of a population. Think about this technique as filling up a mason jar of Skittles from the 10-pound bag 20,000 times and sorting them by color each and every time. Your estimate of the Skittles’ color distribution would be much more accurate based on 20,000 samples than it would based on just one.
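The Skittles thought experiment is easy to try for yourself. The sketch below is only an illustration of simulation by repeated sampling: the "true" color mix and the 50-Skittle jar size are numbers I made up, and the seed is fixed so the run is repeatable.

```python
import random

# Assumed "true" color mix of the 10-pound bag (invented for illustration).
TRUE_MIX = {"red": 0.25, "orange": 0.20, "yellow": 0.20, "green": 0.20, "purple": 0.15}
COLORS = list(TRUE_MIX)
WEIGHTS = list(TRUE_MIX.values())

def fill_jar(rng, jar_size=50):
    """Draw one jar of Skittles and return the share of each color in it."""
    jar = rng.choices(COLORS, weights=WEIGHTS, k=jar_size)
    return {color: jar.count(color) / jar_size for color in COLORS}

def estimate_mix(rng, n_jars):
    """Average the color shares across n_jars simulated jars."""
    totals = dict.fromkeys(COLORS, 0.0)
    for _ in range(n_jars):
        for color, share in fill_jar(rng).items():
            totals[color] += share
    return {color: total / n_jars for color, total in totals.items()}

rng = random.Random(538)  # fixed seed so the run is repeatable
one_jar = estimate_mix(rng, 1)
many_jars = estimate_mix(rng, 20_000)

print("Estimate from 1 jar:      ", one_jar)
print("Estimate from 20,000 jars:", {c: round(s, 3) for c, s in many_jars.items()})
```

A single jar can easily be off by several percentage points on any one color, while the 20,000-jar average settles very close to the assumed mix.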
Finally, FiveThirtyEight made their forecast (see below).
As I watched CNN’s poor John King frantically diving down into various state, county and city cuts of the data in an attempt to calm Wolf Blitzer down, I too wondered where my idol went wrong. My first thought was that Nate must have underestimated the amount of third- and even fourth-party votes. However, if you look at Nate’s prediction of the popular vote next to the election results, you see that his prediction for Hillary Clinton was less than half a percentage point off, which is surprisingly accurate. What emerges is a slight overestimation of third- and fourth-party votes (over by only 2.5%) that Donald Trump ended up capturing. However, since we elect our president based on Electoral votes, the states where Trump won this 2.5% are key.
What’s more, Nate actually ran a sensitivity analysis and calculated the likelihood of each state’s outcome tipping the election. (Nate is my man!) When you compare Nate’s analysis to the states that actually tilted the Electoral College toward Trump, you get something frighteningly accurate: Florida, Pennsylvania, Michigan, North Carolina and Wisconsin. All of these states are very close to the centerline, which, as Nate explains, means a tight race. So yes, he accurately predicted the tight races and the states that would change the winner of the election.
If you look at Nate’s state-by-state predictions, you also see that the states that tipped toward Trump had some of the lowest probabilities of a Clinton win. Predicting the outcome of Florida and North Carolina, for instance, would be like predicting a coin toss.
What exactly led to Nate’s 71% Clinton, 29% Trump prediction and the high chances of specific states tipping the election?
Non-representative data. I’m not pointing fingers or calling out any specific pollsters, so let’s look at a few factors of these “battleground states” that might lead to non-representative data:
- Florida’s Hispanic/Latino population is large – among the 10 largest Hispanic/Latino populations in the nation. The sample sizes for Florida polling data, however, were some of the smallest in the country.
- Similarly, polling samples were very low for the state of Pennsylvania, which is 10% Black/African American and 5% Hispanic/Latino.
- Michigan polling data had an average sample size of 860 individuals. Trump won the state of Michigan by an extremely narrow margin of 0.3% or 12,500 votes.
- North Carolina is roughly 28% non-white, which is higher than the national average of 20% non-white. Additionally, North Carolina has a slightly lower population of college graduates and a slightly higher population of people living below the poverty level. These populations are much harder to reach via surveys and their voter turnout is difficult to predict.
- Wisconsin is statistically 10% more White-Only than the rest of the country and therefore has lower populations of Black/African Americans and Hispanic/Latinos. Individuals in Wisconsin are also slightly less likely to hold a bachelor’s degree or higher. Once again these are difficult populations to poll and predict.
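The Michigan numbers in the list above show why small samples matter. This is a rough sketch using the textbook margin-of-error formula, assuming a simple random sample and a 95% confidence level. The 5% subgroup at the end is a hypothetical case I added to show how quickly a slice of a small sample runs out of respondents.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion (simple random sample)."""
    return z * math.sqrt(p * (1 - p) / n)

AVG_MICHIGAN_SAMPLE = 860      # average Michigan sample size quoted above
TRUMP_MICHIGAN_MARGIN = 0.003  # Trump's 0.3% winning margin quoted above

moe = margin_of_error(AVG_MICHIGAN_SAMPLE)
print(f"Margin of error for n={AVG_MICHIGAN_SAMPLE}: +/- {moe:.1%}")
print(f"That is about {moe / TRUMP_MICHIGAN_MARGIN:.0f}x the actual winning margin")

# Hypothetical subgroup: a group that makes up 5% of an 860-person sample.
subgroup_n = round(0.05 * AVG_MICHIGAN_SAMPLE)
subgroup_moe = margin_of_error(subgroup_n)
print(f"Subgroup of {subgroup_n} respondents: +/- {subgroup_moe:.1%}")
```

Under these assumptions a typical Michigan poll carries a margin of error of a bit over three percentage points, far wider than the 0.3% that decided the state, and the uncertainty on a small subgroup inside that sample is several times wider again.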
What did Nate Silver do wrong?
Remember that, in the absence of data to better inform our election outcome, the probabilities for Clinton and Trump would be split evenly 50%/50%, heads or tails. Nate’s projection gives the probability of the two major party candidates, Hillary Clinton and Donald Trump, as 71.4% and 28.6% respectively.
Nate, I’m so sorry. Other than a few minor assumptions about how to allocate undecided votes, you’ve done nothing wrong. You did the best you could with data that is likely non-representative – after all, our models are only as good as the data on which we build them. We are the problem. We, the majority of Americans and the media, have misinterpreted your data. As practitioners of research and data science, we, your peers, have failed to educate our organizations, clients and even each other on how to understand survey and forecast data – leaving the vast majority of people with the impression that you predicted Clinton to win with 71% of the vote.
What is Nate Silver’s forecast really saying?
If we held the national presidential election 100 times, Nate Silver’s data analysis would lead him to predict that Hillary would win 71 of those elections and Donald would win 29 of those elections. It just so happens that the United States randomly pulled one of the 29 elections that Trump actually won out of the proverbial hat (or mason jar of Skittles). This doesn’t make Nate Silver’s forecast bad, wrong, or fundamentally flawed in any way. It’s just a function of the adage, “Just because it is not probable, doesn’t mean it is not possible.”
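You can replay that hat-pulling yourself. The sketch below treats each election as a single weighted coin flip with the 71.4% figure quoted earlier. That is a big simplification on my part: the real forecast simulates state-by-state outcomes rather than one national coin.

```python
import random

CLINTON_WIN_PROBABILITY = 0.714  # forecast probability quoted earlier

def simulate_elections(n_elections, rng):
    """Return (Clinton wins, Trump wins) across n_elections weighted coin flips."""
    clinton_wins = sum(
        rng.random() < CLINTON_WIN_PROBABILITY for _ in range(n_elections)
    )
    return clinton_wins, n_elections - clinton_wins

rng = random.Random(2016)  # fixed seed so the run is repeatable

clinton, trump = simulate_elections(100, rng)
print(f"100 elections: Clinton wins {clinton}, Trump wins {trump}")

clinton_big, trump_big = simulate_elections(100_000, rng)
print(f"100,000 elections: Trump wins {trump_big / 100_000:.1%} of the time")
```

Any single batch of 100 will wobble around the 71/29 split, but over many runs Trump wins a bit under three times in ten. Unlikely is not the same as impossible.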
Nate did the best he could with the information available prior to the actual Presidential Election. That said, the pollsters have some room for improvement – including incorporating larger sample sizes, new techniques for measuring the opinions of hard-to-reach populations, and the measurement of herd behavior.