Elections. Predictive Analytics. Sampling.
How come most pollsters and exit polls badly misjudged Donald Trump’s victory in the US presidential election? While “Data is Dead”, “Big Data Ineffective”, “Lack of contextual data”, and many other explanations are being discussed, I believe one issue that has not been talked about much (at least in the last 24 hours) is the sampling behind predictive analytics. Focusing only on sampling might oversimplify the complex US electoral process, but this post looks at sampling because it is one of the key principles of predictive analytics.
Assuming we have “quality” data, predictive analytics, whether in politics, business, or any other situation, depends on good sampling. Fundamentally, sampling is a key factor in inferential or predictive analytics because a sample is used to make inferences about the population. Sampling is based on three main elements: the sample size, the randomization in the selected sample data set, and finally the statistical significance or p-value in the analysis.
1. Sample size
Sample size is the number of observations to include in the statistical tests. The sample size is dependent on three key factors.
- Population Size. It is the approximate total size of the population. For the US, that is about 200 million registered voters.
- Margin of Error, i.e. the half-width of the confidence interval, or the amount of uncertainty associated with an estimate from the sample. It is usually set at 3%.
- Confidence Level. It is a measure of how certain you are that the sample accurately reflects the population within the margin of error. The confidence level in most cases is 95%.
Applying these numbers in a sample size calculator (I used the tool from SurveyMonkey), a good sample size for predicting the election outcome comes to 1,068. The RAND Presidential Election Panel Survey (PEPS), which is one of the few surveys that got the prediction right, used a representative sample of around 3,000 voters.
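As a rough sketch of the arithmetic behind that number, the snippet below uses the standard sample size formula for a proportion with a finite population correction. I am assuming this is close to what the online calculator does; the function name `sample_size` and the worst-case proportion of 0.5 are my own choices for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, margin_of_error, confidence_level, proportion=0.5):
    """Estimate the sample size needed to measure a proportion.

    Assumption: Cochran's formula with a finite population correction.
    proportion=0.5 is the most conservative (largest sample) choice.
    """
    # z-score for a two-sided interval, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively infinite population
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction (barely matters for 200 million voters)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Inputs from the post: 200 million voters, 3% margin of error, 95% confidence
n = sample_size(200_000_000, 0.03, 0.95)
print(n)  # 1068
```

Note how little the population size matters once it is large: the required sample is driven almost entirely by the margin of error and the confidence level.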
2. Randomization
But sample size is just one factor. The next key aspect is selecting the data in such a way that proper randomization is achieved. Basically, randomization ensures that the sample obtained is representative of the population to be analyzed and is not biased in a systematic manner. The three most common sampling designs for randomization are:
- Simple random sampling. Here the sample is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.
- Stratified random sampling. A stratified random sample is obtained by taking samples from each stratum or sub-group of a population.
- Multistage random sampling. In multistage random sampling, the sampling is carried out in stages, using smaller and smaller sampling units at each stage.
Given the complexity of the demographics (urban/rural, age, race, etc.) and the size of the US, randomization has to be based on stratified and multistage random sampling. Again, PEPS used a representative demographic sample and asked the same people for their opinion repeatedly over time.
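To make the stratified design concrete, here is a minimal sketch that draws a simple random sample inside each stratum, with each stratum's share of the sample matching its share of the population (proportional allocation). The voter list, the urban/rural split, and the function name `stratified_sample` are all made up for illustration; real polls stratify on many more variables.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a proportionally allocated stratified random sample.

    population  : list of members (any objects)
    stratum_of  : function returning the stratum label of a member
    sample_size : target total sample size (rounding per stratum may
                  shift the final total by a few members)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group the population into strata
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)

    sample = []
    for members in strata.values():
        # Each stratum gets a share of the sample equal to its population share
        k = round(sample_size * len(members) / len(population))
        # Simple random sampling without replacement inside the stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical voter list: 70% urban, 30% rural
voters = [{"id": i, "area": "urban" if i % 10 < 7 else "rural"}
          for i in range(10_000)]

sample = stratified_sample(voters, lambda v: v["area"], 1000)
urban_share = sum(v["area"] == "urban" for v in sample) / len(sample)
print(len(sample), urban_share)
```

A simple random sample of the same size would only match the 70/30 split on average; the stratified draw matches it by construction, which is the point of the design.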
3. Statistical significance
So the right sampling, i.e. a randomized sample of adequate size, decreases the sampling error and increases the chance of getting a good prediction. The outcome can then be supported with statistical significance, i.e. the probability value or p-value, which is tied to the null hypothesis. Note that a low p-value (conventionally below 0.05) means the observed result would be unlikely if the null hypothesis were true, so it lends support to the prediction.
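As an illustration of how a p-value ties back to a null hypothesis, the sketch below runs a two-sided one-sample z-test on a poll result. The null hypothesis is that the race is tied (true support of 50%). The poll numbers (570 of 1,068 respondents favoring a candidate) are invented for the example, and the function name `proportion_p_value` is my own.

```python
import math
from statistics import NormalDist


def proportion_p_value(successes, n, p0=0.5):
    """Two-sided one-sample z-test for a proportion.

    Null hypothesis: the true proportion equals p0.
    Returns the z statistic and the p-value.
    """
    p_hat = successes / n                   # observed proportion in the sample
    se = math.sqrt(p0 * (1 - p0) / n)       # standard error under the null
    z = (p_hat - p0) / se
    # Probability of a result at least this extreme if the null were true
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical poll: 570 of 1,068 respondents favor candidate A
z, p = proportion_p_value(successes=570, n=1068)
print(f"z = {z:.2f}, p-value = {p:.3f}")

# For these made-up numbers the p-value falls below 0.05, so a tied race
# would be an unlikely explanation for the observed lead.
```

Keep in mind that the test only accounts for random sampling error. It says nothing about a sample that is biased in a systematic manner, which is why randomization comes first.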
Sampling is the key to predictive analytics. Any misrepresentation of the population in the sample can lead to unreliable results.
So What is the Advice?
Let us not blame “Big Data” or even technologies as the main culprit for this poor prediction; the ruling BJP in India effectively used Big Data to come to power in 2014. Certainly the Indian electoral scene is no less complex than the American one. Finally, “Rules before Tools”: let us put process before technologies. The process should be based on a solid null hypothesis, sample size, and randomization techniques.
Let me know your thoughts. I'm here to learn.
*****************************************************************************************
Prashanth Southekal is a technology professional who understands what it takes to run efficient technology based solutions, processes, and organizations. He brings over 20 years of experience in Information Management from companies such as SAP AG, Accenture, Deloitte, P&G, and General Electric. Prashanth has published 2 books on Information Management and he is currently working on his 3rd book - "Data for Business Performance".
*****************************************************************************************
Read other popular posts from Prashanth.
- Data Isn't Everything - https://www.dhirubhai.net/pulse/data-isnt-everything-prashanth-southekal-phd-pmp?trk=mp-author-card
- Isn’t Real-time Analytics an Oxymoron? - https://www.dhirubhai.net/pulse/isnt-real-time-analytics-oxymoron-prashanth-southekal-phd-pmp?trk=mp-author-card
- Why are CDOs Challenged? - https://www.dhirubhai.net/pulse/why-cdos-challenged-prashanth-southekal-phd?trk=mp-author-card
- Data Driven Enterprise (DDE) – The Failure Patterns - https://www.dhirubhai.net/pulse/data-driven-enterprise-dde-failure-patterns-prashanth-southekal-phd?trk=mp-author-card
Director-Technology, Strategy & Transformation (SAP)
8 years ago
I am sure the elite publications each use their own predictive analytics model. So The New York Times, The Economist, and The Washington Post have their predictions skewed towards Hillary, and low-profile magazines like The Sun have their predictions skewed towards Trump. So now you know to what extent their allegiance is skewed. So, more than users' answers to questions, it becomes important to gauge user behavior on social media as the right tool for prediction. A few weeks before the elections, Facebook tracked the behavior of users and predicted Donald Trump sweeping the polls, and it seemed to be bang on. So the old models of predictive analytics and forecasting/simulation techniques are dead. Users answer differently, but their behavior cannot be changed, and hence there is a need for this component to be baked into the models.
Supply Chain Professional | Artist | Data Analytics |
8 years ago
Nice one, Prashanth. The US election is of a different type, with both a popular vote and an electoral vote. This is tricky, and this aspect needs to be considered in the statistical analysis. As you said, sample size across different regions, genders, and ages matters.
Enterprise Architect | Business & IT
8 years ago
Basics matter... always!