
Elections. Predictive Analytics. Sampling.

Prashanth H Southekal, PhD, MBA, ICD.D

Published: November 10, 2016

How did most pollsters and exit polls so badly misjudge Donald Trump’s victory in the US presidential election? While “Data is Dead”, “Big Data Ineffective”, “Lack of contextual data”, and many other explanations are being discussed, I believe one issue that has not been talked about much (at least in the last 24 hours) is the sampling behind predictive analytics. Focusing only on sampling may oversimplify the complex US electoral process, but this post looks at sampling because it is one of the key principles of predictive analytics.

Assuming we have “quality” data, predictive analytics, whether in politics, business, or any other setting, depends on good sampling. Fundamentally, sampling is a key factor in inferential or predictive analytics because it is used to make inferences about the population. Sampling rests on three main elements: the sample size, the randomization in the selected sample data set, and the statistical significance (p-value) of the analysis.

1. Sample size

Sample size is the number of observations to include in the statistical tests. It depends on three key factors.

Population Size. The approximate total size of the population. For the US, that is roughly 200 million registered voters.
Margin of Error. The amount of uncertainty associated with a sample, i.e. the half-width of the confidence interval around the estimate. It is usually set at 3%.
Confidence Level. A measure of how certain you can be that the true population value falls within the margin of error of the sample estimate. In most cases it is set at 95%.

Plugging these numbers into a sample size calculator (I used the SurveyMonkey tool) gives a sample size of 1,068 for predicting the election outcome. The RAND Presidential Election Panel Survey (PEPS), one of the few surveys that got the prediction right, used a representative sample of around 3,000 voters.
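As a rough check on the calculator's figure quoted above, here is a minimal sketch of the standard sample size formula for a proportion (Cochran's formula with a finite population correction). It assumes the most conservative proportion of 0.5. The function name `sample_size` and its parameters are my own, and the online calculator may not compute the number in exactly this way.

```python
import math
from statistics import NormalDist


def sample_size(population: int, margin_of_error: float,
                confidence_level: float, proportion: float = 0.5) -> int:
    """Sample size for estimating a proportion (illustrative helper)."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction; negligible when the population is huge.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Inputs from the post: 200 million voters, 3% margin, 95% confidence.
required = sample_size(200_000_000, 0.03, 0.95)
print(required)  # 1068
```

Note that the population size barely matters at this scale: the result is driven almost entirely by the margin of error and the confidence level.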

2. Randomization

But sample size is just one factor. The next key aspect is selecting the data in such a way that proper randomization is achieved. Basically, randomization ensures that the sample obtained is representative of the population to be analyzed and is not biased in a systematic manner. The three most common sampling designs for randomization are:

Simple random sampling. Here the sample is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.
Stratified random sampling. A stratified random sample is obtained by taking samples from each stratum or sub-group of a population.
Multistage random sampling. In multistage random sampling, the sampling is carried out in stages, using smaller and smaller sampling units at each stage.

Given the complexity of the demographics (urban/rural, age, race, etc.) and the size of the US, randomization has to be based on stratified and multistage random sampling. Again, the PEPS survey used a representative demographic sample and asked the same people for their opinion repeatedly over time.
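To make the stratified design concrete, here is a minimal sketch of stratified random sampling with proportional allocation, using only the Python standard library. The toy population, the `area` field, and the function name `stratified_sample` are hypothetical and for illustration only. This is not how PEPS or any real pollster builds its panel.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a stratified random sample with proportional allocation.

    population  : list of records
    stratum_of  : function mapping a record to its stratum label
    sample_size : approximate total number of records to draw
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in population:
        strata[stratum_of(record)].append(record)

    sample = []
    for members in strata.values():
        # Each stratum contributes in proportion to its population share
        # (rounded, so the total can differ slightly from sample_size).
        k = round(sample_size * len(members) / len(population))
        # Simple random sampling without replacement inside the stratum.
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical toy population: 70% urban and 30% rural voters.
voters = [{"id": i, "area": "urban" if i % 10 < 7 else "rural"}
          for i in range(10_000)]
picked = stratified_sample(voters, lambda v: v["area"], sample_size=1_000)
urban_share = sum(v["area"] == "urban" for v in picked) / len(picked)
print(len(picked), urban_share)
```

Because each stratum is sampled in proportion to its size, the urban/rural split of the sample matches that of the population by construction, which simple random sampling only achieves on average.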

3. Statistical significance

So the right sampling, i.e. a properly randomized sample of adequate size, decreases the sampling error and increases the chance of getting a good prediction. The outcome can be supported with statistical significance, i.e. the probability value or p-value, which is tied to the null hypothesis. Note that a low p-value (conventionally below 0.05) means the observed result would be unlikely if the null hypothesis were true, which lends support to the prediction.
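As an illustration of how a p-value ties back to a null hypothesis, here is a minimal sketch of a two-sided one-sample z-test for a proportion. The null hypothesis is that the race is tied (a true share of 0.5). The poll numbers are made up, and the function name `proportion_z_test` is my own.

```python
import math
from statistics import NormalDist


def proportion_z_test(successes: int, n: int, p0: float = 0.5):
    """Two-sided one-sample z-test for a proportion against p0."""
    p_hat = successes / n
    # Standard error of the sample proportion under the null hypothesis.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Probability of a result at least this extreme if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical poll: 560 of 1,068 respondents favour candidate A.
z, p_value = proportion_z_test(560, 1068)
print(round(z, 2), round(p_value, 3))
```

For these made-up numbers the p-value comes out above 0.05, so a lead of about 52% to 48% in a sample of this size is still within what sampling noise alone could produce.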

Sampling is the key to predictive analytics. Any misrepresentation of the population in the sample can lead to unreliable predictions.

So What is the Advice?

Let us not blame “Big Data” or even technologies as the main culprit for this poor prediction, when the ruling BJP in India effectively used Big Data to come to power in 2014. Certainly the Indian electoral scene is no less complex than the American one. Finally, “Rules before Tools”: let us put process before technologies. The process should be based on a solid null hypothesis, sample size, and randomization techniques.

Let me know your thoughts. I'm here to learn.

*****************************************************************************************

Prashanth Southekal is a technology professional who understands what it takes to run efficient technology-based solutions, processes, and organizations. He brings over 20 years of experience in Information Management from companies such as SAP AG, Accenture, Deloitte, P&G, and General Electric. Prashanth has published two books on Information Management and is currently working on his third book, "Data for Business Performance".

First Book in Amazon

Second Book in Amazon

*****************************************************************************************

Read other popular posts from Prashanth.

Data Isn't Everything - https://www.dhirubhai.net/pulse/data-isnt-everything-prashanth-southekal-phd-pmp?trk=mp-author-card
Isn’t Real-time Analytics an Oxymoron? - https://www.dhirubhai.net/pulse/isnt-real-time-analytics-oxymoron-prashanth-southekal-phd-pmp?trk=mp-author-card
Why are CDOs Challenged? - https://www.dhirubhai.net/pulse/why-cdos-challenged-prashanth-southekal-phd?trk=mp-author-card
Data Driven Enterprise (DDE) – The Failure Patterns - https://www.dhirubhai.net/pulse/data-driven-enterprise-dde-failure-patterns-prashanth-southekal-phd?trk=mp-author-card

Srinivas Radhakrishna

Director-Technology, Strategy & Transformation (SAP)

8 years ago

I am sure each of the elite publications uses its own predictive analytics model. The New York Times, The Economist and The Washington Post had their predictions skewed towards Hillary, while lower-profile publications like The Sun had theirs skewed towards Trump. So you now know to what extent their allegiance is skewed. More than users' answers to questions, it becomes important to gauge user behavior on social media as the right tool for prediction. A few weeks before the election, Facebook tracked the behavior of users and predicted Donald Trump sweeping the polls, and it seemed to be bang on. So the old models of predictive analytics and forecasting/simulation techniques are dead. Users may answer differently, but their behavior cannot be changed, and hence this component needs to be baked into the models.

Guruprasad K.J, CPIM,CSCP,PMP

Supply Chain Professional | Artist | Data Analytics |

8 years ago

Nice one, Prashanth. US elections are of a different type, with both a popular and an electoral vote. This is tricky, and this aspect needs to be considered in the statistical analysis. As you said, sample size across different regions, genders, and ages matters.

Erik van der Voorden

Enterprise Architect | Business & IT

8 years ago

Basics matter...always!
