
Building a model? Here is the first question you should ask

Keith McNulty

Leader in Technology, Science and Analytics | Mathematician, Statistician and Psychometrician | Author and Teacher | Coder, Engineer, Architect

Published: February 18, 2019

Someone somewhere right now is building a model. Many, many people in fact. Whether for a business, an academic study or even personal interest, people have been using mathematics more and more to model real-world phenomena in order to generate insight or to make decisions on how to control or respond to those phenomena.

More recently - enabled by greater computing power - modelling has become more complex. Instead of a few cells in an Excel spreadsheet, models are being built on various platforms and in various programming languages. Some are based on small data, and some are based on huge data. The efforts to create them can range from a few hours to an iterative project lasting months or even years.

But often the creators of these models don't ask enough questions before they start. They can just jump into it without thinking. Grab some data, set up some formulas and you are off. I've learned over many years working in mathematics and statistics that the success of your model depends a great deal on the up-front thinking that goes into it before you even open a data file.

In particular there is one question which I always ask at the very beginning - and one which I believe analysts, data scientists and other modelers should always ask: Is my model supposed to be explanatory or predictive?

It's probably obvious from the words, but an explanatory model is created to help understand why something is happening. It can help answer questions like: why does this disease seem to occur in these types of people? What might have caused temperature surges? A predictive model is created to make predictions as accurately as possible regarding what will happen - it will answer questions like: how many people can we expect to visit this shopping mall tomorrow? How many votes will each political party get at the next election?

One way to illustrate this quite simply is to use the analogy of a lemonade stand owner. The lemonade stand owner would use an explanatory model to understand the reasons why her customers like her product, or why she has more customers in the middle of the day versus in the evening - she's basically interested in the lemonade and why it sells. However, if her main aim is to make sure she has enough lemons for the rest of the week, she would use a predictive model to help her with that.

Models are rarely able to achieve both goals optimally. I don't think I have ever built a model that is both great at explaining a phenomenon and equally great at predicting that phenomenon. And there are good reasons for this. In this article I will lay out how this choice affects every part of how you build a model, starting with the initial data inputs all the way to how you measure its effectiveness.

1. Choices of input data (one off or repeated use)

If the model is to be explanatory, then the modelling process is to happen only once, or on occasion in the future. The priority is to get the deepest possible understanding of the question. Therefore no data source is out of scope. Data that is poorly formatted and needs substantial cleaning can go on this list. Even old data that does not exist electronically and is still in filing cabinets could be considered for digitization in an effort to be as exhaustive as possible. Equally, certain data might be removed from the model for the purpose of unearthing deeper explanatory variables. In a medical model, age might be removed because it is a known factor in disease susceptibility and it might dominate the model and disguise other important factors.

A predictive model is designed to be run again and again so that the relationship identified in the training set can be utilized to make predictions based on new data that is fed into the model. Therefore the data is selected primarily based on how available it will be to run through models in the future. In many modern day contexts this often means predictive models are restricted to only use data that are in connected sources, readily available and pre-formatted to work with the model. In addition, usually the primary goal is accurate prediction, and so any data that helps improve the accuracy of the prediction is in play (although there should usually be a healthy discussion on the trade off between accuracy and inductive bias in predictive models).
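To make the "fit once, run again and again" idea concrete, here is a minimal sketch using the lemonade stand. It fits a straight line by ordinary least squares and then reuses the fitted model on new inputs. The temperature and sales figures, and the function names fit_line and predict, are invented purely for illustration and are not from any real data set or library.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, new_x):
    """Apply an already fitted model to new inputs."""
    slope, intercept = model
    return [intercept + slope * xi for xi in new_x]


# Hypothetical training set: midday temperature (C) and cups sold.
temperatures = [20, 25, 30, 35]
cups_sold = [40, 50, 60, 70]

# The relationship is learned once from the training set...
model = fit_line(temperatures, cups_sold)

# ...and then reused whenever new data arrives, e.g. a weather forecast.
forecast = predict(model, [28, 32])
print(forecast)
```

Note that the only input this model needs in the future is a temperature forecast, which is exactly the kind of readily available data a predictive model tends to be restricted to.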

2. Modelling techniques used (interpretable or 'black box')

For an explanatory model, modelling techniques that lend themselves well to interpretation are critical. Control of insight is of supreme importance in an explanatory model. In Logistic Regression, odds ratios can help us understand the degree to which an input variable influences the dependent variable. Simpler decision tree models can have useful explanatory purpose, because they can help identify and quantify the impact of certain decision points on the result.
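As a small illustration of why odds ratios are easy to interpret, the sketch below computes one directly from a 2x2 table of counts. For a logistic regression with a single binary input variable, the exponentiated coefficient on that variable equals this same ratio. The counts and the function name odds_ratio are hypothetical and chosen only to keep the arithmetic simple.

```python
import math


def odds_ratio(exposed_cases, exposed_noncases,
               unexposed_cases, unexposed_noncases):
    """Odds of the outcome in the exposed group divided by the odds
    of the outcome in the unexposed group."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Hypothetical counts, invented for illustration only:
# 30 of 100 exposed people have the outcome, 10 of 100 unexposed do.
ratio = odds_ratio(30, 70, 10, 90)

# The log of the odds ratio is on the same scale as a logistic
# regression coefficient for a single binary input variable.
log_odds_coefficient = math.log(ratio)

print(round(ratio, 2))
print(round(log_odds_coefficient, 2))
```

An odds ratio above 1 says the odds of the outcome are higher in the exposed group, and its size tells you by how much, which is the kind of direct statement an explanatory model needs to support.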

Predictive modelling has little regard for interpretability. You may have heard the term 'black box model' to describe a model which maximizes predictive power but is far too complex in nature to tease out the influence of the individual input factors. Neural networks are quite common black box models. They are highly complex under the hood and make decisions based on many hundreds or thousands of simulated and interconnected neurons, each one acting on a behavior learned from the training set.

3. Measuring the performance of the model (fit versus accuracy)

Explanatory models are judged primarily by the insights they produce and their overall goodness of fit. The goodness of fit is a measure of the closeness between the expected values of the dependent variable and the actual observed values. It is possible, and indeed quite common, for an explanatory model to generate valuable insights even if the overall fit is poor - this is often the case, for example, in the social sciences, the field I primarily work in. Typical measures used in the results of explanatory modelling include odds ratios, R-squareds (including pseudo R-squareds), chi-squared tests and G-tests.
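To show what "closeness between expected and observed values" means in practice, here is a minimal sketch of the usual R-squared calculation: one minus the ratio of the residual sum of squares to the total sum of squares. The observed and fitted values are made up for illustration, and the function name r_squared is my own.

```python
def r_squared(observed, fitted):
    """Share of the variance in the observed values that is
    accounted for by the fitted values."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Hypothetical observed values and the values a model expected.
observed = [3.0, 5.0, 7.0, 9.0]
fitted = [4.0, 4.0, 8.0, 8.0]

fit = r_squared(observed, fitted)
print(round(fit, 2))
```

A value close to 1 means the fitted values track the observations closely. A low value does not by itself make the insights from an explanatory model worthless.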

Predictive models live or die based on their accuracy. Accuracy measurement usually involves a calculation of the error in a regression model, or the tradeoff between true positives and false positives in a classification model. Measures such as mean absolute error and root mean squared error will typically be used to describe how well a regression model makes predictions. Precision, recall, the area under an ROC curve or the F1-score (for imbalanced models) are more typical measures used for evaluating predictive accuracy in a classification model.
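The accuracy measures named above are simple to compute by hand. The sketch below does so for a small regression example and a small binary classification example. All of the numbers are invented for illustration, and the classification function assumes there is at least one predicted positive and at least one actual positive, so it does not guard against division by zero.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but penalizes large errors more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def precision_recall_f1(actual, predicted):
    """Precision, recall and F1 for labels coded as 0 or 1."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical regression example: cups of lemonade sold per day.
actual_sales = [100, 120, 90, 110]
predicted_sales = [110, 115, 95, 100]
mae = mean_absolute_error(actual_sales, predicted_sales)
rmse = root_mean_squared_error(actual_sales, predicted_sales)

# Hypothetical classification example: 1 = sold out, 0 = did not.
actual_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(actual_labels, predicted_labels)

print(mae, round(rmse, 2))
print(precision, recall, f1)
```

In practice these measures should be computed on data the model did not see during training, since the point is to judge how well it predicts new cases.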

I have learned the habit over the years of putting myself in the shoes of the lemonade stand owner. Am I interested in the lemonade or the lemons? It's a really good habit which I hope you can pick up.

I lead McKinsey's internal People Analytics and Measurement function. Originally I was a Pure Mathematician, then I became a Psychometrician. I am passionate about applying the rigor of both those disciplines to complex people questions. I'm also a coding geek and a massive fan of Japanese RPGs.

All opinions expressed are my own and not to be associated with my employer or any other organization I am connected with.

Ashwin Elumalai

Senior Principal Data Services - Data Modeling & Architecture | Data Engineering & Analytics at Mr. Cooper Group

5 years ago

Perfect differentiation between explanatory and predictive models (based on regression). Thanks Keith for sharing.

Ludek Stehlik, Ph.D.

People & Data Scientist @Sanofi

5 years ago

Maybe this distinction between explanatory and predictive models will blur to some degree with rise of algorithms/packages like IML, LIME or DALEX that enable analysts to explain and understand inner workings even of "black-box" models. Shirin Glander has great presentation (https://goo.gl/vzXrSK) and series of articles about this topic ( https://goo.gl/vpq3cu ), both from ethical and technical perspective.


Mark Spivey

Helping us all "Figure It Out" (Explore, Describe, Explain), many Differentiations + Integrations at any time .

5 years ago

in this question of yours, does Explanation subsume both Exploration and Description ... i.e. the entire Scientific Method, or "how do we make sense of the world", and Prediction subsumes "so that we may act in it" (Prescriptive) ?

Marrein Agwaro, MS. MBA.

Workforce Analytics. People Analytics. Predictive Analytics

5 years ago

This is a great narrative! Thanks for sharing.

Kathi Enderes

5 years ago

This is great, Keith. One question to ask before this: what’s the problem I am trying to solve?
