To model or not to model

"Essentially, all models are wrong, but some are useful." - George E.P. Box.

Quite often, a data scientist is confronted with the question: why develop a model? The person asking usually wants a quick answer, and the presumably cumbersome process of developing a model looks to them like a procrastinating exercise rather than just providing that answer. While there is nothing wrong with providing an answer on the 'back of an envelope', there is always a risk of losing ground and consistency if no supporting model is developed. In this short article, we will try to understand what a model is and why one needs one.

Every one of us, especially those of us who drive a car, has at some point asked: at what time should I leave home if I want to arrive at a particular place at a certain time? The estimation seems very simple:

time = distance / speed

This formula is believed to have been known in Archimedes's time and was accepted in the 1300s by the Merton scholars at Oxford. The proportion between speed, time, and distance is a deterministic physical model of uniform motion. Whenever one wants to calculate anything relating time, distance, and speed, one uses this very ancient model. There is no need to reinvent or derive it again; it is always there, and it demonstrates very clearly how time, speed, and distance interact with each other.
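For the record, here is a minimal sketch of this deterministic model in Python (the function and variable names are ours, purely for illustration):

```python
from datetime import datetime, timedelta

def departure_time(arrive_by: datetime, distance_km: float, speed_kmh: float) -> datetime:
    """Latest departure time under the uniform-motion model: time = distance / speed."""
    travel = timedelta(hours=distance_km / speed_kmh)
    return arrive_by - travel

# Example: 30 km at an average of 60 km/h means leaving 30 minutes early.
print(departure_time(datetime(2024, 1, 15, 9, 0), 30, 60))  # 2024-01-15 08:30:00
```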

So what is so special about models that makes our lives easier? There are a couple of things I would stress: structure and reusability.

Let us turn to another discipline which relies quite heavily on modelling: computer programming. When one learns to program, one's first programs are usually nothing but a file with a bunch of lines executed sequentially. While such a file provides a quick answer to an individual problem, reusing it for a slightly different formulation without rewriting everything from the beginning is almost impossible. That is the price of a quick answer: once provided, it becomes impossible to repeat, as it doesn't rest on any firm foundations. At some point, after recognising this drawback, one moves on to writing methods (functions), which, although they provide some reusability, are still not abstract enough to give adequate modelling possibilities; it is only after introducing interfaces and abstract classes that one reaches the full power of computer models. United into a structure by a UML diagram, these forms of abstraction provide a clear understanding of what should be done and, later on, allow reusability and repeatability, provided, of course, that the model is created and written coherently and lucidly. A minimal sketch of that final stage is given below.
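Here is a small sketch of what those abstractions buy us, using nothing beyond the standard library; the class and function names are illustrative, not a reference design:

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Abstract interface: every model fits data and makes predictions."""

    @abstractmethod
    def fit(self, x: list[float], y: list[float]) -> "Model": ...

    @abstractmethod
    def predict(self, x: float) -> float: ...

class MeanModel(Model):
    """Trivial concrete model: always predict the training mean."""

    def fit(self, x, y):
        self.mean = sum(y) / len(y)
        return self

    def predict(self, x):
        return self.mean

def evaluate(model: Model, x, y) -> float:
    """Works for *any* Model -- this is the reusability the interface buys."""
    model.fit(x, y)
    return sum((model.predict(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

print(evaluate(MeanModel(), [1, 2, 3], [2.0, 4.0, 6.0]))  # mean squared error
```

The point is that `evaluate` never needs to change: any future model that implements the interface can be dropped in.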

The general theory recognises different types of models: numerical models, computer simulation models, political models, econometric models, etc. While there is no exact mathematical definition of a statistical model, we would happily adopt the one provided by Peter McCullagh in his "What is a statistical model?". According to currently accepted theory, a statistical model is a set of probability distributions on the sample space. As opposed to a deterministic model, a statistical model introduces uncertainty, expressed through the probability distributions.
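To make that definition concrete, here is a sketch using scipy (the travel-time framing and all numbers are our own illustration): the Gaussian family, indexed by its mean and standard deviation, is exactly such a set of distributions, and picking one member turns the deterministic answer into a statement about uncertainty.

```python
from scipy import stats

def travel_time_model(mu_minutes: float, sigma_minutes: float):
    """One member of the statistical model: a distribution over travel times."""
    return stats.norm(loc=mu_minutes, scale=sigma_minutes)

member = travel_time_model(mu_minutes=30, sigma_minutes=5)
print(member.mean())       # 30.0 -- the old deterministic point estimate
print(1 - member.cdf(40))  # P(travel time > 40 min), roughly 0.023
```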

For the sake of simplicity, we won't differentiate here between Bayesian and traditional statistical modelling; since we are looking at modelling at a very high level, we should be fine with such a generalisation. We just urge the reader to remember that Bayesian modelling requires, at a minimum, one more component: a prior distribution expressing one's beliefs about the parameter under consideration.
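As a back-of-the-envelope sketch of that extra component (every number and distributional choice here is an assumption for illustration), one can put a prior on the average speed, observe a single trip, and compute the posterior on a grid:

```python
import numpy as np
from scipy import stats

speeds = np.linspace(20, 100, 401)              # candidate average speeds, km/h
prior = stats.norm(60, 10).pdf(speeds)          # beliefs before seeing any data
observed_time_h = 0.6                           # one observed 30 km trip took 36 min
likelihood = stats.norm(30 / speeds, 0.05).pdf(observed_time_h)
posterior = prior * likelihood                  # Bayes' rule, up to a constant
posterior /= posterior.sum()                    # normalise over the grid

print(speeds[np.argmax(posterior)])             # mode pulled from the prior's 60 towards the data's 50
```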

In our industry, we are mostly interested in econometric models, which the founding members of the Cowles Commission defined as "a branch of economics in which economic theory and statistical method are fused in the analysis of numerical and institutional data". In modern parlance, models that combine explicit economic theories with statistical models are called structural econometric models. On the other side of the modelling spectrum, we have 'reduced form' models. Under this umbrella fall statistical models which don't refer to any specific economic theory: for instance, autoregressive conditional volatility models, or a regression model built on some business-related covariates but without any explicit economic theory behind it. Either kind of model can, of course, be used for industrial purposes, although the use of non-structural models seems to be more prevalent.

A regression model explaining leads generated as a function of some marketing activity, an autoregressive forecasting model, a market evaluation of a newly launched product: these are just a few examples. In each of these and many other cases, by utilising model thinking we are able not only to obtain the desired numerical answer but also to see how the different parts of our model interact with each other. We can understand, and hopefully successfully simulate, relative and absolute changes in the observed variables (the dimensions of our model) and understand what might cause these changes. Most importantly, of course, having a model we are capable of replicating these findings and making testable predictions that can serve as hypotheses for any subsequent data analysis. A sketch of the first example follows.
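Here is a minimal sketch of the leads example, with simulated data standing in for real campaign figures (the coefficients and the linear form are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
spend = rng.uniform(1, 10, size=200)                 # campaign spend, $k
leads = 15 + 8 * spend + rng.normal(0, 5, size=200)  # 'true' process plus noise

# Ordinary least squares via numpy's least-squares solver.
X = np.column_stack([np.ones_like(spend), spend])
beta, *_ = np.linalg.lstsq(X, leads, rcond=None)
print(beta)                    # roughly [15, 8]: baseline leads and leads per extra $k

# The fitted model is reusable: the same two numbers answer 'what if' questions.
print(beta[0] + beta[1] * 12)  # predicted leads at a spend of $12k
```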

So what does one have to do to start thinking in modelling terms? It isn't difficult at all. Every time you face a problem, don't attempt to find an immediate solution, as weird as that sounds. First, think about which forces influence the problem: which of them might be variables and which might be constants. Then think about which of them are deterministic and which are stochastic; for a moment, for clarity, you may assume that everything is deterministic. At the next stage, express your problem in the dimensions of your variables; such a spatial representation helps you understand the relationships between the variables and visualise the problem. By then the solution, or at least the way of reaching it, should have become more or less evident. Now you can start adding distributions to the stochastic variables and abstract away the deterministic factors through reasonable and plausible assumptions. Later on, you might, of course, come back and revise your assumptions. Voilà, you have a model. Formalise it by writing it in mathematical notation or as computer code and obtain your solution. Just remember: while a solution to the given problem is necessary, it is far more important to be able to replicate that solution and to demonstrate how it might be affected if the underlying forces change, and this is practically impossible without careful formulation, that is, modelling. The sketch below walks through this recipe for the departure-time question we started with.
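Applied to that opening question, the recipe might look like this (the lognormal choice and every number are illustrative assumptions, not estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
distance_km = 30.0                         # deterministic: the route is fixed
speed_kmh = rng.lognormal(mean=np.log(55), sigma=0.15, size=100_000)  # stochastic: traffic varies
travel_min = 60 * distance_km / speed_kmh  # the ancient skeleton, time = distance / speed

# Leave this many minutes before the meeting to be on time in 95% of trips.
print(np.percentile(travel_min, 95))       # about 42, noticeably more than the naive 33
```

The deterministic formula is still there; the model merely wraps it in the uncertainty we chose to acknowledge.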

In conclusion, let me just quote George E.P. Box:

"Now, it would be very remarkable if any simple model could exactly represent any system existing in the real world. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an "ideal" gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore, its structure is informative since it springs from a physical view of the behaviour of gas molecules.

For such a model there is no need to ask the question "Is the model true?". If "truth" is to be the "whole truth" the answer must be "No". The only question of interest is "Is the model illuminating and useful?""

And this, folks, is all that often matters.

