What makes you a great Data Scientist? (I assume you already are a good one.) Part 1: The Analytical Strategy [9 min read]
Photo by Mizuno K: https://www.pexels.com/de-de/foto/mann-schreibtisch-laptop-buro-12903271/

You want to ruin the entire company?

Excellent! Then you'd better not use an analytical strategy. [a company] had to lay off 25% of its staff (which is a tragedy) due to a poor analytical strategy. So what happened?

We will come back to phenomena like this in future posts of this series.

But first, let's take a step back. What do we mean by an "analytical strategy"?

  • With an analytics strategy (or AI strategy), a company defines the steps and processes by which analytics use cases are generated, staffed, organized, and implemented in order to create added value. Today we are not talking about this particular kind of strategy.
  • We are talking about this: An analytical strategy is an approach to solving a certain business problem sustainably using data science. The context could be a single insight-generating analysis or even an entire data software product.

So why all the fuss about strategy? Data science doesn't sound that "strategic" at all. It seems more like a craft. Let's say you want to forecast "X": You build the features. You try out different machine learning methods more or less automatically. You implement the procedures in an ML lifecycle platform. Done. Right?

The example of forecasting

Let's say you run a retail chain that sells colorful socks. There are 100 different types of socks and - lucky you - you own 100 shops. Each item in each store is its own time series and therefore needs its own time series model. So you let your procedures automatically build 10,000 models and make sure that every sock in every shop gets the best model possible. After all, by definition, the best solution is if everyone gets the best model, right?
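
In code, that naive approach might look like this - a minimal sketch, assuming a DataFrame with store, sku, week and units columns (all names are mine, purely for illustration):

```python
# A minimal sketch of the naive "one model per series" approach.
# Assumes a DataFrame `sales` with columns store, sku, week, units -
# all names are illustrative, not from the article.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def fit_per_series_models(sales: pd.DataFrame) -> dict:
    """Fit one exponential smoothing model per (store, sku) series."""
    models = {}
    for (store, sku), group in sales.groupby(["store", "sku"]):
        series = group.sort_values("week")["units"].reset_index(drop=True)
        # 100 stores x 100 sock types -> up to 10,000 of these fits
        models[(store, sku)] = ExponentialSmoothing(series).fit()
    return models
```

Technically this runs. Whether it is strategically sound is the question of this post.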

In my opinion, there are a few problems with this approach - which is just an example. Let's talk about ...

... a few components of a forecasting strategy

99 problems but the business ain't one

What is the business goal?

Does the store want to optimize its orders for next week? Does the central purchasing department want to improve its orders for the next month? Do you want to identify trends in order to offer new types of socks as private labels? This makes a big difference for the types of decisions to be made next!

What to predict = what to do?

As a store manager, should you order as many green flamingo socks as the forecast says? Usually not.

  • Inventory management: How many socks do you still have in stock?
  • Logistics costs: Wouldn't a larger lot size make sense, e.g., ordering a little less often but in larger quantities to save on shipping costs?
  • Also important: How volatile is the demand? Wouldn't an interval estimate, or calculating with a CDF, make more sense than a point estimate if you want to ensure a certain level of availability (see the sketch after this list)? What's the cost of availability? What's the cost of unsatisfied customers?
  • Would customers switch to blue llama socks if we ran out of green flamingo socks? Here you need a clever approach that can estimate this without ever having systematically measured all the substitution relationships between the socks in reality.
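
To make the point about interval estimates concrete, here is a hedged sketch of a newsvendor-style ordering rule: instead of ordering the point forecast, order up to the demand quantile that matches a target service level. The normal-demand assumption and all numbers are mine, purely for illustration:

```python
# Ordering the point forecast ignores volatility. A newsvendor-style
# rule orders up to the demand quantile implied by the target service
# level. The normal-demand assumption and all numbers are mine.
from scipy.stats import norm

def order_quantity(mu: float, sigma: float, on_hand: int,
                   service_level: float = 0.95) -> int:
    """Order up to the service-level quantile of the demand CDF."""
    target_stock = norm.ppf(service_level, loc=mu, scale=sigma)
    return max(0, round(target_stock - on_hand))

# Point forecast: 40 green flamingo socks, 10 already in stock.
# Naive order: 30. With sigma = 15 and a 95% service level: ~55.
print(order_quantity(mu=40, sigma=15, on_hand=10))
```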

This is just an example. It's not so much about the content of the discussion here, but rather about whether you systematically think about such things in the context of your strategy.

How to build models online (fast?)

What makes a good model good? Well, nowadays you can reliably and automatically determine the best model for a problem. However, this only applies under three conditions that are often forgotten.

  1. It can only be the best model for the problem as formulated. If I optimize 10,000 "sock models", each model will be the best (if done right). But wouldn't one model for all socks be more accurate and stable? You only get answers to questions you ask. Asking the right questions is what makes it strategic. ;) This is the core of an analytical strategy!
  2. Automatic selection can only use the past as an indicator of whether the model will work in the future. It's a bet - and you don't even know the odds. The environment changes, and with it the causal relationships in the data. Maybe next year pink turtles will be the hot shit in the sock industry. Who knows? Frequent retraining can prevent this in some cases. I admit, this is more about good machine learning craftsmanship than strategy - but it is nevertheless forgotten far too often. And it's also a question of mindset.
  3. But it gets strategic again in a special case. Let's start with an analogy. Newton's physics works great for most applications here on Earth. "Working" means that it makes good predictions. But at other orders of magnitude (approaching the speed of light), where space and time begin to warp, Einstein has to do the job. Today we know that. But is your model Newton or Einstein? Is it on Earth or in space? So the question is: What happens if reality suddenly moves beyond the limits of the data observed so far, as we experienced in many fields during the corona pandemic? May I then assume that the same correlations still apply? Or more precisely: May I use a model that extrapolates the relations it found into "unknown worlds"? In this case, the choice between linear models, tree-based models, or ANNs is not a purely analytical decision. It requires an economic risk assessment, knowledge of the core premises of the models, and a great deal of business sense - or domain knowledge in a broader sense. It requires an analytical strategy. A small code sketch after this list illustrates the extrapolation point.
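
Here is a small experiment (my own illustration, not a claim about any specific company) showing why this is a risk decision and not just a benchmarking exercise:

```python
# Toy experiment: both models fit the observed regime well, but once
# reality leaves the training range, the tree-based model flat-lines
# while the linear model extrapolates the learned relation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))            # "Earth": observed regime
y_train = 3.0 * X_train.ravel() + rng.normal(0, 1, 200)

X_new = np.array([[20.0]])                             # "space": unseen regime
tree = RandomForestRegressor(random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

print(tree.predict(X_new))    # stuck near ~30, the edge of the training data
print(linear.predict(X_new))  # ~60, extrapolating the learned relation
```

Neither behavior is "correct" per se; which one is safer is an economic question about your domain.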

Should we therefore dump AI? No, the solution is to get the AI right. The solution is to have an analytical strategy.

How to borrow information (and never give it back)

Back to the important things in life: from Einstein to socks. Creating one model per sock per store is often not a good idea. The data is very sparse and very noisy. Socks are not bought that often, and they are bought very irregularly. Sometimes one purchase per week for a product in a store, then nothing for weeks. Then 3 at a time. The pattern can look completely different for other socks or stores. There is indeed a stable demand for socks (I assume). But how that demand actually manifests itself in a purchase is highly random: random when, random where (stores, competitors), and random what (type of sock).

Another physical analogy: Imagine a thunderstorm passing over you. You know there will be lightning. You are more likely to get struck if you choose the wrong place and time. But you will never know exactly when and where lightning will strike. It's basically the same with low-frequency sales data.

Another example from my personal experience: I once stood in line at the cash register in a German drugstore. An elderly gentleman in front of me bought 37 packs of "Fishermen's Friend". (A lot of lightning, so to speak.) The staff was surprised. "Is he even allowed to do that?" He was. There was no rule against it. #germany How could you have predicted that?

You could not have done that.

So we just live with the error caused by this special event. But... and it's a big but(t)... we don't want this singular event to change our forecast for the future, do we? We would order far too many socks. The same mechanism also works as a vicious circle. Imagine an item was accidentally not delivered. A negative outlier. Sales data goes down > forecast goes down > stock goes down > sales data goes down... and so on. This is just one of many examples of censored data. And the nasty thing about it is that the models always look great, because the predictions affect reality. What actually happens in the real world, however, can be economically catastrophic.
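
A deliberately crude toy simulation (my own construction, with made-up numbers) makes the spiral visible:

```python
# Toy simulation of the vicious circle. All numbers are made up.
# Sales are silently capped by available stock (censored data); the
# naive forecast chases the censored sales downward, although the
# true demand never changes.
import numpy as np

rng = np.random.default_rng(1)
true_demand = 10            # stable real demand, week after week
forecast, stock = 10.0, 10
for week in range(15):
    delivered = 0 if week == 2 else round(forecast)  # one failed delivery
    stock += delivered
    sales = min(stock, rng.poisson(true_demand))     # censored by stock
    stock -= sales
    forecast = 0.5 * forecast + 0.5 * sales          # smoothing on censored sales
    print(week, delivered, sales, round(forecast, 1))
```

Evaluated against the (censored) sales data, this forecast looks fine. Evaluated against reality, it quietly drifts away from true demand.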

To prevent this, you need to estimate the actual demand for socks, since you are only seeing a sparse random sample in your data. To do this you need to borrow information! Let me explain.

Level 1: Outlier detection is the simplest and most obvious form of borrowing information. We use information from other periods of the same time series to determine whether a data point appears reasonable, and we replace outliers with a more plausible value. However, using this procedure alone is like putting a small patch on a bullet wound: you are not healing the actual problem. You should definitely consider the following options as well.
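
As a sketch, Level 1 could be as simple as the following rolling-median patch; the window and threshold are arbitrary illustration values, not recommendations:

```python
# Level 1 as code: replace points that are far from the rolling median
# of neighboring weeks. Window and threshold are arbitrary
# illustration values, not recommendations.
import pandas as pd

def patch_outliers(units: pd.Series, window: int = 9, k: float = 4.0) -> pd.Series:
    """Replace values far from the rolling median with that median."""
    med = units.rolling(window, center=True, min_periods=3).median()
    mad = (units - med).abs().rolling(window, center=True, min_periods=3).median()
    is_outlier = (units - med).abs() > k * (mad + 1e-9)
    return units.where(~is_outlier, med)
```

The 37 packs of "Fishermen's Friend" would be pulled back toward a typical week - but the underlying sparsity remains.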

Level 2: The next step could be to build models on higher levels, for example groups of shops or groups of socks. These can be created using master data (e.g. sales regions, product groups). Ideally, however, such groups are built analytically and contain socks or stores that behave in a similar way. This borrows information from other stores and socks to get closer to actual demand, and it is usually more stable. You then still need a model that distributes the demand from the higher levels down to the lower ones. Positive side effect: Your analytical architecture becomes more resilient to changes in the products and stores. But what is the right level? Pro tip: It's all about the Bayes! With Bayesian statistics you can combine different levels.
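
A minimal way to sketch this idea is a simple empirical shrinkage toward the product group - not a full hierarchical Bayesian model, just the intuition behind one. Column names and the prior strength are my assumptions:

```python
# Level 2 as code: a simple empirical shrinkage toward the product
# group, not a full hierarchical Bayesian model. Column names and the
# prior strength are my assumptions.
import pandas as pd

def pooled_demand(sales: pd.DataFrame, prior_strength: float = 20.0) -> pd.Series:
    """Blend each (store, sku) mean with its product-group mean."""
    sales = sales.assign(
        group_mean=sales.groupby("product_group")["units"].transform("mean"))
    per_series = sales.groupby(["store", "sku"]).agg(
        n=("units", "size"),
        own_mean=("units", "mean"),
        group_mean=("group_mean", "first"))
    # Little data -> lean on the group; lots of data -> trust the series.
    w = per_series["n"] / (per_series["n"] + prior_strength)
    return w * per_series["own_mean"] + (1 - w) * per_series["group_mean"]
```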

Level 3: Going to a higher level is just a simple form of abstracting from the individual sock to a property of the sock (e.g. "contains flamingos"). However, one could also think of socks and shops as combinations of all their properties. The properties can then be used as features for a model. These properties can be derived from master data, for example, or they can be calculated from the data itself. Here, again, there are two possibilities: manually constructed (manifest) properties that are easy to interpret, or latent properties, as computed in many recommender systems.
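
As a sketch, Level 3 could mean one global model across all socks and stores, with their properties as features; all column names here are assumptions for illustration:

```python
# Level 3 as code: one global model across all socks and stores, with
# their properties as features. All column names are assumptions.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

def fit_global_model(sales: pd.DataFrame) -> HistGradientBoostingRegressor:
    """One model for all series instead of 10,000 separate ones."""
    features = ["week_of_year", "price", "contains_flamingos",
                "store_region", "store_size"]
    X = pd.get_dummies(sales[features], columns=["store_region"])
    return HistGradientBoostingRegressor().fit(X, sales["units"])
```

New socks and new stores then come "for free": they are just new combinations of known properties.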

All options come with advantages and disadvantages. The explanation in this article is certainly incomplete and merely illustrative. And again, it's not so much about the content of the discussion here.

The point, however, is to think about these things systematically within the framework of a strategy. "Trial and error" is not a bad thing, but you have to combine the knowledge you gain into an overall concept.

Conclusion - So what possibly went wrong?

Back to the start. What probably happened at the company that had to lay off 25% of its staff? The linked article says - and I oversimplify here - that the company bet on the real estate market and tried to buy real estate for less than its projected value. The model went crazy and the company lost a lot of money. If you re-read the linked article, you might ask yourself the following questions:

  • Were the fundamental properties of the model used known? Were there any thoughts on the extrapolation effect of the model?
  • Was the problem formulated correctly?
  • Was the volatility of the market modeled?
  • Was the model trusted purely on past data because it had "worked" so far? Using leading economic indicators might have prevented that - even if they hadn't improved the model in the past. This information, too, could have been borrowed.

It all looked good in the past. So how could it have come to this? Let's revisit one of the first paragraphs of this text:

"So why all the fuss about strategy? Data science doesn't sound that "strategic" at all. It seems more like a craft. Let's say you want to forecast "X": You build the features. You try out different machine learning methods more or less automatically. You implement the procedures in an ML lifecycle platform. Done. Right?"

... right?

To be clear: this isn't about that company specifically, and I don't want to judge an entire company. It's about a phenomenon that I often encounter at conferences and with job applicants. But to make the danger tangible, I wanted to give a concrete example. This is not about pleasing scientists. There is a real danger in not having an analytical strategy.


Thanks for making it this far. :)


Next in this series of posts

  • What makes you a great Data Scientist? Part 2: To spot a great data scientist you need adverse circumstances. I mean it.
  • What makes you a great Data Scientist? Part 3: Business sense
  • What makes you a great Data Scientist? Part 4: Data Science "by the book" is just software engineering - use your own brain!
  • What makes you a great Data Scientist? Part 5: The Art of the KPI

