What makes you a great Data Scientist? Part 2: Bad Circumstances
Dr. Julian Mennenöh
Head of Analytics @ REWE Group | Retail Captain hooked on AI
A matter of honor...
... is the back-translated German title of the movie "A Few Good Men". Even though Germany is notorious for translating film titles badly, I think the German title fits better in this case. The film is about which values we should stand for when we move in more than one value system. What is right in one value system (e.g. military life) may be wrong in another (e.g. civilian life). Does that make you feel uncomfortable? It should. As data scientists, we often take the objective function for granted. We minimize the error of a model or the cost function of a business problem. We often think that it always has to be like this: There is only one goal towards which we can optimize our work. That does something to our brains. I'll show you what it is.
In reality there are several value systems, and this also applies to data science, especially if you move from empirical research or software development into a business data science context. Ask yourself: How were you socialized? Do you have a background as an economist or social scientist? Are you a developer using TensorFlow? Are you a mathematician with linear algebra running through your veins? Have you taken a course on DataCamp or Udemy?
There are many angles of approach to the data science profession and they all shape you. Which brings us to the question: What is the honor of a data scientist? What values should a data scientist represent? What should WE stand for?
Bad Circumstances
In my opinion, what separates a good data scientist from a great one is how they deal with bad circumstances. And I don't mean the work environment, but rather things like incomplete data or a business goal that is not formally defined.
What do you do when things don't go as planned?
There are some attitudes that great data scientists bring to the table. So this article is more about the mindset and less about the analytical strategies that great data scientists use. (See my article for more on analytical strategies.)
Let's look at some socialization pathways I've gone through myself. I studied computer science and did my doctorate in economics.
Computer Science
With some applicants, I notice that their path of socialization gives them the wrong impression that you only have to use a few software packages in a memorized order to solve a problem. As if data science were a standard procedure or an algorithm itself. I can relate to that because I thought the same way at first.
Two thoughts on this:
Data science would only be a standard procedure if the circumstances were perfect. The business problem would be crystal clear. The data would be of good quality and sufficient to make good estimates. The data would also be well distributed and varied enough that we would already have observed every combination of influencing factors. Honestly, how often does this happen? Exactly, never!
The "standard procedure" impression is often reinforced by machine learning tutorials on YouTube. Someone shows how to prepare the data, how to build a TensorFlow model and how to store and inspect the results. And then the tutorial is over. But this is where data science is just beginning. The tutorial often gives the wrong impression that data science means implementing scikit-learn or TensorFlow. It's like saying a brush makes me a painter.
Even the godfathers of computer science do not see coding as a standard procedure - even if they have nothing to do with machine learning.
"Every good piece of code I have seen has some spark of inspiration - you cannot regulate that."
- Bjarne Stroustrup, Inventor of C++
When I switched from computer science (with machine learning experience) to economics, I was surprised. I had built a solution that was completely new and made things possible that were not possible before. A huge step forward! And yet everyone criticized it. As if "good" wasn't great enough. Now I know why...
Empirical science
If you studied economics, psychology, social sciences, medicine or pharmacology and worked with data, you probably belong to the class of empirical researchers. Empirical researchers are driven by Karl Popper's critical rationalism. The falsification of hypotheses is central to gaining knowledge. This affects two things: the culture and the data science solution itself.
I often observe that applicants from this background have an "all or nothing" attitude. "I'd rather do nothing than do something imperfect." Let me give you an example: When I took part in the Data Mining Cup (a kind of unofficial student world championship in predictive modeling) in 2012, the 10 best data science teams were invited to Berlin to present their solutions.
A group of presenters complained that the data was not sufficient for the desired prediction. (A charge that one often has to hear in empirical research.) Others saw this as a challenge and made much better predictions, by orders of magnitude. They used a trick to compensate for the difficult circumstances: Instead of calculating 500 models for 500 products, they calculated one model that predicted the %-fluctuations of all 500 products. In this way, they withheld trivial information from the model, such as the overall sales level of each product, and let it focus on the complex relationships. After estimating the %-deviation, the sales level of the products was multiplied back in. In this way, the products borrowed information from each other and it was possible to make much better forecasts despite short sales histories. (For more on such considerations, please read my article on analytical strategies.)
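The pooling trick can be illustrated with a minimal sketch. This is not the code the teams used. The data is simulated, and the single "promo" feature, the noise level and the plain least-squares model are assumptions chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated setup (assumption): 500 products with only 12 weeks of history each.
n_products, n_weeks = 500, 12
level = rng.uniform(5, 500, size=n_products)             # very different sales levels
promo = rng.integers(0, 2, size=(n_products, n_weeks))   # hypothetical shared feature
noise = rng.normal(0, 0.10, size=(n_products, n_weeks))
sales = level[:, None] * (1 + 0.30 * promo + noise)      # promotions lift sales

# Step 1: withhold the trivial information (each product's own sales level).
est_level = sales.mean(axis=1, keepdims=True)
rel_dev = sales / est_level - 1.0                        # %-deviation from the level

# Step 2: fit ONE pooled model on the %-deviations of all products.
X = np.column_stack([np.ones(promo.size), promo.ravel()])
coef, *_ = np.linalg.lstsq(X, rel_dev.ravel(), rcond=None)

# Step 3: predict the %-deviation and multiply the sales level back in.
def forecast(product_idx, promo_flag):
    rel = coef[0] + coef[1] * promo_flag
    return est_level[product_idx, 0] * (1.0 + rel)

print(forecast(0, 0), forecast(0, 1))
```

The promotion effect is estimated from all 6,000 observations at once instead of from 12 observations per product, which is how the products "borrow" information from each other.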
Don't get me wrong here: When it comes to the general advance of knowledge (especially when human lives depend on it), the approach of empirical research is essential. However, as data scientists we are in a different value system in the business context. The "proof" of causality and a critical approach are still important. However, the main focus now is to improve the status quo. This doesn't have to be a contradiction, but it does make a difference when assessing goodness. Here's an example:
We may not always be able to calculate the perfect sales forecast that meets some arbitrary goodness-of-fit criterion. But what is the alternative? Randomly guessing? I don't mean this polemically. You can actually estimate the monetary effect of a data science solution and compare it with the alternatives "random" and "do nothing". In other words, how much money can you save or make the company? In my opinion, this should be the benchmark for a data science solution rather than arbitrary quality criteria. (See my upcoming article on business sense for further information.)
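Such a comparison could look like the following minimal sketch. All numbers are assumptions: the demand is simulated, the unit costs are hypothetical, and the "model" is a hard-coded stand-in for whatever forecast a real solution would produce.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated daily demand for one product (assumption: weekends sell more).
n_days = 365
weekday = np.arange(n_days) % 7
demand = rng.poisson(100 + 30 * (weekday >= 5))

# Hypothetical unit costs in EUR: leftover stock vs. lost sales.
COST_OVER, COST_UNDER = 0.50, 2.00

def cost(order, demand):
    """Total cost of ordering `order` units when `demand` units are requested."""
    over = np.clip(order - demand, 0, None)
    under = np.clip(demand - order, 0, None)
    return float(np.sum(COST_OVER * over + COST_UNDER * under))

# The three alternatives from the text.
alternatives = {
    "do nothing": np.full(n_days, 100),                # keep ordering 100 units a day
    "random": rng.integers(50, 201, size=n_days),      # guess an order quantity
    "model": np.where(weekday >= 5, 130, 100),         # stand-in for a real forecast
}

costs = {name: cost(order, demand) for name, order in alternatives.items()}
savings_vs_do_nothing = costs["do nothing"] - costs["model"]

for name, value in costs.items():
    print(f"{name:>10}: {value:10.2f} EUR per year")
print(f"savings vs. do nothing: {savings_vs_do_nothing:.2f} EUR per year")
```

The point of the design is that the forecast is judged in euros against realistic alternatives, not against a statistical threshold.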
Fun fact, speaking of arbitrary quality criteria: Did you know that the commonly used 5% significance level is said to have come from a citation error?
Some advice
So what would be my tips if you want to start a data science career? What values should we pursue as data scientists in a business context?
We focus on improvement ...
... rather than on what is scientifically "right". This becomes an issue especially when the circumstances are bad.
We simply make things better than they were before. Short sentence. Huge difference.
We work agile
And by that I don't necessarily mean Scrum and Kanban. Working in small steps is also a form of risk management. Much of what we do is highly risky and can sometimes go wrong. "What if I make a mistake?" you may ask yourself.
Whether something could work can often be determined with just 5% of the effort required for a 100% solution. Use some simulated data. Ask a few experts to get a proxy for ground truth. Whatever it takes. Overcome those bad circumstances with dirty tricks. If it works, put in some more effort, and so on...
When we calculate a PCA, we are happy if we can explain 80% of the variance with 5% of the factors. Why do we like to use this principle in statistics but tend to reject it for our own purposes and strive directly for the ideal solution?
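For readers who have not seen this effect in numbers, here is a minimal sketch on simulated data. The data-generating setup (40 variables driven by 2 latent factors plus noise) is an assumption made for illustration, and the PCA is computed directly with NumPy's SVD rather than with a dedicated library.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (assumption): 40 observed variables driven by 2 latent factors.
n_samples, n_features, n_latent = 1000, 40, 2
latent = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = latent @ loadings + 0.3 * rng.normal(size=(n_samples, n_features))

# PCA via singular value decomposition of the centered data.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components that explains at least 80% of the variance.
k = int(np.searchsorted(cumulative, 0.80) + 1)
print(f"{k} of {n_features} components explain {cumulative[k - 1]:.0%} of the variance")
```

Because the signal lives in only two latent factors, k should come out small relative to the 40 variables. This is the same "most of the value for a fraction of the effort" logic the text argues for.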
Data science is not about programming...
... or building dashboards. Chemistry is not about test tubes and astronomy is not about telescopes. One of many ways to look at it is: Data science is about understanding statistical and contextual relationships in data and using them to the advantage of the company.
We learn from each other
Many data scientists feel that they are expected to know everything and be able to do everything. When they then encounter bad circumstances, it may sometimes be easier to say "it can't be done" instead of "I have no idea how this could work". But I think this expectation is often self-imposed. In reality, business is complicated. Data science itself is already complicated. Nobody expects an "all-rounder".
We have discussed different socialization paths of data scientists. Being aware of them is a great chance to learn from each other.
We don't give up that easily
Good data science starts where good circumstances end :)
Thanks for making it this far. :)
Other posts of this series:
Part 1 "Analytical Strategy": https://www.dhirubhai.net/posts/dr-julian-mennen%C3%B6h-3093a45_like-share-comment-activity-6961980243708362752-p7HN?utm_source=share&utm_medium=member_desktop