
What makes you a great Data Scientist? Part 2: Bad Circumstances

Dr. Julian Mennenöh

Head of Analytics @ REWE Group | Retail Captain hooked on AI

Published: October 18, 2022

A matter of honor...

... is the back-translated title of the movie "A Few Good Men". Even though Germany is notorious for translating film titles horribly, I think the German title fits better in this case. The film is about which values we should stand for when we move between more than one value system. What is right in one value system (e.g. military life) may be wrong in another (e.g. civilian life). Does that make you feel uncomfortable? It should. As data scientists, we often take the goal function for granted. We minimize the error of a model or the cost function of a business problem. We often think it always has to be like this: there is only one goal towards which we can optimize our work. That does something to our brains. I'll show you what it is.

In reality, there are several value systems, and this also applies to data science, especially if you move from empirical research or software development into a data science business context. Ask yourself: How were you socialized? Do you have a background as an economist or social scientist? Are you a developer using TensorFlow? Are you a mathematician with linear algebra running through your veins? Have you taken a course on DataCamp or Udemy?

There are many angles of approach to the data science profession and they all shape you. Which brings us to the question: What is the honor of a data scientist? What values should a data scientist represent? What should WE stand for?

Bad Circumstances

So what separates a good data scientist from a great one, in my opinion, is how you deal with bad circumstances. And I don't mean the work environment, but rather something like incomplete data or a business goal that is not formally defined.

What do you do when things don't go as planned?

There are some attitudes that great data scientists bring to the table. So this article is more about the mindset and less about the analytical strategies that great data scientists use. (See my article for more on analytical strategies.)

Let's look at some socialization pathways I've gone through myself. I studied computer science and did my doctorate in economics.

Computer Science

With some applicants, I notice that their path of socialization gives them the wrong impression that you only have to use a few software packages in a memorized order to solve a problem. As if data science were a standard procedure or an algorithm itself. I can relate to that because I thought the same way at first.

Two thoughts on this:

Many roads lead to Rome. And some are much faster or way less bumpy. You have to think about your analytical strategies and the variance structure in the data. Which packages to use is only the second step.
"Proof" by contradiction: IF data science were essentially "just" a standard procedure of applying software packages, THEN data science could be automated and your job would be in jeopardy. There are already great packages for this, such as PyCaret by Moez Ali. So you should definitely get rid of this idea. I think about it as follows: Packages like PyCaret are great because they relieve data scientists of boilerplate code and let them focus more on actually working with what the data represents. Ask yourself: How can my human intelligence help make artificial intelligence better?

Data science would only be a standard procedure if the circumstances were perfect. The business problem would be crystal clear. The data would be of good quality and enough to make good estimates. The data would also be well distributed and have a variance structure that allows us to have observed every combination of influencing factors before. Honestly, how often does this happen? Exactly, never!

This can often be observed in machine learning tutorials on YouTube. Someone shows how to prepare the data, how to build a TensorFlow model and how to store and inspect the results. And then the tutorial is over. But this is where data science is just beginning. The tutorial often gives the wrong impression that data science means using scikit-learn or TensorFlow. It's like saying a brush makes me a painter.

Even the godfathers of computer science do not see coding as a standard procedure - even if they have nothing to do with machine learning.

"Every good piece of code I have seen has some spark of inspiration - you cannot regulate that."

- Bjarne Stroustrup, Inventor of C++

When I switched from computer science (with machine learning experience) to economics, I was surprised. I had built a solution that was completely new and made things possible that were not possible before. A huge step forward! And yet everyone criticized it. As if "good" wasn't great enough. Now I know why...

Empirical science

If you studied economics, psychology, social sciences, medicine or pharmacology and worked with data, you probably belong to the class of empirical researchers. Empirical researchers are driven by Karl Popper's critical rationalism. The falsification of hypotheses is central to gaining knowledge. This affects two things: the culture and the data science solution itself.

The culture among empirical researchers is characterized by criticism. Everyone criticizes each other, which ensures the progress of knowledge for the entire research community. And that is good and right for empirical researchers. (However, if this principle were elevated to a corporate culture, I would describe it as somewhat toxic. ;)
Furthermore, this critical culture leads to a competitive environment between researchers or research groups. On the one hand, this creates the implicit expectation of having to know everything (better). On the other hand, results are only published if they are absolutely "bulletproof".

I often observe that applicants from this background have an "all or nothing" attitude. "I'd rather do nothing than do something imperfect." Let me give you an example: When I took part in the Data Mining Cup (a kind of unofficial student world championship in predictive modeling) in 2012, the 10 best data science teams were invited to Berlin to present their solutions.

A group of presenters complained that the data was not sufficient for the desired prediction. (A charge that one often has to hear in empirical research.) Others saw this as a challenge and made much better predictions, by orders of magnitude. They used a trick to compensate for the difficult circumstances: Instead of calculating 500 models for 500 products, they calculated one model that predicted the %-fluctuations of all 500 products. In this way, they withheld trivial information from the model, such as the overall sales level of each product, and let it focus on the complex relationships. After estimating the %-deviation, the sales level of each product was multiplied back in. In this way, the products borrowed information from each other and it was possible to make much better forecasts despite short sales histories. (For more on such considerations, please read my article on analytical strategies.)
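To make the trick more tangible, here is a minimal sketch of the idea, not the actual competition solution. The toy data, the promotion flag as the only influencing factor, and the function names are all assumptions made for illustration. Each product's sales level is taken as its average non-promotion sales, one pooled multiplier per promotion state is estimated across all products, and the level is multiplied back in for the forecast.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (product, on_promotion, units_sold).
# Histories are short, and "caviar" was never observed on promotion.
HISTORY = [
    ("apples", 0, 100), ("apples", 1, 150), ("apples", 0, 110),
    ("caviar", 0, 4), ("caviar", 0, 6),
    ("bread", 0, 200), ("bread", 1, 320), ("bread", 0, 190),
]


def fit_pooled_model(history):
    """Return (sales level per product, pooled multiplier per promotion flag).

    Assumes every product has at least one non-promotion observation.
    """
    # Step 1: the "trivial" information, i.e. each product's own sales level.
    baseline = defaultdict(list)
    for product, promo, units in history:
        if not promo:
            baseline[product].append(units)
    levels = {product: mean(units) for product, units in baseline.items()}

    # Step 2: one model for all products, fitted on relative deviations only.
    ratios = defaultdict(list)
    for product, promo, units in history:
        ratios[promo].append(units / levels[product])
    factors = {promo: mean(values) for promo, values in ratios.items()}
    return levels, factors


def forecast(levels, factors, product, promo):
    """Multiply the product's sales level back into the pooled relative effect."""
    return levels[product] * factors[promo]


levels, factors = fit_pooled_model(HISTORY)
caviar_promo = forecast(levels, factors, "caviar", 1)
print(f"Pooled promotion multiplier: {factors[1]:.2f}")
print(f"Caviar forecast on promotion: {caviar_promo:.2f} units")
```

Caviar has no promotion history of its own, so a per-product model could not estimate this case at all. The pooled multiplier lets it borrow the promotion effect observed for apples and bread.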

Don't get me wrong here: When it comes to the general advance of knowledge (especially when human lives depend on it), the approach of empirical research is essential. However, as data scientists we are in a different value system in the business context. The "proof" of causality and a critical approach is still important. However, the main focus now is to improve the status quo. This doesn't have to be a contradiction, but it does make a difference when assessing goodness. Here's an example:

We may not always be able to calculate the perfect sales forecast that meets some arbitrary goodness-of-fit criterion. But what is the alternative? Randomly guessing? This is not meant as polemic. You can actually estimate the monetary effect of a data science solution and compare it with the alternatives "random" and "do nothing". In other words: how much money can you save or earn for the company? In my opinion, this should be the benchmark for a data science solution rather than arbitrary quality criteria. (See my upcoming article on business sense for further information.)
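Such a comparison can be sketched in a few lines. Everything in the following example is a hypothetical assumption for illustration: the demand figures, the forecasts, the two unit costs, and the definition of "do nothing" as always ordering the same quantity. The point is only the shape of the benchmark: translate forecast errors into money and compare against the alternatives.

```python
import random

# Hypothetical unit costs: spoilage for every unit ordered too many,
# lost margin for every unit ordered too few.
OVERSTOCK_COST = 0.5
UNDERSTOCK_COST = 2.0

# Hypothetical demand and order quantities for five periods.
demand = [100, 120, 90, 150, 110]
model_orders = [105, 115, 95, 140, 112]       # orders based on an imperfect forecast
do_nothing_orders = [100] * len(demand)       # status quo: always order the same amount


def total_cost(orders, demand):
    """Monetary cost of ordering `orders` when `demand` actually occurs."""
    cost = 0.0
    for ordered, needed in zip(orders, demand):
        if ordered >= needed:
            cost += (ordered - needed) * OVERSTOCK_COST
        else:
            cost += (needed - ordered) * UNDERSTOCK_COST
    return cost


def average_random_cost(demand, draws=1000, seed=0):
    """Average cost of guessing uniformly between the lowest and highest demand."""
    rng = random.Random(seed)
    low, high = min(demand), max(demand)
    costs = [
        total_cost([rng.randint(low, high) for _ in demand], demand)
        for _ in range(draws)
    ]
    return sum(costs) / draws


model_cost = total_cost(model_orders, demand)
do_nothing_cost = total_cost(do_nothing_orders, demand)
random_guess_cost = average_random_cost(demand)

print(f"Model:      {model_cost:.2f}")
print(f"Do nothing: {do_nothing_cost:.2f}")
print(f"Random:     {random_guess_cost:.2f}")
print(f"Savings vs. do nothing: {do_nothing_cost - model_cost:.2f}")
```

The forecast in this toy example is far from perfect, yet it is clearly cheaper than both alternatives. That difference, not a fit statistic, is what the business cares about.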

Fun fact, speaking of arbitrary quality criteria: Did you know that the commonly used 5% significance level came from a citation error?

Some advice

So what would be my tips if you want to start a data science career? What values should we pursue as data scientists in a business context?

We focus on improvement ...

... rather than on what is scientifically "right". This becomes an issue especially when the circumstances are bad.

We simply make things better than they were before. Short sentence. Huge difference.

We work agile

And by that I don't necessarily mean Scrum and Kanban. Working in small steps is also a form of risk management. Much of what we do is highly risky and can sometimes go wrong. "What if I make a mistake?" you may ask yourself.

Don't: Try to work out the solution until it is "bulletproof".
Do: Dirty prototyping.

Whether something could work can often be determined with just 5% of the effort required for a 100% solution. Use some simulated data. Ask a few experts to get a proxy for ground truth. Whatever it takes. Overcome those bad circumstances with dirty tricks. If it works, make some more effort, and so on...

When we calculate a PCA, we are happy if we can explain 80% of the variance with 5% of the factors. Why do we like to use this principle in statistics but tend to reject it for our own purposes and strive directly for the ideal solution?

Data science is not about programming...

... or building dashboards. Chemistry is not about test tubes and astronomy is not about telescopes. One of many ways to look at it is: Data science is about understanding statistical and contextual relationships in data and using them to the advantage of the company.

We learn from each other

Many data scientists feel the expectation that they have to know everything and be able to do everything. When they then encounter bad circumstances, it may sometimes be easier to say "it can't be done" instead of "I have no idea how this could work". But I think this expectation is often self-imposed. In reality, business is complicated. Data science itself is already complicated. Nobody expects an "all-rounder".

If you are inexperienced, ask experienced data scientists. But don't ask what to do. Ask HOW they work.
If you're experienced, share. But don't just selectively point out what you know better. Rather, let others benefit from your experiences, failures and insights.

We have discussed different socialization paths of data scientists. Being aware of them is a great chance to learn from each other.

We don't give up that easily

Good data science starts where good circumstances end :)

Thanks for making it this far. :)

More articles by Dr. Julian Mennenöh

What makes you a great Data Scientist? (I assume you already are a good one.) Part 1: The Analytical Strategy [9 min read]

August 7, 2022
