You have to fall in love with the Insights not with the Models (or with Coding)
[Image: Gradient descent process]


"It is essential to remember that when it comes to data science, the goal should not be to fall in love with the models or coding, but instead to fall in love with the insights that can be gained from the data. Models and coding are simply tools that allow us to gain those insights, so it is important to focus on the end goal of uncovering useful information and knowledge from the data." This was written entirely by AI, with openai.com.

As in all data analysis, context is important: a weekend in Madrid, tapas, cañas, nerds, and talk about how data science is changing. The interesting thing was the cross-section of people: engineers, statisticians, economists, managers, human resources people, and data science outliers. The geographic mix was also fairly balanced, at least between Latin Americans and Europeans. So we could draw some conclusions with statistical weight, at least within this set of friends.

So here are some conclusions I drew, and they really worried me:

  1. There are people who love coding far more than the problem they want to solve.
  2. There are people who want to use the latest algorithm discovered at MIT, Stanford or Google in a 5,000-year-old sector, at a company with a 20th-century culture.
  3. Some people love the model more than solving a decision-making problem.
  4. Some people ask which library you use before asking what you want to solve.
  5. Some people prefer XGBoost because they read the latest technical post, rather than another model that may be slightly less accurate but can be deployed much faster.
  6. There are people who write code when the best that code can do is something similar to (and generally worse than) Excel, SAS or SPSS.
  7. There are people who pay a lot of attention to the technological infrastructure and forget that business results are needed. If there is no revenue, it is difficult for costs to increase.
  8. There are people who are more afraid of low-code than of Freddy Krueger.
  9. There are people who launch products as if they were unique. One example is QuantumBlack (McKinsey AI) with its CausalNex (here), based on the Google library CausalImpact (here). In 90% of cases, they do the same.

Obviously, the previous comments are biased. It wasn't all bad news, but it does worry me. Data science has been around for a long time and has always been about finding patterns and giving decision makers facts. In fact, some decisions are so simple and routine that they could be systematized with prescriptive analytics, reserving decision-making time for those that are ad hoc or require more creativity.

Coding is not the point. Comparing models is not the key to data science. People ask: What model do you use? Have you used XGBoost for modelling? What libraries do you use? I think it's a huge waste of time, and it is not a predictor of anything. Suppose I tell you that I use SPSS Modeler and that, once I load the database and define the target, it automatically recommends all possible models. I press run and it gives me a report with the performance of each model. Does that make me a data scientist? And if I do the same thing with Python and the result is the same, am I a better data scientist?
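The "load the data, define the target, run every candidate, read the report" workflow described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what SPSS Modeler does internally: the toy data, the two candidate models (a majority-class baseline and a one-feature threshold rule) and all function names are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical toy data: (monthly_usage, churned) pairs, invented for illustration.
TRAIN = [(2, 1), (3, 1), (5, 1), (8, 0), (9, 0), (12, 0), (4, 1), (10, 0), (11, 0)]
TEST = [(1, 1), (6, 0), (11, 0), (3, 1)]


def fit_majority(rows):
    """Baseline: always predict the most common label seen in training."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label


def fit_threshold(rows):
    """One-feature rule: predict 1 when x is below the best training cutoff."""
    def train_accuracy(cut):
        return sum((x < cut) == bool(y) for x, y in rows) / len(rows)

    best_cut = max(sorted({x for x, _ in rows}), key=train_accuracy)
    return lambda x: int(x < best_cut)


def accuracy(model, rows):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def compare(candidates, train, test):
    """Fit every candidate and return (name, holdout accuracy), best first."""
    report = [(name, accuracy(fit(train), test)) for name, fit in candidates.items()]
    return sorted(report, key=lambda item: item[1], reverse=True)


CANDIDATES = {"majority": fit_majority, "threshold": fit_threshold}
REPORT = compare(CANDIDATES, TRAIN, TEST)

for name, acc in REPORT:
    print(f"{name}: {acc:.2f}")
```

Whether this loop runs in Python, in R or behind a button in a low-code tool, the report is the same. Choosing the target and deciding what to do with the answer is still up to the analyst.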

Look at the models that are used depending on where you work (industry, academia or research).

[Image: models used by sector. Source: KDnuggets & Forrester]

The problem is that many people who work in industry want to use the models that people in academia use. I'm not saying that people in industry don't innovate with data science (or with models); what I'm saying is that innovation in industry lies in the 4Ds that I raised in this post (here). Design the problem in an innovative way (churn, default, etc. are not innovative). Define what Data you are going to use: differentiated, alternative, complementary data, or the company's own (biased) data plus data from the yellow pages?

Regarding Development, how are you going to develop the algorithms? Today in industry everything falls more or less within a fairly small margin. Believe me, in the last 3 or 4 years I have developed and deployed a number of models, all based on code (mainly in R, because the need for more statistical power was important).

At an academic level, for my MSc thesis in Statistics I developed 5 different survival models to see how they performed. For my PhD thesis I used a difference-in-differences (DiD) model to analyze the impact of (tax) incentives on investment decisions. I used autoregressive moving average models to understand the behavior of Covid in Uruguay. I used ANN, RNN and CNN to develop an Income Predictor for the entire population of Uruguay. I used ANN and MLR to understand the propensity of the clients of a financial institution (more than 200k clients). I used MLR to infer the price of a head of cattle in auction processes. I used CausalImpact to find out whether the UK government change in October had a major impact on the pound (here). And believe me, I can go on.
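For readers unfamiliar with difference-in-differences, the core idea fits in a few lines. The sketch below is the textbook two-group, two-period estimator, not the thesis model: the investment figures and variable names are invented, and a real analysis would typically use regression with controls and check the parallel-trends assumption.

```python
from statistics import mean

# Hypothetical investment levels (arbitrary units), invented for illustration.
treated_before = [10.0, 12.0, 11.0]   # firms that later receive the incentive
treated_after = [15.0, 17.0, 16.0]
control_before = [9.0, 10.0, 11.0]    # firms that never receive it
control_after = [11.0, 12.0, 13.0]


def diff_in_diff(t_before, t_after, c_before, c_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(t_after) - mean(t_before)
    control_change = mean(c_after) - mean(c_before)
    return treated_change - control_change


effect = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(f"Estimated effect of the incentive: {effect:.1f}")
```

The control group's change stands in for what would have happened to the treated group without the incentive, so only the extra change is attributed to it.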

In fact, I leave here a comparison of models for predicting whether a stock was going to pay dividends or not. All the code is here. It's free. Use it. No company is going to build a competitive advantage on this code, but if you are an SME and need my help, send me a DM and I'll help you for free.

[Images: four figures from the model comparison. Source: own development in R (the last one in R & Excel)]

Innovation in data science is not in how many libraries you use, or in whether you use Python, or in whether you code or use low-code, SAS, SPSS or Excel. It's somewhere else.

It is in understanding that you have a problem if you draw conclusions without considering ergodicity; without knowing what moral hazard and adverse selection are; without understanding that you cannot model a chaotic experiment based on Bayes; without knowing what entropy implies for a database; without understanding that information asymmetry can be seen from different perspectives (as George A. Akerlof, A. Michael Spence and Joseph E. Stiglitz did); without developing nested models capable of telling us whether the WHO (people or companies) will do the WHAT (propensity, origination, default, collection, churn, etc.) and also WHEN they will do it; and without knowing that value theory has at least 3 stages: generating, appropriating and distributing value. These are a few of the many concepts that make up data science in business (in other fields the knowledge is different, but the concept is identical).
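One way to make "what entropy implies for a database" concrete is the Shannon entropy of a column: a column where every row holds the same value carries no information, while a balanced one carries the most. This is a minimal sketch under that reading; the column values and names are invented for illustration.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical columns, invented for illustration.
constant_column = ["ES"] * 8              # same value in every row
balanced_column = ["ES", "UY"] * 4        # two values, equally frequent
skewed_column = ["ES"] * 7 + ["UY"]       # one value dominates

h_constant = shannon_entropy(constant_column)
h_balanced = shannon_entropy(balanced_column)
h_skewed = shannon_entropy(skewed_column)

print(f"constant: {h_constant:.3f} bits")
print(f"balanced: {h_balanced:.3f} bits")
print(f"skewed:   {h_skewed:.3f} bits")
```

A near-constant column adds almost nothing to a model no matter which library fits it, which is the kind of question worth asking before choosing an algorithm.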

And the truth is that this has nothing to do with Python/R libraries. It has to do with creativity and with the evolution of a discipline that is based on finding patterns, generating insights, making better decisions, optimizing the theory of value, and generating dynamic and sustainable competitive advantages.


Author: Diego Vallarino, PhD (he/him)