Machine learning's worst enemies: Mickey Mice

I know this is strange, but I really recommend watching this short Mickey Mouse story before you continue reading. You can read the article first, but you will lose much of the fun (it is not the full movie, just the part that reminds you of the full story).

When we try to think of the enemy of machine learning and artificial intelligence, many of us will say 'old-style scientists who reject machine learning', or 'those who are afraid that ML and AI will let computers rule the world'.

A physicist who has seen the amazing results produced by machine learning will look at the ceiling and say: 'My problem is that I cannot know how the ML reached this specific solution, or whether it is the only possible solution. These results were reached after thousands upon thousands of iterations that were not tracked or logged, and even if they were logged, it would take us decades to review the process. Even then, we would not be sure what would happen if we re-ran the same model. And using ML solutions to monitor ML models in real time will hide much of the actual solution-evolution process.'

Some practitioners who were optimizing a model just before stumbling upon this article will promptly reply, loudly and with a bit of anger (or more), 'the lack of a scientific mechanism for deciding on arbitrary model parameters'. Just then, some AutoML geeks will stand up at the back of the room and start speaking all at once, but you can make out that they said, 'Hey, we have solved this problem already; we now have intelligent systems that can decide on the optimal parameters to optimize the model for any given data, and it is working perfectly'; yet no one can define the meaning of 'perfectly'.

At this very moment, someone pops up almost out of nowhere (though he was sitting just behind the AutoML guys) and says, 'machine learning cannot replace data scientists ...'; then whispers, '... but it can replace almost any other professional!'

Someone else, who has been using machine learning to forecast market prices for a big online store's products, speaks calmly: 'Our biggest enemy is the lack of realistic feature designers', or maybe he will say, 'overfitting: if we could just have a line that separates overfitting caused by incorrect sampling from genuine tectonic market shifts'.

A lot of other people will start offering more reasons, and at this moment you will have the feeling that almost everything is an enemy of ML. Maybe I will have time later to discuss each of these arguments in a separate article.

And they are all right, to different extents. But the worst enemy of machine learning is bad data scientists and bad business owners.

Data scientists who just repeat the same code, trying to build models with a high level of accuracy, without even understanding their data or the context. I still remember when Jake VanderPlas, one of the best data science writers, computed a suspiciously large difference between the average father and mother ages because he did not know that the data-entry staff inserted 99 as the father's age when it was unknown. He presented this as a funny example in one of his lectures to encourage data scientists to explore and understand their data before starting to build models.
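To make the anecdote concrete, here is a minimal sketch of that kind of pre-modelling exploration. The records and column names are made up for illustration; the point is the habit of looking before averaging:

```python
import pandas as pd
import numpy as np

# Hypothetical records mimicking the anecdote: data-entry staff used 99
# as a sentinel value for "father's age unknown".
df = pd.DataFrame({
    "father_age": [34, 41, 99, 38, 99, 45, 99, 36],
    "mother_age": [31, 39, 35, 33, 30, 42, 29, 34],
})

print(df.mean())  # father_age looks strangely inflated

# Exploring first exposes the spike at 99 before it poisons any statistic.
print(df["father_age"].value_counts().head())

# Treat the sentinel as missing, then recompute.
df["father_age"] = df["father_age"].replace(99, np.nan)
print(df.mean())  # now the two averages are comparable
```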

I even know some data scientists who learn more about data science from the media and fluffy news than from the real scientific background. They do not only believe in myths; they disseminate these myths in the community.

You should not think that I am underestimating ML; not at all. Those who think machine learning has matured are just under the influence of the great achievements attained by ML around the world, but the day is still young. ML still has a lot of problems to solve; we have just opened a magic tomb and are still exploring it. Excitement about a few magic sticks here and there does not mean we have understood the real magic, and the real magic is still coming (you can re-watch Mickey Mouse in Disney's Fantasia on YouTube; yes, I love Mickey Mouse). The amazing results coming out of ML models have a lot of real effort and understanding behind them, not only from data scientists but from field scientists and experts as well. And for now, machines are not supposed to write new scientific theories; they are just tools that can help the world, if the world is well understood first.

Lazy (and sometimes novice) data scientists are feeding the community with wrong expectations and half-truths about the nature and limitations of the machine learning process; this is ML's worst enemy.

Some data scientists are supposed to help scientists from other fields such as physics, medicine, healthcare, hospitality, and marketing. These scientists have started to notice inconsistencies in ML outputs in some cases. This does not void the vast number of genuinely amazing success stories, but it may shake the scientific community's confidence in ML outputs forever. It will deprive a considerable number of scientists of the mature versions of ML, because by then it will be too late to make a good first impression. You can read Genevera Allen's "Can we trust scientific discoveries made using machine learning?" to see the impact (of course, the scientific community should also do more to know better, and I guess this is one part of Genevera's message, but here I am focusing on the message that we should take from her article).

It is the data scientists' role to clearly explain the difference between the science and the implementation. We should present the precautions and limitations along with the models and their results. People who use the outputs of ML models should know how far they can rely on these outputs, and should also know the risks.

A very simple example of the half-truths usually presented to the community is the meaning of overfitting. Most writers will just tell the community that if the model's accuracy is 100% or 99%, the model is overfitted. To make it clear, they say the model relies more on memorizing the training data than on learning the underlying patterns and behaviours, and this is not wrong; yet it is incomplete.

In reality, if the model is not right (if it does not have the right configuration, the right optimization technique, the right encoding, the right number of steps, or if any other parameter is wrong), it can be overfitting even at 90%, because overfitting is entirely relative. Many data scientists, especially new ones, will come to you happily four hours after a task assignment and tell you they have reached 98% accuracy. But they stand still when you feed their model fully random inputs and it still produces something like 80% accuracy. In such a case, a wrong encoding decision combined with mal-distributed training data has created a model that is biased from the start; no matter what the accuracy tells you, it is all fake. And when you reach 90%, that may still be overfitting for your specific case, yet they feel quite satisfied because it is not 100%.
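Here is a minimal sketch of that random-input sanity check. The dataset and model are made-up stand-ins; the idea is that a healthy classifier fed pure noise should fall back to roughly the majority-class rate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for whatever the model was trained on;
# the real signal lives only in feature 0.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Sanity check: replace the inputs with pure noise. On this balanced
# problem the score should drop to about 50%; if it stays far above
# that, something upstream (encoding, leakage, skewed sampling) is
# inflating the number.
X_noise = rng.normal(size=X_test.shape)
print("accuracy on random inputs:", model.score(X_noise, y_test))
```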

Even if the data, the encoding, and the evaluation functions are all correct, if your model (which also has all the suitable components or layers) has higher dimensionality than the original problem (something extra, something that is not supposed to exist), you will find solutions very quickly, but they cannot be applied, or at least cannot be reproduced, in real life. Think of it like this: you need to eat, so you have to build the right model to let you eat. Instead of building a process that includes "work, then get money, then buy food ingredients, then prepare the food and eat it", you found, by chance, a neighbour who left some food on the table in his garden; and you found it an optimal solution, for now, to pass by while he is sitting there. He will certainly invite you (out of politeness), so you will accept the invitation and eat. But trust me, he will not invite you every time. So your model is wrong: you have built it on a new, unreliable, short-lived dimension.
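A minimal sketch of this "neighbour's table" dimension, using made-up data: one extra column leaks the label during data collection but will not exist, or will not behave the same, in production:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: 10 weak but genuine features, plus one extra
# column that happens to encode the answer (the neighbour's food).
rng = np.random.default_rng(seed=1)
n = 2000
X_real = rng.normal(size=(n, 10))
y = (X_real.sum(axis=1) + rng.normal(scale=3.0, size=n) > 0).astype(int)
leak = y + rng.normal(scale=0.1, size=n)      # short-lived extra dimension
X = np.column_stack([X_real, leak])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy with the leak:", model.score(X_te, y_te))  # near-perfect

# In real life the neighbour stops inviting you: the extra column is gone,
# so we replace it with noise and the model's skill collapses.
X_prod = X_te.copy()
X_prod[:, -1] = rng.normal(size=len(X_te))
print("accuracy when the extra dimension disappears:", model.score(X_prod, y_te))
```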

Another example appears in price-forecasting models. Some data scientists will not notice that they are targeting moving objectives that change their speed over time. They will start building models using the same code they learned and have used many times; they will presume that everything is stable; they will not bother to include the speed of market change as one of the parameters (meta-features, to be accurate); they will not deal with tectonic market shifts (and their impact on the homogeneity of the time series). They will just do the job, use higher dimensions that swallow the real traits of the context, and build a skewed world that has nothing in common with the real one. They are playing the role of Mickey Mouse, but in the real world you can imagine the disaster, or just watch a comic version of it in the movie. (See the sketch below for how an innocent-looking evaluation can hide such a shift.)
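The sketch below uses an invented price series with a regime change halfway through. Shuffled cross-validation lets every fold see both regimes and reports an optimistic score; a time-ordered split, which trains only on the past as production would, exposes how badly the model extrapolates across the shift:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Hypothetical non-stationary price series: a tectonic shift halfway
# through changes both the level and the trend of the market.
rng = np.random.default_rng(seed=2)
t = np.arange(1000)
price = 100 + 0.05 * t + rng.normal(scale=2.0, size=1000)
price[500:] += 40 + 0.2 * (t[500:] - 500)  # regime change

X = t.reshape(-1, 1)
y = price

# Shuffled folds mix past and future, hiding the shift.
shuffled = cross_val_score(LinearRegression(), X, y,
                           cv=KFold(5, shuffle=True, random_state=2))
# Time-ordered folds respect causality and reveal the problem.
ordered = cross_val_score(LinearRegression(), X, y, cv=TimeSeriesSplit(5))

print("shuffled CV R^2:", shuffled.mean())
print("time-ordered CV R^2:", ordered.mean())
```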

The data science world is full of examples like these, misunderstood by the community or misused by data scientists. I think these mistakes, misunderstandings, and misuses are the real enemy of ML.

The power of ML comes at a price: it requires appropriate data in terms of both quantity and quality. ML power by itself is like a rocket; it is a disaster if you launch it without understanding the destination, the path, the obstacles, and the objectives. And inside the ML community, we need to build 'algorithm-psychiatry' models that correct an algorithm's behaviour using data of specific distributions. That will be another article, but remember that you read it here first :) :)

Ahmed Moharram; Berlin, Feb. 2019
