Abstraction and Data Science - Not a great combination
Venkat Raman
Co-Founder & CEO at Aryma Labs | Building Marketing ROI Solutions For a Privacy First Era | Statistician
Abstraction - some succinct definitions.
“Abstraction is the technique of hiding implementation by providing a layer over the functionality.”
“Abstraction, as a process, denotes the extracting of the essential details about an item, or a group of items, while ignoring the inessential details.”
“Abstraction - Its main goal is to handle complexity by hiding unnecessary details from the user.”
Abstraction as a concept and an implementation technique in software engineering is good. But when extended to Data Science and overdone, it becomes dangerous.
Recently, the issue of sklearn’s default L2 penalty in its logistic regression algorithm came up again.
This issue was first pointed out in 2019 by Zachary Lipton.
On the same issue, W.D. wrote an excellent blog post titled ‘Scikit-learn’s Defaults are Wrong’. Here is the link to that article.
This article, IMHO, is a must-read for any serious Data Scientist.
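To see the issue concretely, here is a minimal sketch using scikit-learn's documented API (the dataset and parameters are purely illustrative): nothing in the default call signals that a penalty is being applied, yet it visibly shrinks the coefficients.

```python
# A minimal sketch of the hidden default: LogisticRegression silently
# applies an L2 penalty with C=1.0 unless you explicitly opt out.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative toy data, not from the original article
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

default_fit = LogisticRegression().fit(X, y)            # penalty applied silently
plain_fit = LogisticRegression(penalty=None).fit(X, y)  # unpenalized MLE
# (penalty=None needs sklearn >= 1.2; older versions use penalty='none')

params = default_fit.get_params()
print(params["penalty"], params["C"])   # 'l2' 1.0 -- nothing in the call hinted at this
print("penalized:  ", default_fit.coef_)
print("unpenalized:", plain_fit.coef_)  # typically larger: the default shrinks them
```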
While the author of that article has excellently captured the design-pattern flaws, I would just like to build on it and add the problem of ‘too much abstraction’.
In one of my recent posts, I highlighted how abstracting away the GLM machinery in sklearn’s logistic regression leads a large number of people to believe that the ‘Regression’ in Logistic Regression is merely a misnomer and that it has nothing to do with regression!
Below is an image from that article highlighting the issue.
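To build on that GLM point, here is a small sketch (same illustrative data as above) fitting logistic regression as what it actually is, a Generalized Linear Model with a Binomial family and logit link, via statsmodels; its estimates agree with sklearn's only once the hidden penalty is switched off:

```python
# A minimal sketch showing that logistic regression is a GLM: statsmodels'
# Binomial-family GLM (default logit link) reproduces sklearn's fit
# once the hidden L2 penalty is disabled.
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Classic regression machinery: intercept added explicitly, MLE fit
glm_fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
sk_fit = LogisticRegression(penalty=None).fit(X, y)  # regularization off

print(glm_fit.params[1:])  # GLM slope estimates
print(sk_fit.coef_[0])     # agrees with the GLM, up to solver tolerance
```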
So, why is ‘Too much abstraction’ a problem in Data Science?
I took the liberty of modifying François Chollet’s famous diagram on the difference between traditional programming and ML to drive home some important points regarding ‘too much abstraction’.
Firstly, in normal programming, when you abstract, you just abstract away the fixed rules. This works out fine in the software development realm: you don’t want certain people tinkering with the ‘fixed rules’, or they simply don’t care ‘how things work under the hood’.
But in Data Science, if you do too much abstraction, you are also abstracting away the intuition of how the algorithm works and, most importantly, you are hiding the knobs and levers necessary to tweak the model.
Let’s not forget that the role of a data scientist is to develop intuition for how the algorithms work and then tweak the relevant knobs/levers to make the model the right fit for the business problem.
Taking this away from Data Scientists is just counterintuitive.
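For instance, here is a minimal sketch of what it looks like when those knobs are set deliberately rather than inherited; the specific values are illustrative, not recommendations:

```python
# A minimal sketch of setting the knobs deliberately instead of
# inheriting hidden defaults. Values below are illustrative only.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    penalty="l2",             # choose the regularization; don't inherit it
    C=10.0,                   # inverse regularization strength: tune it, don't default it
    class_weight="balanced",  # handle class imbalance explicitly
    solver="lbfgs",           # pick the optimizer that suits the problem
    max_iter=500,             # give the solver room to converge
)
print(model.get_params())     # every knob is now a visible, deliberate choice
```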
These aside, there are other pertinent questions on ‘too much abstraction’.
Let’s revisit one of the Abstraction definitions from above: “Abstraction - Its main goal is to handle complexity by hiding unnecessary details from the user.”
When it comes to data science libraries or low-code solutions, the question arises: who decides ‘what is unnecessary’? Who decides which knobs and levers a user can or can’t see and tweak?
Are the people making these decisions well trained in statistics and machine learning concepts? Or are they coming from a purely programming background?
In this regard, I can’t help but borrow an apt excerpt from W.D.’s article: “One of the more common concerns you’ll hear–not only from formally trained statisticians, but also DS and ML practitioners–is that many people being churned through boot camps and other CS/DS programs respect neither statistics nor general good practices for data management.”
On the user side of Data Science, here are the perils of using libraries or low-code solutions with ‘too much abstraction’.
The danger of doing Data Science wrongly becomes that much more exacerbated. Not to mention that ‘You don’t need math for ML’ and ‘Try all models’ kinds of articles encourage people to do data science without much diligence. Any guesses as to what could go wrong?
Data science is not some poem that can be interpreted any which way. There is a definitive right and wrong way to do data science and implement data science solutions.
Also, Data Science is not just about predictions. How these predictions are made and what ingredients led to those predictions also matter a lot. ‘Too much abstraction’ abstracts away these important parts too.
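As a small sketch (same illustrative data as the earlier snippets), even a fitted model can be opened up to show those ingredients, here the coefficients and the odds ratios they imply:

```python
# A minimal sketch of looking past the predictions at the 'ingredients':
# the fitted coefficients and the odds ratios they imply.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
model = LogisticRegression(penalty=None).fit(X, y)

for i, beta in enumerate(model.coef_[0]):
    # exp(beta): multiplicative change in the odds per unit increase in feature i
    print(f"feature {i}: coef = {beta:+.3f}, odds ratio = {np.exp(beta):.3f}")
```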
Read the Documentation
Coming to the defense of these ‘too abstracted’ libraries and solutions, some remark that the user should ‘read the documentation carefully and in detail’.
Well, not many have the time, and most importantly, some low-code solutions and libraries are sold on the idea of ‘Perform ML in 2-3 lines of code’ or ‘Do modelling faster’.
So again, referencing W.D.: ‘read the doc is a cop-out’. Especially when it comes from low-code solution providers.
A Bigger Problem to Ponder Upon
Having said all this, sklearn is still by and large a good library for Machine Learning. The L2 default problem might be one of its very few flaws.
However, I would urge the readers to ponder over this:
If abstracting away some details in one ML algorithm could cause so many issues, imagine what abstracting away details from a dozen or so ML algorithms in a single line of code could result in. Some low-code libraries do exactly that.
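As a hedged sketch of that pattern, here is what such a one-liner looks like with a low-code library such as PyCaret (named by a commenter below), using its documented classification API; exact signatures vary by release, and the dataset is illustrative:

```python
# A hedged sketch of the one-line pattern: a single call fits and ranks a
# dozen-plus classifiers, each carrying defaults the user never sees.
import pandas as pd
from sklearn.datasets import make_classification
from pycaret.classification import setup, compare_models

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
df = pd.DataFrame(X, columns=["f0", "f1", "f2"]).assign(target=y)

setup(data=df, target="target", session_id=0)  # preprocessing decided for you
best = compare_models()  # many algorithms fitted in one line, defaults and all
```

The point is not that the call fails; it is that dozens of modelling decisions, preprocessing, solvers, penalties, are made invisibly inside that single line.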
I am not against abstraction or automation per se. My concern is only with ‘too much abstraction’, and I don’t have a concrete answer for how to tackle it in data science libraries. One can only wonder if there is even a middle ground.
But one thing is very clear: the issues of ‘too much abstraction’ in Data Science are real.
The more one abstracts away, the greater the chance of doing data science wrongly.
Perhaps all we can do is be wary of low-code solutions and libraries. Caveat emptor.
Comments
Data Analyst/Data Science
1y: Cloud platforms like AWS and Azure have been reigning for years. Why do people not argue about them? Why do people not ask whether they are black boxes? Why say low code is bad? Why aren’t people complaining about those cloud platforms I mentioned earlier? People just love arguing. It is like saying manual cars are better than automatic cars. We data scientists need to wake up before it is too late. There is nothing wrong with trying no-code and low-code open source. If they can’t use it, that doesn’t mean they should discourage others from using it. I can assure you that PyCaret is 100% tested and trusted.
Experienced Technologist & Strategic Leader | Data Science, Software Engineering & Cybersecurity | Driving Innovation & Growth
2y: There are useful abstractions that make data science easier. I don’t think it is as simple as “abstractions are always bad”. Useful abstractions include those provided by SaaS vendors that hide the physical details of file storage and access, as well as other operating-system details; they save time. Similarly, programming-language constructs such as Pandas in Python provide building blocks that hide details of memory allocation and other areas, making the job easier than coding in pure C. I remember the days of buying servers, waiting for them to be set up, then learning Linux/UNIX in order to create your first basic model. Notebooks provide a faster means to get going now. It is true that you cannot hide all the details, but it is also true that you can avoid some details and still be very successful with Data Science. The question is “which ones?”
I am a physicist specializing in AI, ML, data science, engineering, and analytics. With expertise across industries, I combine physics passion and tech knowledge to deliver innovative R&D projects.
3y: Indeed. Working with abstractions is the domain of physics and the other sciences, which try to cover many different phenomena with one scientific law. Machine learning works on particular cases and doesn’t understand what abstraction is.
Product, Technology and assortment analytics for 6.5 years | Kaggle Grandmaster | IIT Kanpur
3y: Low-code libraries are bad if you don’t know what’s going on. Otherwise they’re a good way to build quick prototypes and baseline models. You do not need to know the basics of a microwave’s heat coil to operate a microwave, but you do need to understand how the parameters (temperature, wattage, heat placement, etc.) affect the final result. Source: I’ve worked with 10+ low-code libraries, and most of these libraries are developed by Kaggle Masters and Grandmasters who know what they’re doing, as well as corporates. A few, like DataRobot, MLJAR, and EvalML, even have domain-specific custom metrics built in. I think of it as borrowing their skill for a baseline and building from there instead of treating it as a black box.
Data Scientist | Microsoft Certified | Data Analyst | Senior @ Interbank
3y: I always insist on the necessity of reading the documentation and understanding the theory behind it before applying any ML library. There are always assumptions and requirements we need to check before using any of these. Thanks for such a great article!