An Evaluation of pycaret's 'Regression - Level Beginner'

LinkedIn seems awash these days with demonstrations of pycaret and demonstrations of covid-19 prediction skills, with the two overlapping more than some might be comfortable with, especially given pycaret's very recent debut. While I focus on my major technical questions about the package and point out some obvious strengths and deficiencies, several cultural phenomena in the Data Science world will rear their ugly heads as the analysis goes along. Before I launch into my evaluation, a disclaimer: pycaret is in its early days, and any apparent deficiencies might simply reflect my unfamiliarity with where its more esoteric documentation lives online, or the premature (over)use of the package by the Data Science community before pycaret has had a chance to complete its documentation or even its methodological repertoire. Yet I feel compelled to publish a rather premature evaluation as it stands, simply because its use in some of our more serious industries, and on some of our more widely read media like LinkedIn, risks the proliferation of not fake news but the methodological equivalent of it. I'd be happy to be wrong, as the stakes are high, and no harm done if I were; but if I'm not, I hope this assessment, along with other cultural observations, will lend much-needed caution and responsibility to the Data Science community in high-stakes endeavors, healthcare research especially.

1) Large yet limited range of models

pycaret boasts over 20 models, all of which can be compared at once with a single command. You've got to give them props for having something titled 'Passive Aggressive Regression,' for marketing purposes at least. Offshoots of very popular algorithms, like Extra Trees (from Random Forest) and LightGBM (from GBM), also provide variety backed by widespread theoretical knowledge. However, many of the models are quite similar to one another, and similar models tend to produce similar results as well as be used for similar purposes. An example of not just similarity but redundancy is the joint presence of Elastic Net, lasso regression and ridge regression: setting Elastic Net's L1 penalty to zero effectively yields ridge regression, and setting its L2 penalty to zero yields the lasso.
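The Elastic Net redundancy is easy to verify outside pycaret. Here is a minimal scikit-learn sketch on synthetic data of my own (the penalty settings are illustrative): with all of the penalty weight on the L1 term, Elastic Net's fit coincides with the lasso's.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Small synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# l1_ratio=1.0 puts all penalty weight on the L1 term,
# which is exactly the lasso objective.
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(enet.coef_)
print(lasso.coef_)  # same coefficients as the Elastic Net fit above
```

The symmetric case holds as well: `l1_ratio=0.0` reduces Elastic Net to a ridge-style L2-only penalty.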

There is nothing wrong with having similar models to produce the slight edge that is so pivotal in certain industries like adtech, but the bigger issue is that the range of models omits even the most commonly used ones at work, and some of the most basic models one might learn in an advanced undergraduate or basic graduate Statistics class. There is no zero-inflated Poisson model for modeling very rare occurrences; no logistic regression (a trick comment - they classed it under 'Classification,' as if regressions weren't capable of classification); no quantile regression to deal with heavily skewed outcome distributions; no survival analysis to deal with time-to-event, censored data; and no mixed effects models (otherwise known as mixed models or hierarchical models) to deal with hidden or latent effects of individual (group) trajectories. To be fair, the latter two should probably not be at the beginner's level, though they aren't at pycaret's intermediate level either. The expert level is still in the making, but if history repeats itself, disciplinary sidelining (by Pythonistas, of Statistics, a discipline comprising half of the US' Data Scientists) might keep them missing for years, just as Google Camp's planned development in Python of mixed models and of survival analysis models beyond their one canonical flavor has been stalling since 2015.

2) Opaque metrics

In 2016, I wrote a LinkedIn article lamenting the myths of Data Science, one of which is an assumed proficiency of all practitioners. Those were early days, before the years went on to prove the quackery of some practitioners, the failure of expensive Data Science projects and the attendant all-but-bursting Data Science bubble. One of the key harbingers of sloppy Statistics (and thus Data Science) was - and still is - scikit-learn's default use of the R-squared as opposed to the adjusted R-squared, which penalizes additional input factors that might otherwise inflate the apparent generalizability of results from the training data to unseen data. Many years have passed since Python's first real attempt at statistical modeling with scikit-learn's introduction, and it seems scant feedback has allowed such oversights to persist in pycaret, which likewise reports the R-squared rather than the adjusted R-squared as the metric of choice alongside RMSE, MAE, MAPE, etc.
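For reference, the adjustment is a one-line correction of the R-squared for the number of predictors p relative to the sample size n - a minimal sketch:

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalizes R-squared for each extra predictor.

    adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# An R-squared of 0.90 looks impressive, but with only 50 observations
# and 20 predictors the adjusted figure drops noticeably.
print(adjusted_r2(0.90, n_obs=50, n_predictors=20))  # ≈ 0.831
```

With zero predictors the penalty vanishes and the two metrics coincide, which is why the gap between them grows exactly when overfitting risk grows.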

3) Fundamental Statistical Omissions

The cultural implications of pycaret, or at least of the way it has made its splash into the Data Science world, are abundantly clear from the absence of any display of traditional (linear) regression coefficients or of machine learning algorithms' variable importance measures, such as Random Forest's or XGBoost's. While scikit-learn outputs variable importance measures for Random Forest and XGBoost easily, pycaret has no option to do so - or Google Search has personalized my results so vindictively as to make a fool of me here. Both pycaret and scikit-learn seem equally dismissive of the importance of regression coefficients, which aren't displayed at all in pycaret and which require an additional package for their display in scikit-learn. Base R, by contrast, displays coefficients, standard errors and p-values with a simple 'summary' statement, and confidence intervals with a one-line 'confint' call.

This raises the question of why coefficients (with confidence intervals and significance levels) and other, less standard variable importance measures have been neglected in certain Data Science cultural milieus, especially the Python-ML-engineering camps, when industries from healthcare to insurance to retail to adtech all crave transparency about which input factors they can pull as levers to dial certain outcomes up or down. Perhaps these Python developers think they can't be everything to everyone and choose to focus on model fit and predictive accuracy instead of interpretability (and thus obvious actionability) - but perhaps they have lost sight of empirical findings that interpretable models with standard coefficients can predict as well as, if not better than, 'state-of-the-art' deep learning models at non-astronomical sample sizes. In my own industry experience, my most successful predictive models have as often been traditional regression models as less interpretable machine learning models.

4) Opaque, suboptimal and missing preprocessing steps

Unsurprisingly perhaps, given the neglect of arguably the most widely used and versatile methodology of all, the linear regression, every canonical pre-modeling inspection step for checking its assumptions - normality of residuals, constancy of variance, and so on - is missing, whereas R outputs quite a few of these diagnostic plots simply by calling 'plot' on the model object.

Again, I might chalk it up to Google Search's improbable vendetta against me (though why didn't I search from the back?), but I haven't found any documentation, even after looking at the 'Intermediate' version, on how the following preprocessing steps, among others, are achieved: removing outliers, the numeric imputer, feature selection, feature interaction, and the high cardinality method ('PCA' exists separately, so I assume it isn't that - though what happens if you turn the high cardinality method off but PCA on? Wouldn't that be a contradiction?). As for removing multicollinearity, transforming the target, and transformations (of input variables, I assume), I've caught a glimpse of their mechanisms, and pycaret's general ones, through the intermediate version. It seems one has no option to choose how to transform variables, for example, but is at least shown what the method was after the model has run. I suppose each algorithm tries a few transformation methods and picks the one producing the best output metrics (though which metric it optimizes for, I'm not sure), or perhaps it has just one. Then there is the issue of potentially conflicting settings: turning on 'feature selection' while using a lasso regression (which already selects your features), or having the algorithm choose a 'box-cox transformation' by itself while 'polynomial features' is turned off, thus blocking the box-cox method from choosing any polynomial transformations.

Regarding the rare glimpses into preprocessing methods, quite a handful have given me pause. I caught a glimpse of the 'mean' being used as an imputation method. That is fair enough if we were handcoding from scratch and wanted quick results, but the whole point of pycaret is efficiency, and since it already has k-nearest-neighbors and Random Forest built in, one wonders why such methods couldn't be used for imputation instead.

As for 'numeric binning,' the intermediate version reveals that Sturges' rule is being used. Sturges' rule is a one-liner whose formula depends only on the number of observations. Even without knowing the formula, you could probably guess that there are much better methods for finding bins, especially if your purpose is to maximize your chances of finding a statistically significant variable or to enhance predictive accuracy. You could run a loop through various breaks in a variable and find the set of breaks that minimizes the p-value in a univariate test of its association with the outcome of interest. Even more simply, you could view scatterplots of each variable against the outcome to pick some breaks.

The last preprocessing step I'll address is the removal of multicollinearity. The method hinted at would not be any rigorous practitioner's first choice, which would be the VIF or GVIF. In the official pycaret tutorial, a multicollinearity threshold of .95 was output, hinting that a correlation threshold, rather than the VIF, is being used. The issue with using correlations alone is that a correlation may be very high yet meaningless (and not to be acted upon) if it is not statistically significant, as can happen when sample sizes are too small. Think of how someone could bully you into agreeing that it's been hot for the past few days and fewer people are eating ice cream today - and therefore that there's a full 1.0 correlation between warm weather and the avoidance of ice cream.
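For comparison, the VIF itself is only a few lines in statsmodels. A sketch on synthetic data with one deliberately collinear pair (the data and the conventional rule-of-thumb thresholds are my own illustration):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # near-collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1, x2, x3])  # intercept in column 0

# VIF per predictor (skipping the intercept column).
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)  # x1 and x2 blow up; x3 stays near 1
```

A common rule of thumb treats VIFs above 5 or 10 as problematic; unlike a raw pairwise correlation cutoff, the VIF also catches collinearity spread across several variables at once.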
Statistical significance is so important a concept that any (treatment) effect reported in any serious scientific journal (not the likes of 'Nature,' though it has its uses) would be considered moot if p-values, or their counterpart, confidence intervals, weren't reported. Yet the Python development world has given free rein to the display of correlations without significance levels, most notably in the ever-popular O'Reilly title 'Python for Data Analysis,' that pernicious bible for every self-taught Data Scientist who might have omitted to consult more theoretically rigorous guides. The emphasis on learning a programming language has superseded theory so much that sloppy practices are often overlooked - and, in this case, practised in pycaret. Don't get me wrong: programming remains an integral tool of Data Science, at least until pycaret makes some serious improvements for every programmophobe. But pure programmers without sufficient statistical background should restrict their lessons to pure programming and not try to cash in on the latest data analysis trend - not, at least, without first consulting people with extensive statistical training.
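The small-sample point is easy to demonstrate with scipy, which reports a p-value alongside every correlation (the toy numbers are my own): with four observations, a correlation of 0.8 carries a p-value of roughly 0.2, nowhere near significance.

```python
from scipy.stats import pearsonr

# Four observations: the correlation looks strong, but with so little
# data it is far from statistically significant.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

r, p = pearsonr(x, y)
print(r, p)  # r = 0.8, p ≈ 0.2
```

A 0.95 correlation cutoff that ignores p entirely would happily act on exactly this kind of noise.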


Conclusion

pycaret, despite its hype, seems a pale version of scikit-learn, inheriting its current deficiencies and lagging behind it in some respects. Its one obvious triumph is the ability to test multiple algorithms at once, even if each of these algorithms, owing to some missing and suboptimal preprocessing steps, might not even be optimized for predictive accuracy. Crucially, we must also remember that pycaret's focus seems to be predictive accuracy at the expense of interpretable variable importances and significances, even though there is typically no tradeoff between interpretability and predictive accuracy for interpretable models.

Addendum

For all practitioners who are using pycaret, caution is urged: there is no substitute for rigorous theoretical training (and I mean a comprehensive one including fundamentals, not just trendy topics like 'deep learning'), complemented by the usual programming skills, which we apparently still need because pycaret isn't where we might want it to be yet. Revisiting the deficiencies highlighted above hints that what we take for granted as simple, such as a linear regression, is often the trickiest of all, with hidden pitfalls; not only will a convenient platform like pycaret not address such traps, it can often blind us to the need to assiduously check modeling assumptions and to the sheer meticulousness that goes into preprocessing.

There have been many calls for 'non-experts' to refrain from covid-19 modeling especially, and there must surely be some tension arising from them. I understand the compulsion to help out in a time like this, so instead of urging anyone without much of a theoretical and domain-related background to refrain from getting knee-deep in (medical) healthcare research, I would urge you to get into it - and to take it so seriously as to either earn a relevant degree or assemble your own carefully curated, home-schooled version of it, and then gain work experience in the field thereafter. The latter option must be very mindfully curated, as the trendiest Data Science course platforms, Coursera for example, have traditionally not offered A/B testing. Yet A/B tests (not under that umbrella term, but as specific tests like the chi-square test) are prevalent in scientific journals in all their prerequisite glory: the most rigorous of models may still run into multicollinearity problems, where one variable masks another's effect, while a univariate A/B test shows a variable's association with the outcome as it is.

If you are keen to help out in healthcare, just do it, and make it a labor of love with a real, substantial sacrifice of your time for others and the self-reward of rigorous learning.

More articles by Alice SH Wong