An Evaluation of pycaret's 'Regression - Level Beginner'

LinkedIn seems awash these days with demonstrations of pycaret and demonstrations of covid-19 prediction skills, with the two overlapping more than some might be comfortable with, especially given pycaret's very recent debut. While I focus on my major technical questions about the package and point out some obvious strengths and deficiencies, several cultural phenomena in the Data Science world will rear their ugly heads as the analysis goes along. Before I launch into my evaluation, a disclaimer: pycaret is in its early days, and any apparent deficiencies might simply reflect my unfamiliarity with where its more esoteric documentation lives online, or the premature (over)use of the package by the Data Science community before pycaret has had a chance to complete its documentation or even its methodological repertoire. Yet I feel compelled to publish a rather premature evaluation as it stands, simply because its use in some of our more serious industries, and on some of our more widely read media like LinkedIn, risks the proliferation of not fake news but the methodological equivalent of it. I'd be happy to be wrong, as the stakes are high, and no harm done if I were; but if I'm not, I hope this assessment, along with other cultural observations, will lend much-needed caution and responsibility to the Data Science community in high-stakes endeavors, healthcare research especially.

1) Large yet limited range of models

pycaret boasts over 20 models, all of which can be compared at once with a single command. You've got to give them props for having something titled 'Passive Aggressive Regression,' for marketing purposes at least. Offshoots of very popular algorithms, like Extra Trees (from Random Forest) and LightGBM (from GBM), also provide variety backed by widespread theoretical knowledge. However, many of the models are quite similar to one another, and similar models tend to produce similar results as well as be used for similar purposes. An example of not just similarity but redundancy is the joint presence of Elastic Net, lasso regression and ridge regression: setting Elastic Net's L1 penalty to zero effectively yields ridge regression, and setting its L2 penalty to zero yields the lasso.
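The Elastic Net redundancy is easy to verify outside pycaret. Here is a minimal scikit-learn sketch on synthetic data of my own (the penalty settings are illustrative): with all of the penalty weight on the L1 term, Elastic Net's fit coincides with the lasso's.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Small synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# l1_ratio=1.0 puts all penalty weight on the L1 term,
# which is exactly the lasso objective.
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(enet.coef_)
print(lasso.coef_)  # same coefficients as the Elastic Net fit above
```

The symmetric case holds as well: `l1_ratio=0.0` reduces Elastic Net to a ridge-style L2-only penalty.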

There is nothing wrong with having similar models to produce the slight edge that is so pivotal in certain industries like adtech, but the bigger issue is that the range of models omits even the most commonly used ones at work, and some of the most basic models one might learn in an advanced undergraduate or basic graduate Statistics class. There is no zero-inflated Poisson model for modeling very rare occurrences; no logistic regression (a trick comment - they classed it under 'Classification,' as if regressions weren't capable of classification); no quantile regression to deal with heavily skewed outcome distributions; no survival analysis to deal with time-to-event, censored data; and no mixed effects models (otherwise known as mixed models or hierarchical models) to deal with hidden or latent effects of individual (group) trajectories. To be fair, the latter two should probably not be at the beginner's level, though they aren't at pycaret's intermediate level either. The expert level is still in the making, but if history repeats itself, disciplinary sidelining (by Pythonistas, of Statistics, a discipline comprising half of the US' Data Scientists) might keep them missing for years, just as Google Camp's planned development in Python of mixed models and of survival analysis models beyond their one canonical flavor has been stalling since 2015.

2) Opaque metrics

In 2016, I wrote a LinkedIn article lamenting the myths of Data Science, one of which is an assumed proficiency of all practitioners. Those were early days, before the years went on to prove the quackery of some practitioners, the failure of expensive Data Science projects and the attendant all-but-bursting Data Science bubble. One of the key harbingers of sloppy Statistics (and thus Data Science) was - and still is - scikit-learn's default use of the R-squared as opposed to the adjusted R-squared, which penalizes additional input factors that might otherwise inflate the apparent generalizability of results from the training data to unseen data. Many years have passed since Python's first real attempt at statistical modeling with scikit-learn's introduction, and it seems scant feedback has allowed such oversights to persist in pycaret, which likewise reports the R-squared rather than the adjusted R-squared as the metric of choice alongside RMSE, MAE, MAPE, etc.
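For reference, the adjustment is a one-line correction of the R-squared for the number of predictors p relative to the sample size n - a minimal sketch:

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared: penalizes R-squared for each extra predictor.

    adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# An R-squared of 0.90 looks impressive, but with only 50 observations
# and 20 predictors the adjusted figure drops noticeably.
print(adjusted_r2(0.90, n_obs=50, n_predictors=20))  # ≈ 0.831
```

With zero predictors the penalty vanishes and the two metrics coincide, which is why the gap between them grows exactly when overfitting risk grows.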

3) Fundamental Statistical Omissions

The cultural implications of pycaret, or at least of the way it has made its splash into the Data Science world, are abundantly clear from the absence of any display of traditional (linear) regression coefficients or of machine learning algorithms' variable importance measures, such as Random Forest's or XGBoost's. While scikit-learn outputs variable importance measures for Random Forest and XGBoost easily, pycaret has no option to do so - or Google Search has personalized my results so vindictively as to make a fool of me here. Both pycaret and scikit-learn seem equally dismissive of the importance of regression coefficients, which aren't displayed at all in pycaret and which require an additional package for their display in scikit-learn. Base R, by contrast, displays coefficients, standard errors and p-values with a simple 'summary' statement, and confidence intervals with a one-line 'confint' call.

This raises the question of why coefficients (with confidence intervals and significance levels) and other, less standard variable importance measures have been neglected in certain Data Science cultural milieus, especially the Python-ML-engineering camps, when industries from healthcare to insurance to retail to adtech all crave transparency about which input factors they can pull as levers to dial certain outcomes up or down. Perhaps these Python developers think they can't be everything to everyone and choose to focus on model fit and predictive accuracy instead of interpretability (and thus obvious actionability) - but perhaps they have lost sight of empirical findings that interpretable models with standard coefficients can predict as well as, if not better than, 'state-of-the-art' deep learning models at non-astronomical sample sizes. In my own industry experience, my most successful predictive models have as often been traditional regression models as less interpretable machine learning models.

4) Opaque, suboptimal and missing preprocessing steps

Unsurprisingly perhaps, given the neglect of arguably the most widely used and versatile methodology of all, the linear regression, every canonical pre-modeling inspection step for checking its assumptions - normality of residuals, constancy of variance, and so on - is missing, whereas R outputs quite a few of these diagnostic plots simply by calling 'plot' on the model object.

Again, I might chalk it up to Google Search's improbable vendetta against me (though why didn't I search from the back?), but I haven't found any documentation, even after looking at the 'Intermediate' version, on how the following preprocessing steps, among others, are achieved: removing outliers, the numeric imputer, feature selection, feature interaction, and the high cardinality method ('PCA' exists separately, so I assume it isn't that - though what happens if you turn the high cardinality method off but PCA on? Wouldn't that be a contradiction?). As for removing multicollinearity, transforming the target, and transformations (of input variables, I assume), I've caught a glimpse of their mechanisms, and pycaret's general ones, through the intermediate version. It seems one has no option to choose how to transform variables, for example, but is at least shown what the method was after the model has run. I suppose each algorithm tries a few transformation methods and picks the one producing the best output metrics (though which metric it optimizes for, I'm not sure), or perhaps it has just one. Then there is the issue of potentially conflicting settings: turning on 'feature selection' while using a lasso regression (which already selects your features), or having the algorithm choose a 'box-cox transformation' by itself while 'polynomial features' is turned off, thus blocking the box-cox method from choosing any polynomial transformations.

Regarding the rare glimpses into preprocessing methods, quite a handful have given me pause. I caught a glimpse of the 'mean' being used as an imputation method. That is fair enough if we were handcoding from scratch and wanted quick results, but the whole point of pycaret is efficiency, and since it already has k-nearest-neighbors and Random Forest built in, one wonders why such methods couldn't be used for imputation instead.

As for 'numeric binning,' the intermediate version reveals that Sturges' rule is being used. Sturges' rule is a one-liner whose formula depends only on the number of observations. Even without knowing the formula, you could probably guess that there are much better methods for finding bins, especially if your purpose is to maximize your chances of finding a statistically significant variable or to enhance predictive accuracy. You could run a loop through various breaks in a variable and find the set of breaks that minimizes the p-value in a univariate test of its association with the outcome of interest. Even more simply, you could view scatterplots of each variable against the outcome to pick some breaks.

The last preprocessing step I'll address is the removal of multicollinearity. The method hinted at would not be any rigorous practitioner's first choice, which would be the VIF or GVIF. In the official pycaret tutorial, a multicollinearity threshold of .95 was output, hinting that a correlation threshold, rather than the VIF, is being used. The issue with using correlations alone is that a correlation may be very high yet meaningless (and not to be acted upon) if it is not statistically significant, as can happen when sample sizes are too small. Think of how someone could bully you into agreeing that it's been hot for the past few days and fewer people are eating ice cream today - and therefore that there's a full 1.0 correlation between warm weather and the avoidance of ice cream.
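For comparison, the VIF itself is only a few lines in statsmodels. A sketch on synthetic data with one deliberately collinear pair (the data and the conventional rule-of-thumb thresholds are my own illustration):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # near-collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1, x2, x3])  # intercept in column 0

# VIF per predictor (skipping the intercept column).
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)  # x1 and x2 blow up; x3 stays near 1
```

A common rule of thumb treats VIFs above 5 or 10 as problematic; unlike a raw pairwise correlation cutoff, the VIF also catches collinearity spread across several variables at once.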
Statistical significance is so important a concept that any (treatment) effect reported in any serious scientific journal (not the likes of 'Nature,' though it has its uses) would be considered moot if p-values, or their counterpart, confidence intervals, weren't reported. Yet the Python development world has given free rein to the display of correlations without significance levels, most notably in the ever-popular O'Reilly title 'Python for Data Analysis,' that pernicious bible for every self-taught Data Scientist who might have omitted to consult more theoretically rigorous guides. The emphasis on learning a programming language has superseded theory so much that sloppy practices are often overlooked - and, in this case, practised in pycaret. Don't get me wrong: programming remains an integral tool of Data Science, at least until pycaret makes some serious improvements for every programmophobe. But pure programmers without sufficient statistical background should restrict their lessons to pure programming and not try to cash in on the latest data analysis trend - not, at least, without first consulting people with extensive statistical training.
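The small-sample point is easy to demonstrate with scipy, which reports a p-value alongside every correlation (the toy numbers are my own): with four observations, a correlation of 0.8 carries a p-value of roughly 0.2, nowhere near significance.

```python
from scipy.stats import pearsonr

# Four observations: the correlation looks strong, but with so little
# data it is far from statistically significant.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

r, p = pearsonr(x, y)
print(r, p)  # r = 0.8, p ≈ 0.2
```

A 0.95 correlation cutoff that ignores p entirely would happily act on exactly this kind of noise.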


Conclusion

pycaret, despite its hype, seems a pale version of scikit-learn, inheriting its current deficiencies and lagging behind it in some respects. Its one obvious triumph is the ability to test multiple algorithms at once, even if each of these algorithms, owing to some missing and suboptimal preprocessing steps, might not even be optimized for predictive accuracy. Crucially, we must also remember that pycaret's focus seems to be predictive accuracy at the expense of interpretable variable importances and significances, even though there is typically no tradeoff between interpretability and predictive accuracy for interpretable models.

Addendum

For all practitioners who are using pycaret, caution is urged: there is no substitute for rigorous theoretical training (and I mean a comprehensive one including fundamentals, not just trendy topics like 'deep learning'), complemented by the usual programming skills, which we apparently still need because pycaret isn't where we might want it to be yet. Revisiting the deficiencies highlighted above hints that what we take for granted as simple, such as a linear regression, is often the trickiest of all, with hidden pitfalls; not only will a convenient platform like pycaret not address such traps, it can often blind us to the need to assiduously check modeling assumptions and to the sheer meticulousness that goes into preprocessing.

There have been many calls for 'non-experts' to refrain from covid-19 modeling especially, and there must surely be some tension arising from them. I understand the compulsion to help out in a time like this, so instead of urging anyone without much of a theoretical and domain-related background to refrain from getting knee-deep in (medical) healthcare research, I would urge you to get into it - and to take it so seriously as to either earn a relevant degree or assemble your own carefully curated, home-schooled version of it, and then gain work experience in the field thereafter. The latter option must be very mindfully curated, as the trendiest Data Science course platforms, Coursera for example, have traditionally not offered A/B testing. Yet A/B tests (not under that umbrella term, but as specific tests like the chi-square test) are prevalent in scientific journals in all their prerequisite glory: the most rigorous of models may still run into multicollinearity problems, where one variable masks another's effect, while a univariate A/B test shows a variable's association with the outcome as it is.

If you are keen to help out in healthcare, just do it, and make it a labor of love with a real, substantial sacrifice of your time for others and the self-reward of rigorous learning.

More articles by Alice SH Wong