Logistic regression has been a regression since its birth - and is used this way every day.
Adrian Olszewski
Clinical Trials Biostatistician at 2KMM (100% R-based CRO) • Frequentist (non-Bayesian) paradigm • NOT a Data Scientist (no ML/AI), no SAS • Against anti-car/-meat/-cash restrictions • In memory of The Volhynian Massacre
TL;DR
Is linear regression a regression to you? And Poisson or negative binomial regression? And beta or gamma regression? Cox regression? Quantile regression? To a statistician they all represent essentially the same idea: predicting the value of some function of the data, conditional on the predictor. Linear regression (aka the General Linear Model) gives you the conditional mean, E(Y|X=x). Poisson, gamma, logistic, multinomial (the Generalized Linear Model, GLM) give you the conditional expectation, appropriately "linked" with the predictor (to preserve linearity): g(E(Y|X=x)). Quantile regression gives you a conditional quantile, Qτ(Y|X=x). Cox regression gives you the conditional hazard function of the failure time, λ(t|X=x).

Logistic regression is truly no different! It also gives you a conditional expectation, like any other member of the GLM family - here, for the Bernoulli distribution, so it has a natural interpretation as a probability (an uncalibrated one, BTW). Logistic regression was invented and developed between the 1930s and 1970s (Berkson, McFadden, Cox, Nelder, Wedderburn) to replace the existing probit model, and was essential in answering regression-related questions in experiments with binary endpoints. That was years before people started using it for classification. Nowadays it is the key regression algorithm in experimental research, like clinical trials, where it is used to answer questions about treatment efficacy and safety (through testing hypotheses about main, interaction and simple effects) and to explore epidemiological questions about potential risk factors (through marginal effects).
"But, Adrian, its properties make it suitable for classification tasks!", "Adrian, note that it can be obtained from perceptron, a neural network, suitable for classification". Sure! That's right. Also Sir David Cox in his book ("Analysis of Binary Data") mentions also the relationship between logistic regression and discriminant analysis. But that's just one out of multiple applications of the conditional expectation (=regression).
I could call it a "statistical test", because in my field 90% of the time I use it for testing hypotheses (subjecting its regression output to Wald's, Wilks LR or Rao inferential framework). The same way the ML community calls it "classifier" because in 99.9% they subject its (regression!) output to a decision rule. But would it be valid to name it after just single application ignoring ALL OTHERS? Absolutely not. It's too limiting, especially that thousands of statisticians use it for tasks other than classification on daily basis. In all those applications regression is the primary outcome, all others - are secondary. Of course, you can treat it as "a classifier", if you wish, but don't say that "logistic regression is not a regression", because its confusing the existential quantifier ?x ("for some cases") with the universal one ?x (for all cases).
"But Adrian, the coefficients are about log odds, they don't represent change of the response variable directly unlike the linear regression!" - tell me, how many regressions do you know where coefficients represent the change in raw response? ALL regressions (including the linear one) relate predictor with CONDITIONAL EXPECTATION, not the RAW response! Moreover, Poisson regression - integer input and fractional output, log(E(Y|X=x)). Cox regression? Binary input (alive/dead) + time to event result in fractional output = conditional S(t). Where's the problem? Will you reject the entire Generalized LM family (but remember, linear regression is part of it!) or Cox regression as regressions too?
And what if I tell you that for many years Professor Frank Harrell has been promoting the use of ordinal logistic regression (aka the proportional-odds model) for numerical data, to obtain distribution-free estimates of means and medians? Surprised? Did you hear about the Mann-Whitney (Wilcoxon) or Kruskal-Wallis tests? They are nothing but special cases of the ordinal logistic regression! And you can express them just as linear models over ranked numeric data!
See? All dots connect - because they must - all those are regressions, only with different interpretations and applications, but the underlying concepts are shared by all of them.
If you prefer reading on Medium (also for non-members): https://lnkd.in/dpDXr8qQ
Let's Mortal Combat begin!
Well, it's kinda... awkward for me to write about something that is (or should be) obvious to anyone working with statistics, but which in the last decade has been distorted by hundreds of thousands of members of the Machine Learning community, so that today the lie has replaced the truth...
I remember the first time when, during some discussion, I said: "I've been using logistic regression for many years on a daily basis for regression and testing hypotheses, but I've never used it for classification", and a Data Scientist (with a PhD degree) told me that I must be mistaken, because "despite its name, logistic regression is not a regression algorithm". I asked him: "then tell me, please, what do I do every day at work???" He replied: "I have no idea, but this sounds like pure nonsense, because logistic regression predicts only two binary outcomes, so you understand it cannot be a regression".
I was shocked.
For a long time, people (mostly researchers and statisticians) had been reporting to me that similar situations happened to them during interviews and internet discussions. I did a small investigation, and its results knocked me off my feet. I "googled" terms like "logistic regression is not (a) regression", "logistic regression is a misnomer" or "logistic regression, despite its name". The number of findings was huge - they occurred everywhere: in articles, tutorials and courses (including ones issued by companies offering paid content), blogs, books (including ML bestsellers written by people holding PhDs), YouTube videos. I also repeated the search on LinkedIn and found an endless flood of posts repeating this misinformation, just copy-pasted from others' posts.
/ PS: this reveals the sad fact that people far too often thoughtlessly repeat what they find on the Internet, without any fact-checking! /
Not only that! I asked ChatGPT 3 (then 3.5) and got identical results. No surprise! If it was "fed" misinformed sources, then it learned the misinformation, and today it "helps" spread it to learners. And often the learners are those who may not even suspect that something is wrong, so they trust the AI and repeat the nonsense further and further.
There is not a single week on LinkedIn without someone repeating it and earning hundreds of likes - proving that hundreds of people liked it (so tens of thousands saw it) and... will likely repeat the same.
Finally, I decided to write a few words about this "issue". I write from the perspective of a clinical biostatistician working in clinical trials - the part of the pharmaceutical industry responsible for the evaluation and approval of both existing and new therapies (drugs, procedures, devices). Here, in clinical trials, logistic regression is the key regression algorithm, used to answer questions about treatment efficacy and safety based on data from trials with binary endpoints (success/failure).
Some of my readers might have heard that I have never used logistic regression for classification during the whole time of my professional career. That's right.
Birth of the logistic regression and the... Nobel Prize
The origins of the logistic function can be traced back to the 19th century (free PDF), where it was employed in a model of population growth. Early attempts (1930s) to model binary data in the regression manner resulted in the probit regression model (Bliss, Gaddum), which constituted the standard for the next few decades. Researchers found its output not very intuitive, so they searched for a regression model whose coefficients would be easier to interpret. As early as 1944, Joseph Berkson started working (on bioassay experiments) on an alternative to the probit model, and the "logit" model (by analogy to "probit") was born. Unfortunately, the logit model was rejected by many as inferior to the probit model. This slowly changed around the 1950s, when George Dyke and H. Patterson published their paper on applying the linear logistic model to cancer survey data ("Analysis of Factorial Arrangements When the Data Are Proportions"). But it took many more years (roughly 1960-1970) until the logit model gained similar "trust", finally refined by Sir David Cox ("The regression analysis of binary sequences", 1958, and "Some procedures connected with the logistic qualitative response curve", 1966).
/ BTW, check also the list of other publications of this Great Mind of Statistics, especially "Analysis of Binary Data (Google Books)" /
Let me make a digression and recall that Sir David Cox, while working on binary-response problems, developed not only the logistic regression but also the survival regression model (named after him: the Cox regression), employing the conditional survival function. See? People tried to approach this problem from various perspectives - we should also briefly mention the latent-variable formulation (using both the logistic and the Gaussian distribution).
Almost in parallel came the multinomial logit model (Cox, Theil), which finally, in 1973, allowed Daniel McFadden, a famous econometrician, to piece the existing puzzles together - including Duncan Luce's choice axiom - into a whole, resulting in a theoretical foundation for the logistic regression. At that time, McFadden was deeply involved in pioneering work on the theoretical basis of discrete choice, where he applied the logistic regression in empirical analysis. His work, which made a profound impact on the analysis of discrete choice problems in economics and other fields, earned him the Nobel Prize in 2000.
I think we can fairly say that Daniel McFadden's work on the logistic (ordinary and multinomial) regression model and the discrete choice analysis was truly groundbreaking. It played a significant role in establishing logistic regression as a solid tool in statistical analysis, not only in econometrics!
Remember the rejection of the logit model, found inferior to the probit one? Now the situation has reversed, and logistic regression is today's default approach.
The 1970s were truly fruitful for logistic regression! In 1972, Sir John Nelder and Robert Wedderburn, in their seminal work (free PDF), introduced the idea of a unified framework: the Generalized Linear Model (GLM). It enabled regression models to cope with response variables of any type (counts, categories, continuous), through various conditional (on the predictor) distributions, including the Bernoulli (binomial with k=1), Poisson, gamma and Gaussian, with appropriate link functions (log, logit, reciprocal, identity), relaxing the assumption of normally distributed errors for inference.
/ Logistic regression is a special case of the GLM. You can spot it easily when working with R: when you call the glm() function, you need to specify the family of the conditional response distribution - here "binomial" - along with the appropriate link - here "logit": glm(family = binomial(link = "logit")) /
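To make this concrete, here is a minimal sketch in R. All the data and variable names here are made up for illustration; the point is only that the model is fitted exactly like any other GLM, and its predictions are numbers, not classes:

```r
# Hypothetical data: a dose of a drug and a binary outcome (1 = success)
set.seed(42)
dose     <- runif(200, 0, 10)
response <- rbinom(200, size = 1, prob = plogis(-2 + 0.5 * dose))

# Logistic regression = GLM with a Bernoulli/binomial conditional
# distribution and the logit link
fit <- glm(response ~ dose, family = binomial(link = "logit"))

summary(fit)                           # regression-style inference (Wald tests)
head(predict(fit, type = "response")) # numeric output: the estimated E(Y|X=x)
```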
Just a decade later, two other big names you surely know, Prof. Trevor Hastie and Prof. Robert Tibshirani, extended the Generalized Linear Model to the Generalized Additive Model. In their articles (e.g. "Generalized Additive Models for Medical Research", https://doi.org/10.1177/096228029500400) they mention the role of logistic regression in the identification of, and adjustment for, prognostic factors in clinical trials and observational studies.
/ Did you know that Professor Trevor Hastie authored the glm() command in the S-PLUS statistical suite, whose language, S, is the father of GNU R? Yes, S is the origin of the R syntax and was still in use a few years ago; I did statistical analyses in S-PLUS myself. /
Additional extensions for handling repeated observations were made by Kung-Yee Liang and Scott L. Zeger in 1986, via Generalized Estimating Equations (GEE), and by Breslow, Clayton and others around 1993, when the theory of Generalized Linear Mixed Models (GLMM) was born.
I can only imagine McFadden's and others' reaction to the nonsense "logistic regression is not a regression"...
Conditional expectation - the key to understanding the GLM
Every regression describes a relationship between the predictor and some function of the conditional response. It can be a quantile, Qτ(Y|X=x), as in quantile regression. Or some trimmed estimator of the expected value, as in robust regression. Or the expected value of the conditional response (= the conditional expectation) itself, as in classic linear regression: E(Y|X=x).
/ so often confused with one of its estimation algorithms --> "OLS regression" - don't repeat that mistake. /
Now, it's all about the conditional distribution. If it's Gaussian (the normal distribution), you obtain linear regression. But the GLM also allows you to use other distributions: Bernoulli (or binomial), gamma, Poisson, negative binomial, etc. The problem is that then the conditional expectations are no longer linearly related to the predictor, which is something we really want. That's why we have the link function, linking the conditional expectation and the predictor for a given conditional distribution: g(E(Y|X=x)) = Xb (sometimes you will see this formula reversed: E(Y|X=x) = g⁻¹(Xb); it's an equivalent formulation).
Now the expected values are "linearized" with respect to the predictor. For ordinary linear regression you don't need that, so g() is just I(), the identity function, which we omit - the expected values lie on a straight line, plane, or hyperplane (depending on how many predictors you have).
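A quick way to see the link at work, continuing the hypothetical glm() fit from the sketch above: on the link scale the fitted conditional expectations are exactly linear in the predictor; on the response scale they are the sigmoid-transformed probabilities.

```r
lp <- predict(fit, type = "link")      # g(E(Y|X=x)) = Xb, linear in the predictor
p  <- predict(fit, type = "response")  # E(Y|X=x) = g^-1(Xb), the probabilities
all.equal(p, plogis(lp))               # TRUE: the link is just a transformation
```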
/ The meaning of "conditional expectation" is also perfectly visible when you do ANOVA - that's a perfect 1:1 example: the levels of the categorical predictor(s) "form" sub-distributions, and the mean is calculated in each one. Now you also understand what "expected value CONDITIONAL on the predictor" means! /
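A tiny illustration of that ANOVA intuition, again on made-up data: the fitted values of a linear model with a categorical predictor are exactly the per-group sample means, i.e. the expected values conditional on the predictor.

```r
set.seed(1)
group <- gl(3, 30, labels = c("A", "B", "C"))   # a categorical predictor
y     <- rnorm(90, mean = c(10, 12, 15)[group]) # one sub-distribution per level

tapply(y, group, mean)                   # E(Y | group): one mean per level
unique(round(fitted(lm(y ~ group)), 6))  # the same means, obtained via regression
```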
Below we can observe various conditional distributions and their means. The means lie on a straight line transformed by the g() function, the link.
/ OK, I know, the illustration isn't perfect, simplifications are made, but let's agree on its imperfection, as long as it shows the main idea, huh? /
BTW: this is well explained in a book I recommend reading:
Now, let's answer a few questions:
I hope you can see from this that logistic regression, like any other regression, predicts a numerical outcome, NOT a categorical one.
Q: But, Adrian! In my preferred ML toolkit the logistic regression returns just classes!
A: Sure, because ML focuses on classification, so it takes an ADDITIONAL STEP and turns the probabilities into class labels. In other words, your procedure turns the logistic regression into a logistic classifier. The two are NOT the same and serve DIFFERENT purposes!
How is the logistic regression turned into a classifier?
The outcome of the logistic regression - the conditional probability (which is why logistic regression is also called a "direct probability estimator") - is subjected to an IF-THEN-ELSE decision rule, which compares it against some threshold (usually 0.5, but this shouldn't be taken for granted!) and returns a category:
IF (p < 0.5) THEN A ELSE B
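In R, that extra step could look like the sketch below, reusing the hypothetical fit from earlier (the 0.5 threshold and the labels A/B are, of course, just assumptions):

```r
p <- predict(fit, type = "response")          # the regression output: P(Y=1 | X=x)
predicted_class <- ifelse(p < 0.5, "A", "B")  # the ADDITIONAL classification step
```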
- Wait, but this is NOT a regression! This USES the regression prediction instead!
Glad you spotted it!
Too often people do not, and they just repeat that "logistic regression predicts a binary outcome". And when I ask them "but what about the regression term in its name, which means it should predict a numerical value?", they respond: "Oh! It's a misnomer! Despite its name, logistic regression isn't a regression, because it doesn't predict a numerical outcome!".
In other words, they do something like this:
... making a direct jump from binary input to binary output:
But notice that they did not change the name accordingly. Instead of calling it a "Logistic Classifier", the ML community kept the name "Logistic Regression". We could say they "appropriated" the logistic regression.
Consequently, they have problems with justifying the existing name.
Isn't this just crazy?
Now please, re-read the points 1-6 to see how ridiculous this approach is.
Despite the numerous regression-related problems in which the logistic regression is used every day, the situation looks like this:
So, once and for all, let's recall the difference between a logistic regression and a logistic classifier:
But everyone uses logistic regression for classification!
Ah, argumentum ad populum ;]
OK then:
So while I can understand someone saying that "in ML, logistic regression is a classification algorithm", I cannot agree that "logistic regression is not a regression". A single specific application, which adds extra steps and produces a different (categorized) output, does not invalidate the "core" engine.
The fact that a tomato can be used to cook a soup (involving many steps) does not mean that "tomato is not a fruit - it is a misnomer, because tomato is a soup ingredient". It's that simple.
Look at how logistic regression was described years before this "non-regression" nonsense, when statisticians were developing the basic tools now called "Machine Learning". This is an excerpt from Prof. Harrell's paper, The Practical Value of Logistic Regression:
See the sentence "[...] of choice for many regression-type problems [...]"? It may sound weird to ML and Data Science specialists, but that's exactly how statisticians treat and use this very element of the Generalized Linear Model. Even more: this is exactly why it was invented and further developed by Berkson, McFadden, Cox, and others.
By the way, Professor Frank Harrell wrote a series of papers (and covered it also in his book, "Regression Modeling Strategies") about applying the ordinal logistic regression (aka the proportional-odds model) to numerical data. This way you can, for example, test hypotheses (for any number of categorical predictors = factors and their interactions, also adjusted for numerical covariates!) in a distribution-free manner. Surprised? But the Mann-Whitney (aka Wilcoxon) and Kruskal-Wallis tests are nothing but special cases of the ordinal logistic regression! Even better, you can obtain the empirical CDF of the data and estimate both the arithmetic mean and quantiles from it! Check the "rms" R package and this website - the digitized version of his famous Regression Modeling Strategies book: https://hbiostat.org/rmsc/cony
See? An ordinal regression model used for numerical data, like any other regression model, allows you to estimate the empirical CDF and predict means and quantiles - see the sketch below!
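Here is a minimal sketch of that workflow with the rms package, on hypothetical two-arm data (the data are made up; the functions orm(), Mean(), Quantile() and Predict() are used as documented in rms):

```r
library(rms)

# Hypothetical two-arm numeric data
set.seed(1)
d <- data.frame(group = gl(2, 50, labels = c("ctrl", "trt")))
d$y <- rexp(100, rate = ifelse(d$group == "trt", 0.5, 1))
dd <- datadist(d); options(datadist = "dd")

f <- orm(y ~ group, data = d)    # ordinal logistic (proportional-odds) regression
anova(f)                         # Wald test, closely related to Wilcoxon/Kruskal-Wallis

M   <- Mean(f)                   # builds a function estimating the conditional mean
qu  <- Quantile(f)
med <- function(lp) qu(0.5, lp)  # conditional median
Predict(f, group, fun = M)       # distribution-free estimate of E(Y|group)
Predict(f, group, fun = med)     # distribution-free estimate of the median per arm
```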
If ordinal logistic regression is a regression, and multinomial logistic regression is a regression, then why is the "ordinary" logistic regression NOT a regression? Can you see the nonsense in denying its regression nature?
Q: But Adrian, logistic regression returns a probability, which is used for classification, so the true nature of the logistic regression is a classifier anyway!
A: OK, but the fact that something exhibits "some nature" doesn't invalidate its "original nature", especially since the "original nature" drove its invention.
In the range of (roughly) 0.2-0.8, the sigmoid curve can be approximated by a linear segment, for instance one obtained from the... linear regression. You can treat the resulting prediction as a probability and use it for classification (in the past this was indeed done, under the name "Linear Probability Model"; Link 1, Link 2). Does that make linear regression a classifier? (check-mate, ML?) Well, probably... yes, partially, in the given range - because why not? Does it mean that "linear regression is NOT a regression"? I guess not. See the sketch below.
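A quick sanity check of that claim, as a sketch on simulated data: in the middle of the probability range, a plain lm() on the 0/1 outcome (the Linear Probability Model) tracks the logistic probabilities closely.

```r
set.seed(3)
x <- runif(500, 0, 10)
y <- rbinom(500, 1, plogis(-2.5 + 0.5 * x))

p_logit <- predict(glm(y ~ x, family = binomial), type = "response")
p_lpm   <- predict(lm(y ~ x))         # Linear Probability Model

mid <- p_logit > 0.2 & p_logit < 0.8  # the roughly-linear part of the sigmoid
summary(abs(p_logit - p_lpm)[mid])    # discrepancies are small in this range
```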
It only shows that one out of many applications is possible.
Regression-related applications of the logistic regression (and its friends)
Many times I have mentioned that logistic regression is used by me and other statisticians for non-classification, regression tasks. Believe me, there is NO difference from any other regression!
In my field, clinical trials, I use the logistic regression on an almost daily basis for:
Well, definitely - very "non-regression" applications. All "misnomers" - "misnomers everywhere"...
Friends of the logistic regression
Logistic regression has many friends that were invented to address various regression-related problems. Let us enumerate and briefly describe them:
... be my sweet model ...
I really like the name "model". It's a very... "inclusive" name. A model can have many purposes. I use the logistic model for regression related tasks: inference about the model effects and predictions. I also use it to check the MCAR (missing completely at random) missing data pattern.
You can use it to derive a classifier (that could be derived also from perceptron, for instance).
My colleague uses it for propensity score matching in observational studies.
It's also part of the Inverse Probability Weighting (IPW) method, which itself has multiple applications (e.g. handling monotone dropout with GEE estimation).
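Both of those uses rely on the fitted probabilities, not on class labels. A minimal sketch (entirely hypothetical variable names and model):

```r
set.seed(7)
d <- data.frame(age = rnorm(300, 50, 10), female = rbinom(300, 1, 0.5))
d$treated <- rbinom(300, 1, plogis(-3 + 0.05 * d$age + 0.3 * d$female))

# Logistic regression of treatment assignment on baseline covariates
ps_model <- glm(treated ~ age + female, family = binomial, data = d)
d$ps <- predict(ps_model, type = "response")              # propensity score
d$w  <- ifelse(d$treated == 1, 1 / d$ps, 1 / (1 - d$ps))  # IPW weights
```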
Maybe we should call it a Poisson model? A logistic model? A linear model?
... ... ...
PS: but still, the logistic model provides the E(Y|X=x), doesn't it?
Literature
I will populate this chapter with textual references later. For now, find the "collage" of covers. And believe me, none of these books will say that "logistic regression is not a regression" :)
+ recently found an excellent one:
Other authors also prove it can be done properly:
Ad hoc comments from my readers
A: Of course they did! It's a book about machine learning, so this kind of application is of interest and highly expected. BUT they never said it's not a regression model. They both also wrote a series of articles on applying proportional hazards models and the logistic regression in biostatistical settings (they worked in a division of biostatistics) in the regression manner (assessment of prognostic factors, assessment of the treatment effect), and there they call it a regression model.
Also, in the book you mention, on pages 121-122 and in the examples that follow, they say: "Logistic regression models are used mostly as a data analysis and inference tool, where the goal is to understand the role of the input variables in explaining the outcome. Typically many models are fit in a search for a parsimonious model involving a subset of the variables, possibly with some interaction terms."
A:
A: ChatGPT will repeat what it was trained on. Don't rely on it too strictly when you are learning a new topic, because what you will be told strongly depends on how you ask. It was trained on a mix of good and bad resources, so sometimes the valid one is "allowed to speak", but just a few questions later it may be messing things up again. This pertains to ANY kind of topic, not only statistics. ALWAYS verify the responses of any AI-based system if you are going to learn from it, pass your exams or an interview, or do your job.
PS: I was told that the newest version of ChatGPT is much better, so give it a try.
A: Either use the name "logistic classifier", to highlight that it uses the regression "engine" under the hood, or state it precisely, for example: "Although logistic regression was originally invented to solve regression problems (McFadden, Cox, Nelder, Wedderburn, Hastie, Tibshirani) and is used this way by statisticians nowadays (for example in experimental research), Machine Learning specialists use it exclusively for classification purposes, adding one more step - a conditional decision rule based on a threshold - and turning the predicted conditional expectation into a classifier".
OK, just to summarize:
I hope that after reading this story, the "inner temptation" to repeat "logistic regression is not a regression" can be silenced once and for all.