Logistic Regression
I have been studying Logistic Regression for some time. But every time I revisit it, my understanding deepens, or rather, a couple of new questions come up. I am still working toward a complete understanding, but I will try to share my experience so far.
Logistic Regression is a regression algorithm (refer to Edit 3) that is also used as a classification algorithm in predictive analytics. It is a GLM (generalized linear model) and follows a procedure similar to linear regression; however, its output is a probability rather than the continuous value of linear regression. In the simplest form, we consider just one outcome variable with two states: 0 or 1.
It is used for the classification of binary, linearly separable data; for multi-class classification, we use multinomial logistic regression. LogReg produces a formula that predicts the probability of the occurrence as a function of the independent variables.
Understanding the Target Variable and the meaning of "AVERAGE" in LogReg:
Let us assume a scenario where we have to predict the probability of a car being accident-prone, depending on variables such as how good the driver is, the mileage of the car, its pollution check, and other related factors. These variables can be continuous as well as categorical (let's set that discussion aside for later). We also assume that this data is linearly separable.
We will code an accident-prone car as 1 and every other car as 0. With this coding, the mean of the distribution equals the proportion of 1s. For example, if there are 300 cars in the distribution and 50 of them are coded 1, the mean of the distribution is 50/300 ≈ 0.167, the proportion of 1s. This mean is also the probability of drawing a car labeled 1 at random from the distribution (i.e., an accident-prone car): if we pick a car at random from our sample of 300, the probability that it is a 1 (accident-prone) is 0.167. Therefore, the proportion and the probability of a 1 are the same in such cases.
If we average over the entire training data, we get the likelihood that a random data point would be classified correctly by the system, irrespective of the class it belongs to. This likelihood is what the Logistic Regression learner tries to maximize; the method adopted for this is called maximum likelihood estimation.
The mean of a binary distribution coded this way is denoted p, the proportion of 1s. The proportion of 0s is (1 - p), which is sometimes denoted q. The variance of such a distribution is pq, and the standard deviation is sqrt(pq).
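These numbers can be checked with a short sketch in plain Python, using the article's 300-car example:

```python
# Sketch of the 300-car example: with 0/1 coding, the mean of the labels
# is the proportion of 1s, which is also the probability of drawing a 1
# at random. Variance and standard deviation follow the p*q formula.
labels = [1] * 50 + [0] * 250   # 50 accident-prone cars out of 300

p = sum(labels) / len(labels)   # mean = proportion of 1s
q = 1 - p                       # proportion of 0s
variance = p * q
std_dev = (p * q) ** 0.5

print(round(p, 3))         # 0.167
print(round(variance, 3))  # 0.139
```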
Now, we will get an equation (boundary function) as follows:
b0 + b1X1 + b2X2 + … + bkXk
Now take some point (x1 = a, x2 = b, ..., xk = k). Plugging these input values in, the output can fall into one of three scenarios: it can be negative, zero, or positive.
Therefore the output ranges over (-infinity, +infinity), while the target variable has only two values. We need some transformation to interpret the output. This is the whole point of doing regression, i.e., how a change in a predictor brings about a change in the output.
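A sketch with made-up coefficients (b0 = -1 and b1 = 2 are assumptions, not fitted values) shows the three scenarios: the linear combination can come out negative, zero, or positive, anywhere on the real line.

```python
# Sketch (hypothetical coefficients): the linear combination
# b0 + b1*x1 + ... + bk*xk ranges over (-infinity, +infinity),
# while the target variable is only ever 0 or 1.
def boundary(x, b0=-1.0, b1=2.0):
    return b0 + b1 * x

for x in (0.0, 0.5, 3.0):
    print(x, boundary(x))
# 0.0 -> -1.0 (negative), 0.5 -> 0.0 (zero), 3.0 -> 5.0 (positive)
```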
Basic Glossary:
Exponents, Logarithms, and Inverse functions:
Exponents
e ≈ 2.718
e^(a+b)= e^a * e^b
Inverse
An inverse function undoes some other function. It is useful when we do not know what our input value was:
y = f(x)
f^-1(y) = x
f(?) = 4 * ? = 98
Taking the inverse: f^-1(98) = 98/4 = 24.5
Check: f(24.5) = 4 * 24.5 = 98
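The worked example translates directly into a minimal sketch:

```python
# Sketch of the inverse-function example: f multiplies its input by 4,
# and the inverse divides by 4 to recover the unknown input.
def f(x):
    return 4 * x

def f_inverse(y):
    return y / 4

x = f_inverse(98)
print(x)       # 24.5
print(f(x))    # 98.0
```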
Logarithms
4^? = 64
log(64) = log(4^n)   (? = n)
log(4^3) = log(4^n)
n = 3
(Here, the base of the log is 4.)
So, what did we learn from the inverse function?
The logarithm is the inverse function of exponentiation. Logarithm and exponentiation are inverses with respect to the same base b.
The logarithm gives the number (n = 3) by which we need to exponentiate b (4) in order to get y (64).
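A quick numerical check of this idea: `math.log` takes an optional base argument, so the base-4 logarithm answers which exponent n satisfies 4**n == 64 (note that 4**3 = 64).

```python
import math

# Sketch: the logarithm is the inverse of exponentiation.
# log base 4 of 64 answers: which n satisfies 4**n == 64?
n = math.log(64, 4)      # math.log takes an optional base argument
print(round(n, 10))      # 3.0
print(4 ** 3)            # 64
```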
The relation between probabilities and the odds ratio, and why we need the odds ratio:
Probability
Probabilities range between 0 and 1. Let's say that the probability of success is .8, thus p = .8. Then the probability of failure is q = 1 - p = .2.
Odds
Odds are determined from probabilities and range between 0 and infinity. Odds are defined as the ratio of the probability of success to the probability of failure, or:
p(occurrence of event) / p(non-occurrence of event)
The odds of success are odds(success) = p/(1-p) or p/q = .8/.2 = 4,
that is, the odds of success are 4 to 1.
The odds of failure would be
odds(failure) = q/p = .2/.8 = .25.
You can switch back and forth between probability and odds—both give you the same information, just on different scales.
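These conversions are one-liners; a sketch with the article's p = 0.8:

```python
# Sketch: probability <-> odds conversions, using p = 0.8 from the text.
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(odds):
    return odds / (1 + odds)

print(round(prob_to_odds(0.8), 2))  # 4.0  (odds of success)
print(round(prob_to_odds(0.2), 2))  # 0.25 (odds of failure)
print(odds_to_prob(4.0))            # 0.8  (back to probability)
```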
Next, we will bring in another variable (gender) so that we can compute an odds ratio.
What is the odds ratio?
A ratio of two odds, simple.
Suppose that seven out of 10 males are admitted to an engineering school, while three out of 10 females are admitted. The probabilities for admitting a male are p = 7/10 = .7 and q = 1 - .7 = .3: if you are male, the probability of being admitted is 0.7 and the probability of not being admitted is 0.3.
Here are the same probabilities for females: p = 3/10 = .3 and q = 1 - .3 = .7. If you are female, it is just the opposite: the probability of being admitted is 0.3 and the probability of not being admitted is 0.7.
Now we can use the probabilities to compute the odds of admission for both males and females,
odds(male) = .7/.3 = 2.33333
odds(female) = .3/.7 = .42857
Next, we compute the odds ratio for admission: OR = 2.3333/.42857 = 5.44. Thus the odds of a male being admitted are 5.44 times the odds for a female.
OR(female) = .42857/2.3333 = 0.1836
Thus the odds of a female being admitted are about 0.184 times the odds for a male; in other words,
females have roughly 82% lower odds of getting admitted than males.
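The admission example as a sketch:

```python
# Sketch of the admission example: 7 of 10 males admitted, 3 of 10 females.
p_male, p_female = 7 / 10, 3 / 10

odds_male = p_male / (1 - p_male)        # odds of admission for males
odds_female = p_female / (1 - p_female)  # odds of admission for females
odds_ratio = odds_male / odds_female

print(round(odds_male, 4))    # 2.3333
print(round(odds_female, 4))  # 0.4286
print(round(odds_ratio, 2))   # 5.44
```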
Log Odds Ratio
Sometimes people report the log of the odds-ratio instead of the odds-ratio itself.
log OR = 0.34
Now, using the inverse function (exponentiation), the odds-ratio is e^0.34 ≈ 1.405.
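A sketch of that inverse step: `math.exp` undoes the natural log.

```python
import math

# Sketch: exponentiating a log odds-ratio recovers the odds-ratio,
# since exp is the inverse of the natural log.
log_or = 0.34
odds_ratio = math.exp(log_or)
print(round(odds_ratio, 3))             # 1.405

# Taking the natural log goes back to where we started:
print(round(math.log(odds_ratio), 2))   # 0.34
```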
Conclusion: if O1 is the odds of an event in the treatment group and O2 is the odds of the event in the control group, then the odds ratio is O1/O2. It is a way of measuring the effect of the treatment on the odds of the event.
Why do we use the odds-ratio and not the probabilities?
The odds ratio represents the constant effect of a predictor X on the likelihood that one outcome will occur.
In regression models, we often want a measure of the unique effect of each X on Y. If we try to express the effect of X on the likelihood of a categorical Y taking a specific value through probability, the effect is not constant.
That means there is no single number that expresses how X affects Y in terms of probability: the effect of X on the probability of Y differs depending on the value of X.
We cannot say that a b1 amount of change in variable x1 brings some fixed amount of change in y; the probabilities keep changing. The whole point of regression is to measure change through the coefficients.
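A sketch with made-up coefficients (b0 = -2 and b1 = 1 are assumptions) makes the point concrete: each unit of x adds exactly b1 to the log-odds, but the change in probability depends on where x is.

```python
import math

# Sketch (hypothetical coefficients b0 = -2, b1 = 1): a one-unit increase
# in x always adds b1 = 1 to the log-odds, but the resulting change in
# probability is different at different values of x.
def prob(x, b0=-2.0, b1=1.0):
    z = b0 + b1 * x                # the linear part (log-odds)
    return 1 / (1 + math.exp(-z))  # sigmoid: log-odds -> probability

def logit(p):
    return math.log(p / (1 - p))   # probability -> log-odds

for x in (0.0, 2.0, 5.0):
    delta_p = prob(x + 1) - prob(x)
    delta_logit = logit(prob(x + 1)) - logit(prob(x))
    print(x, round(delta_p, 3), round(delta_logit, 3))
# delta_logit is 1.0 every time; delta_p varies with x
```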
Going back to our problem of classifying accident-prone cars:
To resolve this, we need to cast the logistic regression problem in a form where the linear expression above can be used. Thus we compute the odds of the outcome:
odds(p) = p/(1-p)
But this only gives positive values, from zero to infinity, while we saw above (the three scenarios) that we need something spanning (-infinity, +infinity). So we transform it into the natural log of the odds, the logit:
logit(p) = log(p/(1-p))
The logit function thus acts as a link between logistic regression and linear regression, which is why it is called a link function.
We use the logit because of some important mathematical properties. For one, it often has a linear relationship with the levels of the predictor. Also, it can assume any value from -infinity to +infinity.
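A sketch of the logit's range: probabilities close to 0 and 1 map to large negative and large positive values, covering the whole real line.

```python
import math

# Sketch: the logit (log-odds) maps probabilities in (0, 1) onto the
# whole real line, matching the range of b0 + b1*x1 + ... + bk*xk.
def logit(p):
    return math.log(p / (1 - p))

for p in (0.001, 0.5, 0.999):
    print(p, round(logit(p), 3))
# 0.001 -> -6.907, 0.5 -> 0.0, 0.999 -> 6.907
```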
We achieved what we wanted!
In simple terms, let's remember one of the most fundamental rules of algebra: you can do anything you want to one side of an equation, as long as you do the exact same thing to the other side.
After that transformation, we fit a linear regression, and the coefficients come from the results of that fit. Therefore, the interpretation of a coefficient is:
For every unit increase in the predictor variable, the logit (or log of the odds) of the outcome changes by the amount of the coefficient.
Assumption of LogReg: there is a linear relationship between the log-odds (of the positive class, i.e., 1) and the variables of our data.
But we do not think in a logarithmic scale, so we transform the coefficient back into an odds ratio by simply exponentiating it.
(This follows from what we studied about inverse transformations above; we do not directly know the probabilities or odds.)
Imp: points to ponder:
1. The logistic regression model is a non-linear transformation of w^T*x. (Something I was asked in an interview and was not able to answer: non-linear regression. For a better understanding, please refer to the link above, and to Edit 1 as well.)
2. None of the observations -- the raw data points -- actually fall on the regression line. They all fall on zero or one.
Interpreting the output of Logistic Regression:
If the odds ratio is above 1, increasing your predictor by 1 unit increases the odds of the outcome by (odds ratio - 1). For example, if the odds ratio is 1.14, the odds of the outcome increase by 14% (1.14 - 1 = 0.14) for every unit increase in the predictor.
If the odds ratio is below 1, increasing your predictor by 1 unit decreases the odds of the outcome by (1 - odds ratio). For example, if the odds ratio is 0.80, the odds of the outcome decrease by 20% (1 - 0.80 = 0.2) for every unit increase in the predictor.
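A sketch of this interpretation (b1 = 0.131 is a made-up coefficient, chosen only because exp(0.131) ≈ 1.14, the example odds ratio above):

```python
import math

# Sketch (hypothetical fitted coefficient b1 = 0.131): exponentiating a
# logistic-regression coefficient gives an odds ratio, which is read as a
# percentage change in the odds per unit increase in the predictor.
b1 = 0.131
odds_ratio = math.exp(b1)
print(round(odds_ratio, 2))   # 1.14

if odds_ratio > 1:
    print(f"odds increase by {round((odds_ratio - 1) * 100)}% per unit")
else:
    print(f"odds decrease by {round((1 - odds_ratio) * 100)}% per unit")
# odds increase by 14% per unit
```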
I hope it is helpful. Please do share your feedback and suggestions.
Edit 1: Logistic regression is non-linear in terms of odds and probability; however, it is linear in terms of log odds.
Edit 2: Why do we take the natural logarithm rather than another base, like log base 10 or 2?
The rate of change of y = e^x is y itself:
d/dx (e^(kx)) = k * e^(kx) = k * y (on a log scale, the change is just k)
However, d/dx (2^(kx)) = ln(2) * k * 2^(kx) = ln(2) * k * y (on a log scale, the change is ln(2) * k)
Basically, you can draw a graph of y = a^x for any value of a; the slope of that graph is always proportional to the y value at that point.
The magic happens when we choose a value of a for which the slope is exactly equal to the y value. That magic number is e (2.718...).
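This can be checked numerically with a finite-difference sketch: the slope of e^x equals y itself, while the slope of 2^x is ln(2) times y.

```python
import math

# Sketch: estimate slopes with a central difference. For y = e**x the
# slope/y ratio is 1 everywhere; for y = 2**x it is ln(2) everywhere.
h = 1e-6
for x in (0.0, 1.0, 2.5):
    y = math.exp(x)
    slope = (math.exp(x + h) - math.exp(x - h)) / (2 * h)
    print(round(slope / y, 4))    # 1.0 at every x

    y2 = 2 ** x
    slope2 = (2 ** (x + h) - 2 ** (x - h)) / (2 * h)
    print(round(slope2 / y2, 4))  # 0.6931, i.e. ln(2), at every x
```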
If you are looking for more understanding, refer to the videos at:
https://www.youtube.com/channel/UChHwtJYH2PwSK2caxvvftOQ
Edit 3: Thanks to pioneer @Adrian Olszewski for the correction.
"It is wrong to say that LR is not a regression. It's entirely a regression. You mean the logistic classifier built on top of LR. It's the binomial regression with logit link and, like any other regression, it models a *NUMERICAL* outcome: the probability of success. It's used for regression by thousands of statisticians, the same way as the Poisson, gamma, beta, multinomial, fractional and other flavours of regression. It dates back to the 19th century and was used for regression about 50 years before being applied to classification. Every econometrician, biostatistician, epidemiologist, etc. uses it on a daily basis to model log odds. Please correct the text; otherwise, if a student repeats that on an exam or interview led by a statistician, I guarantee the question will be failed, and more difficult "drilling" questions about the regression may be asked."