Logistic Regression


I have been studying Logistic Regression for some time. But every time I study it, my understanding deepens, or rather, a couple of new questions erupt. I am still in the process of building a proper understanding. However, I will try to share my experiences.

Logistic Regression is a regression algorithm (refer to Edit 3) that is also used as a classification algorithm in predictive analytics. It is a GLM (generalized linear model) and uses a procedure similar to that of linear regression; however, the output is a probability, unlike the continuous value in linear regression. In the simplest form, this means that we are considering just one outcome variable and two states of that variable: either 0 or 1.

It is used for the classification of binary, linearly separable data. For multi-class classification, we use multinomial logistic regression. LogReg produces a formula that predicts the probability of the occurrence as a function of the independent variables.

Understanding the target variable and the meaning of the "AVERAGE" line in LogReg:

Let us assume a scenario where we have to predict the probability of a car being accident-prone, depending on variables such as how good the person is as a driver, the mileage of the car, pollution checks, and some other related factors. These variables can be continuous as well as categorical (let's keep that discussion aside for later). We are also assuming that this data is linearly separable.

We will code an accident-prone car as 1 and everything else as 0. If we code like this, then the mean of the distribution is equal to the proportion of 1s in the distribution. For example, if there are 300 cars in the distribution and 50 of them are coded 1, then the mean of the distribution is 0.167, which is the proportion of 1s. The mean of the distribution is also the probability of drawing a car labeled 1 at random from the distribution (i.e., getting an accident-prone car). That is, if we take a car at random from our sample of 300, the probability that the car will be a 1 (accident-prone) is 0.167. Therefore, the proportion and the probability of a 1 are the same in such cases.

If we average over the entire training data, we get the likelihood that a random data point would be classified correctly by the system, irrespective of the class it belongs to. This is what the Logistic Regression learner tries to maximize. The method adopted for this is called maximum likelihood estimation.
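A minimal sketch of that quantity, with made-up labels and predicted probabilities (not from any fitted model):

import numpy as np

# hypothetical true labels and model-predicted probabilities of class 1
y = np.array([1, 0, 1, 0, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.4, 0.1])

# probability the model assigns to the *correct* class of each point
correct = np.where(y == 1, p_hat, 1 - p_hat)

print(correct.mean())         # average chance a random point is classified correctly
print(np.log(correct).sum())  # log-likelihood, the quantity MLE maximizes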

The mean of a binary distribution so coded is denoted p, the proportion of 1s. The proportion of 0s is (1 - p), which is sometimes denoted q. The variance of such a distribution is pq, and the standard deviation is sqrt(pq).
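A quick numeric check of these formulas on the hypothetical 300-car sample:

import numpy as np

labels = np.array([1] * 50 + [0] * 250)  # 50 accident-prone cars out of 300

p = labels.mean()                    # 0.1667, the proportion of 1s
q = 1 - p
print(p * q, labels.var())           # variance pq matches the population variance
print(np.sqrt(p * q), labels.std())  # standard deviation sqrt(pq)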

Now, we will get an equation (boundary function) as follows:

b0 + b1X1 + b2X2 + … + bkXk

Now we take some point (x1 = a, x2 = b, ..., xk = c). Plugging these input values in, we get an output that falls into one of the following three scenarios:

  1. The equation has a positive outcome, in (0, +infinity); the higher the magnitude of this value, the greater the distance between the point and the boundary (above the boundary).
  2. The equation has a negative outcome, in (-infinity, 0); the higher the magnitude of this value, the greater the distance between the point and the boundary (below the boundary).
  3. It can be zero, i.e., the point lies on the boundary.

Therefore the output ranges over (-infinity, +infinity), while the target variable has only two values. Now, we need some transformation to interpret the output. This is the whole point of doing regression, i.e., measuring how a change in a predictor brings about a change in the output.
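A small sketch of those three scenarios (the coefficients here are made up for illustration):

import numpy as np

b0, b = -1.0, np.array([2.0, -3.0])  # hypothetical coefficients

for x in [np.array([2.0, 0.5]),      # score  1.5 -> above the boundary
          np.array([0.0, 1.0]),      # score -4.0 -> below the boundary
          np.array([1.25, 0.5])]:    # score  0.0 -> on the boundary
    print(x, b0 + b @ x)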

Basic Glossary:

Exponents, Logarithms, and Inverse Functions:

Exponents

e ≈ 2.718

e^(a+b)= e^a * e^b

Inverse

An inverse function does the opposite of some other function. It is useful when we do not know what our input value was:

y=f(x)

f^-1(y) = x

f(?) = 4*? = 98

Taking the inverse: f^-1(98) = 98/4 = 24.5

f(24.5)=4*24.5=98
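The same idea as a tiny sketch in code (f is the multiply-by-4 function from the example):

def f(x):
    return 4 * x     # the original function

def f_inv(y):
    return y / 4     # its inverse undoes f

print(f_inv(98))     # 24.5 : recovers the unknown input
print(f(24.5))       # 98.0 : applying f confirms it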

Logarithmic

4^? = 64

log(64) = log(4^n) (? = n)

log(4^3) = log(4^n)

n = 3

(here, the base of the log is 4.)

So, what did we learn from the inverse function?

We can say that logarithm is the inverse function to exponentiation.

Logarithm and exponentiation are inverses of each other for the same base b.

Logarithms give the number (n = 3) we need to exponentiate b (4) by in order to get y (64).
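A one-line check with Python's math module:

import math

print(math.log(64, 4))  # 3.0 : the exponent n such that 4**n == 64
print(4 ** 3)           # 64  : exponentiation undoes the logarithm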

The relation between probabilities and the odds ratio, and why we need the odds ratio:

Probability

Probabilities range between 0 and 1. Let's say that the probability of success is .8, thus p = .8. Then the probability of failure is q = 1 - p = .2.

Odds

Odds are determined from probabilities and range between 0 and infinity. Odds are defined as the ratio of the probability of success and the probability of failure. Or,

p(occurrence of event)/p(non-occurrence of event)

The odds of success are odds(success) = p/(1-p) or p/q = .8/.2 = 4,

that is, the odds of success are 4 to 1.

The odds of failure would be

odds(failure) = q/p = .2/.8 = .25.

You can switch back and forth between probability and odds—both give you the same information, just on different scales.
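Since both scales carry the same information, converting between them is a pair of one-liners (a small sketch using the p = .8 example above):

def prob_to_odds(p):
    return p / (1 - p)        # odds = p/q

def odds_to_prob(odds):
    return odds / (1 + odds)  # back to a probability

print(prob_to_odds(0.8))      # 4.0 : odds of success, 4 to 1
print(odds_to_prob(4.0))      # 0.8 : the original probability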

Next, we will add another variable to the equation so that we can compute an odds ratio.

What is the odds ratio?

A ratio of two odds, simple.

Suppose that seven out of 10 males are admitted to an engineering school, while three out of 10 females are admitted. The probabilities for admitting a male are p = 7/10 = .7 and q = 1 - .7 = .3. If you are male, the probability of being admitted is 0.7 and the probability of not being admitted is 0.3.

Here are the same probabilities for females: p = 3/10 = .3 and q = 1 - .3 = .7. If you are female, it is just the opposite: the probability of being admitted is 0.3 and the probability of not being admitted is 0.7.

Now we can use the probabilities to compute the odds of admission for both males and females,

odds(male) = .7/.3 = 2.33333

odds(female) = .3/.7 = .42857

Next, we compute the odds ratio for admission: OR = 2.3333/.42857 = 5.44. Thus, the odds of admission for a male are 5.44 times the odds for a female.

OR(female) = 0.42857/2.3333 = 0.1837

Thus, the odds of admission for a female are 0.184 times the odds for a male, or

the odds of admission for females are about 82% lower than for males.
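The whole example in a few lines (numbers straight from the text):

odds_male = 0.7 / 0.3            # 2.3333
odds_female = 0.3 / 0.7          # 0.42857

print(odds_male / odds_female)   # 5.44   : OR, males vs. females
print(odds_female / odds_male)   # 0.1837 : OR, females vs. males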

Log Odds Ratio

Sometimes people report the log of the odds ratio instead of the odds ratio.

log(OR) = 0.34

Now, using the inverse function (exponentiation), the odds ratio is e^0.34 ≈ 1.405.
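Once more, exponentiation undoes the logarithm:

import math

print(math.exp(0.34))   # ~1.405 : back on the odds-ratio scale
print(math.log(1.405))  # ~0.34  : and back again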

Conclusion: if O1 is the odds of an event in the treatment group and O2 is the odds of the event in the control group, then the odds ratio is O1/O2. It's a way of measuring the effect of the treatment on the odds of the event.

Why do we use the odds ratio, not the probabilities?

The odds ratio represents the constant effect of a predictor X on the odds that one outcome will occur.

In regression models, we often want a measure of the unique effect of each X on Y. If we try to express the effect of X on the probability of a categorical Y having a specific value, the effect is not constant.

What that means is that there is no way to express in one number how X affects Y in terms of probability: the effect of X on the probability of Y takes different values depending on the value of X.

We cannot say that a b1 amount of change in variable x1 brings some fixed amount of change in y; on the probability scale, the change keeps varying. The whole point of regression is to measure a constant change, i.e., the coefficients.
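A small demonstration (with a made-up intercept and slope): each unit step in x multiplies the odds by the same constant factor e^b1, while the change in probability differs from step to step:

import math

b0, b1 = -2.0, 0.5  # hypothetical coefficients

def prob(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

for x in [0, 2, 4, 6]:
    p_lo, p_hi = prob(x), prob(x + 1)
    odds_ratio = (p_hi / (1 - p_hi)) / (p_lo / (1 - p_lo))
    print(round(p_hi - p_lo, 3), round(odds_ratio, 3))

# the probability change varies (0.063, 0.109, ...),
# but the odds ratio is always e^0.5 ~ 1.649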

Going back to our problem of classifying accident-prone cars.


To resolve that problem, we need to find a way to cast the logistic regression problem in a manner whereby at least the expression above can be used. Thus we compute the odds of the outcome as:

odds(p) = p/(1-p)

But this only gives positive values, ranging from zero to infinity, whereas we saw above (the three scenarios) that we need something spanning (-infinity, +infinity).

So we transform it into the natural log of the odds, or logit:

logit(p) = log(p/(1-p))



Thus the logit function acts as a link between logistic regression and linear regression, which is why it is called a link function.

We use the logit because of some important mathematical properties. For one, it often has a linear relationship with the levels of the predictor. Also, it can take any value from -infinity to +infinity.
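A minimal check of that range claim:

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

print(logit(np.array([0.001, 0.5, 0.999])))
# [-6.907  0.  6.907] : probabilities near 0 and 1 map far out on the real line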

We achieved what we wanted!

In simple terms, let's remember one of the most fundamental rules of algebra: you can do anything you want to one side of an equation, as long as you do the exact same thing to the other side.

After that transformation, we fit a linear regression. The coefficients come from the results of that linear regression. Therefore, the interpretation of a coefficient is:

For every unit increase in the predictor variable, the logit (or log of the odds) of the outcome changes by the amount of the coefficient.

Assumption of LogReg: there is a linear relationship between the log odds (of the positive class, i.e., 1) and the variables of our data.

(https://stats.stackexchange.com/questions/280535/log-odds-ratio-what-happens-if-linearity-fails)

But we do not think on a logarithmic scale. So we transform that coefficient back, just like the odds ratio above. To transform it, we simply exponentiate the coefficient.
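An end-to-end sketch on simulated data, using scikit-learn's LogisticRegression (an assumption of mine, not the author's setup; note that penalty=None requires scikit-learn 1.2+, older versions spell it penalty='none'):

import numpy as np
from sklearn.linear_model import LogisticRegression

# simulated data whose true log odds are -1 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
log_odds = -1 + 2 * X[:, 0] - X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = LogisticRegression(penalty=None).fit(X, y)  # unregularized fit

print(model.coef_[0])          # coefficients on the log-odds scale
print(np.exp(model.coef_[0]))  # exponentiated: odds ratios per unit increase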


(This follows from what we studied about inverse transformations above; we do not know the probabilities or odds directly.)



Important points to ponder:



  1. The regression line is a rolling average, just as in linear regression. The Y-axis is p, which indicates the proportion of 1s at any given value of x.
  2. The regression line is nonlinear.

The logistic regression model is a non-linear transformation of w^T * x.


(Something I was asked in an interview and was not able to answer: is logistic regression linear or non-linear? For a better understanding, please refer to the link above, and to Edit 1 as well.)

  3. None of the observations (the raw data points) actually fall on the regression line. They all fall on zero or one.

We could also plot the relation between the DV and the predictor variable as we do in regression.


Interpreting the output of Logistic Regression:

  • When the odds ratio is greater than 1, it describes a positive relationship.

If your odds ratio is above 1, increasing your predictor by 1 unit increases the odds of your outcome by (odds ratio - 1). For example, if the odds ratio is 1.14, the odds of the outcome increase by 14% (1.14 - 1 = 0.14) for every unit increase in your predictor.

  • An odds ratio less than 1 implies a negative relationship.

If your odds ratio is below 1, increasing your predictor by 1 unit decreases the odds of the outcome by (1 - odds ratio). For example, if the odds ratio is 0.80, the odds of the outcome decrease by 20% (1 - 0.80 = 0.20) for every unit increase in your predictor.
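The same arithmetic, spelled out:

or_up, or_down = 1.14, 0.80  # the example odds ratios from above

print((or_up - 1) * 100)     # ~14 : percent increase in the odds per unit of x
print((1 - or_down) * 100)   # ~20 : percent decrease in the odds per unit of x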

I hope it is helpful. Please do share your feedback and suggestions.

Edit 1: Logistic regression is non-linear in terms of odds and probability; however, it is linear in terms of log odds.



Edit 2: Why do we take the natural logarithm instead of any other, like log base 10, 2, etc.?

The rate of change of y = e^x is equal to y itself, which makes the natural base special:

d/dx (e^(kx)) = k * e^(kx) = k * y (the multiplier is just k)

However, d/dx (2^(kx)) = ln(2) * k * 2^(kx) = ln(2) * k * y (the multiplier picks up an extra ln(2) factor)

Basically, you can draw a graph of y = a^x for any value of 'a'. The slope of that graph is always proportional to the y value on the line at any point.

The magic happens when we choose a value of 'a' for which the slope is exactly equal to the y value. That magic number is 'e' (2.718...).
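A quick numerical check using finite differences:

import math

h, x = 1e-6, 1.3

print((math.exp(x + h) - math.exp(x)) / h, math.exp(x))
# the slope of e^x equals e^x itself

print((2 ** (x + h) - 2 ** x) / h, math.log(2) * 2 ** x)
# the slope of 2^x carries the extra ln(2) factor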

If you are looking for more understanding, refer to the videos of Niranjan Salimath:

https://www.youtube.com/channel/UChHwtJYH2PwSK2caxvvftOQ


Edit 3: Thanks to pioneer @Adrian Olszewski for the correction.

"LR is not a regression. It's entirely a regression. You mean logistic classifier built on the top of the LR. It's the binomial regression with logit link and - as any other regression - it models a *NUMERICAL* outcome - the probability of sucess. It's used for regression by thousand of statisticians the same way as the Poisson, gamma, beta or multinomial, fractional and other flavours of the LR. It dates back to 19th century and was used for regression about 50 years before applying it for classification. Every econometrician, every biostatistician, epidemiologist etc. uses it on daily basis to model log odds. Please find the explanation below and correct the text. Otherwise, if a student repeats that on an exam or interview led by a statistician, I guarantee the question will be failed, and more difficult "drilling" questions about the regression may be asked.."
