How to do a Logistic Regression in R
It's a big post, so don't give up! And if by any chance you feel like closing the window, please take the pain to scroll down to the bottom and comment "this post is awesome!"...ummm...yeah.
So, getting down to business -
Regression is the "I TOLD YA SO!" dude of Data science. So, yup, it predicts!
A safe definition would be -
Regression is a statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of it: simple linear, multiple linear, polynomial, logistic, Poisson, etc.
Blah!
This post is dedicated to logistic regression. I will start by giving some theory around logistic regression. I know theory is kind of boring, but one must know the concepts. I have tried to keep the language as simple as possible, but if you find it difficult to understand then that's great, it tells me I am getting smarter.
Then I have taken the famous Titanic data set and run a regression on it in R. In between you will come across a video on Receiver Operating Characteristics; I highly recommend you watch it.
What is Logistic regression?
When the dependent variable is categorical in nature, the regression we use is logistic regression. It can be binomial, multinomial, or ordinal. I will restrict the scope of this post to binomial.
What is Binomial Logistic Regression?
When the dependent variable of a model is dichotomous in nature, that is, it can have only two outcomes, the regression is binomial logistic regression.
So, here we have a "Yes" or "No" kind of dependent variable. The dependent variable for logistic regression is the probability of success of an event.
Now, when probability is the dependent variable, the first thing to note is that it is limited between 0 and 1, and it is extremely difficult to limit a weighted combination of independent variables to this range. So instead of the probability we take the odds, the ratio of the probability of success of an event to the probability of its failure. Odds can be anything between 0 and infinity, but the problem is that they always remain positive, which again puts a limitation on the formula. To handle this we take the logarithm of the odds: when the odds are less than one the log value is negative, when the odds are more than one the log value is positive, and when the odds are exactly one the log is zero.
With this we get a generalized linear model that predicts a categorical dependent variable:

log(P / (1 - P)) = β0 + β1x1 + β2x2 + ... + βkxk

Where,
P is the probability of success and 1 - P is the probability of failure.
β0 is the intercept, that is, the value when all the independent variables are zero.
x1, x2, ..., xk represent the independent variables.
β1, β2, ..., βk represent the coefficients of the independent variables.
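To make the transformation concrete, here is a quick R sketch (the variable names are mine):

# Probability -> odds -> log-odds (the logit)
p <- c(0.1, 0.5, 0.9)    # probabilities of success
odds <- p / (1 - p)      # 0.111, 1, 9: always positive
log(odds)                # -2.197, 0, 2.197: unbounded in both directions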
Now, pay attention here -
Logistic regression is NOT based on the method of LEAST SQUARES; it is rather based on MAXIMUM LIKELIHOOD. That is, repeated iterations are made till we reach a point where we stop getting marginal improvements. At this point the model is said to attain convergence.
A few things you need to keep in mind while massaging the data, else no orgasm.
1. Make sure you don't have high multicollinearity between the independent variables. If this is the case, the model will not reach convergence.
2. Don't have a high number of categories within any variable. The maximum number of categories you should have is five. If you have more categories, you will need to do some clustering before the regression.
3. Make sure you don't have large proportions of empty cells. If you do, you need to remove them or replace them with constants.
Alright, I will now discuss a step-by-step process on how to do a logistic regression in R with the Titanic data set.
*Popping fingers*
I have taken this data from Kaggle. I am taking the train.csv one.
Step 1 : Understand the data
First I created an object Titanic and read "train.csv" into it. Then I summarized it to get an overview of its composition. Here, things like PassengerId, Name, Ticket, Fare and Cabin have too many levels as factors. Moreover, there is no point clustering them.
I have an added reason to remove Cabin, because out of 891 values, 687 are blank.
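Here is a minimal sketch of this step, assuming train.csv sits in the working directory (the column names follow the Kaggle file):

# Read the Kaggle training file and get an overview of its composition
Titanic <- read.csv("train.csv", stringsAsFactors = TRUE)
summary(Titanic)

# Drop the high-cardinality and mostly-blank columns discussed above
drop.cols <- c("PassengerId", "Name", "Ticket", "Fare", "Cabin")
Titanic <- Titanic[, !(names(Titanic) %in% drop.cols)]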
Step 2 : Convert the variables into proper data types and handle missing values
After summarizing, I found that R was treating variables like Survived and Pclass as integers. I converted them to factors.
Also, there were plenty of missing values for the Age variable. I replaced these missing values with the mean. The reason for using the mean is that it does not change the overall mean of the variable.
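A sketch of those conversions:

# Treat the coded categoricals as factors rather than integers
Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Pclass <- as.factor(Titanic$Pclass)

# Replace missing ages with the mean age; the variable's overall mean is unchanged
Titanic$Age[is.na(Titanic$Age)] <- mean(Titanic$Age, na.rm = TRUE)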
Step 3 : Check for multicollinearity
- The regression model will not work well if the data has high multicollinearity.
Next, I took a subset of the data with the variables having non-factor data types; you can't run a correlation on variables with factor data types. Then I checked for multicollinearity, and this is what I get -
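One way to do this, as a sketch:

# Correlate only the non-factor columns; cor() cannot handle factors
num.cols <- sapply(Titanic, is.numeric)
cor(Titanic[, num.cols])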
We can clearly see that there is no high collinearity between any of the variables. Moving on.
Step 4 : Using the 70:30 rule of thumb, generating training and testing data sets
The sample() function generates a random sample. set.seed() is used to ensure that while debugging we get the same sample; otherwise sample() will generate a different sample every time.
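A sketch of the split (the seed value 777 is arbitrary, and dev/test mirror the object names used below):

# Fix the seed so the same random split is drawn on every run
set.seed(777)
idx <- sample(1:nrow(Titanic), size = floor(0.7 * nrow(Titanic)))
dev <- Titanic[idx, ]     # 70% training set
test <- Titanic[-idx, ]   # 30% testing set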
Step 5 : Create a model on the training data set, the dev object in our case
We are using the glm() function here, which stands for generalized linear model. Since we are doing binary logistic regression, set the family to binomial(). The link is logit, which is nothing but the log of odds. ~ . indicates that we are considering all the variables of the data set to predict the Survived variable. You can check the formulae table for more clarity on this. Let's interpret the results -
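A sketch of the call, with dev as the training set created above:

# Binomial logistic regression of Survived on all remaining variables
Model <- glm(Survived ~ ., family = binomial(link = "logit"), data = dev)
summary(Model)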
We can ignore the residuals in most cases, since the model is not based on OLS but on maximum likelihood.
Pay attention to the last column and the asterisks. The variables which have asterisks have a statistically significant relationship with the dependent variable. So here class, age, sex and siblings have a statistically significant relationship with survival.
As you can observe, not all the factor levels of the variables have been taken. One level from each variable has been taken as the base. The base case is always taken alphabetically, or in the case of numeric factors, in ascending order. So here Pclass1 has been taken as the base case for class and female has been taken as the base case for sex.
If you look at the first column, you will observe that all the significant variables are negatively related with survival.
The AIC is the Akaike Information Criterion. You use this while comparing multiple models; the model with the lower AIC value is better.
Null deviance relates to a model which has only the intercept, and residual deviance relates to our model with the independent variables. These are actually chi-squared statistics with the corresponding degrees of freedom.
We basically use these to test the null hypothesis that the model with only the intercept and the model with the independent variables added fit the data equally well.
If the resulting p-value is significantly small, we can reject the null hypothesis, which is what we do in our case.
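One way to compute the p-value for this test, as a sketch:

# The drop in deviance is chi-squared distributed with the drop in df
pchisq(Model$null.deviance - Model$deviance,
       df = Model$df.null - Model$df.residual,
       lower.tail = FALSE)   # a tiny p-value rejects the null hypothesis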
Next we are going to test our model on the testing data set. Before we do that, I request you to learn some basics about the ROC (Receiver Operating Characteristic) curve. This video presents it in a very clear manner -
Step 6 : Predict survival for the testing data set
> test$Prid <- predict(Model, test, type = "response")
We have created a new variable in the test data set and used the predict() function to predict the probability of survival based on our model.
Step 7 : Let's look at the ROC curve
We need the ROCR package for this. All evaluations with ROCR start by first creating a predictor object; for this we have used the prediction() function. As you can see, the inputs are the predictions of our model and the actual results, so we are scoring the performance of the predicted values against the actual values.
Next we have used the performance() function, which is used to extract various performance characteristics from the predictor object, which in our case is score. With this we first fetch the AUC, which comes to around 0.89, which is a good value.
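A sketch of those two calls (Prid is the prediction column created in Step 6):

library(ROCR)

# Predictor object: model scores paired with the actual outcomes
score <- prediction(test$Prid, test$Survived)

# Fetch the area under the ROC curve from the predictor object
auc <- performance(score, "auc")
auc@y.values[[1]]   # around 0.89 for this model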
Let's have a look at the confusion matrix before proceeding ahead.
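As a sketch, at a default 0.5 cut-off the confusion matrix can be tabulated like this:

# Rows: actual outcome; columns: predicted outcome at a 0.5 cut-off
table(Actual = test$Survived, Predicted = as.integer(test$Prid > 0.5))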
Next, we have plotted the curve between the TPR and FPR at various cut-offs, which is essentially your ROC.
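A sketch of the plot:

# TPR vs. FPR across all cut-offs, with selected cut-off values printed on the curve
perf <- performance(score, "tpr", "fpr")
plot(perf, colorize = TRUE, print.cutoffs.at = seq(0.1, 0.9, by = 0.1))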
As you can see, we are getting a good cut-off at around 0.4.
You can also visualize the model by the following simple logic -
The output of this will be -
I leave it up to you to assimilate this :)
Step 8 : Choosing the cut off. (Almost there) *Phew*
Next, I have taken a cut-off of 0.4. The floor() function rounds its input down to the nearest integer. And then I have created the confusion matrix.
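The exact expression from the original screenshot isn't shown here, but one version of the floor() trick looks like this (Pred.0.4 is my own name):

# floor(p + 0.6) equals 1 exactly when p >= 0.4, i.e. a 0.4 cut-off
test$Pred.0.4 <- floor(test$Prid + 0.6)
table(Actual = test$Survived, Predicted = test$Pred.0.4)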
Output -
So, the TPR at a cut-off of 0.4 is 78.2% and the FPR is 21.5%.
Again, TPR is how often you predict survival among the actual survivors. FPR is how often you predict non-survivors as survivors.
Let's take another cut-off.
So, at a cut-off of 0.52 we get -
The TPR now is 73% and the FPR is 12.5%.
So, you get the point: at each cut-off the TPR and FPR will vary. So how do you choose the cut-off? It is you who decides that, based on what you are looking for. Do you want a high TPR at the expense of a high FPR, or a low FPR at the expense of a low TPR? You gotta take the call, bro.
If you have stuck around to the bottom, thank you. I hope I did not waste your time. I will sincerely appreciate it if you share your thoughts on the post.
Till then,
Stay Awesome!