How to do a Logistic Regression in R
It's a big post, so don't give up! And if by any chance you feel like closing the window, please take the pain to scroll down to the bottom and comment "this post is awesome!"...ummm...yeah.
So, getting down to business -
Regression is the "I TOLD YA SO!" dude of Data science. So, yup, it predicts!
A safe definition would be -
Regression is a statistical technique that tries to explain the relationship between a dependent variable and one or more independent variables. There are various kinds of it: simple linear, multiple linear, polynomial, logistic, Poisson, etc.
Blah!
This post is dedicated to logistic regression. I will start by giving some theory around logistic regression. I know theory is kind of boring, but one must know the concepts. I have tried to keep the language as simple as possible, but if you find it difficult to understand then that's great, it tells me I am getting smarter.
Then I have taken the famous Titanic data set and run a regression on it in R. In between you will come across a video on Receiver Operating Characteristics; I highly recommend you watch it.
What is Logistic regression?
When the dependent variable is categorical in nature, the regression we use is logistic regression. It can be binomial, multinomial, or ordinal. I will restrict the scope of this post to binomial.
What is Binomial Logistic Regression?
When the dependent variable of a model is dichotomous in nature, that is, it can have only two outcomes, the regression is binomial logistic regression.
So, here we have a "Yes" or "No" kind of dependent variable. The dependent variable for logistic regression is the probability of success of an event.
Now, when probability is the dependent variable, the first thing to note is that it is limited between 0 and 1, and it is extremely difficult to limit a weighted combination of independent variables to this range. So instead of the probability we take the odds, the ratio of the probability of success of an event to the probability of its failure. Odds can be anything between 0 and infinity, but the problem is that they always remain positive, which again puts a limitation on the formula. To handle this we take the logarithm of the odds: when the odds are less than one the log value is negative, when the odds are more than one the log value is positive, and when the odds are exactly one the log is zero.
With this we get a generalized linear model that predicts a categorical dependent variable:

log(P / (1 - P)) = β0 + β1x1 + β2x2 + ... + βkxk

Where,
P is the probability of success and 1 - P is the probability of failure.
β0 is the intercept, that is, the value when all the independent variables are zero.
x1, x2, ..., xk represent the independent variables.
β1, β2, ..., βk represent the coefficients of the independent variables.
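To make the transformation concrete, here is a quick R sketch (the variable names are mine):

# Probability -> odds -> log-odds (the logit)
p <- c(0.1, 0.5, 0.9)    # probabilities of success
odds <- p / (1 - p)      # 0.111, 1, 9: always positive
log(odds)                # -2.197, 0, 2.197: unbounded in both directions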
Now, pay attention here -
Logistic regression is NOT based on the method of LEAST SQUARES; it is rather based on MAXIMUM LIKELIHOOD. That is, repeated iterations are made till we reach a point where we stop getting marginal improvements. At this point the model is said to attain convergence.
A few things you need to keep in mind while massaging the data, else no orgasm.
1. Make sure you don't have high multicollinearity between the independent variables. If this is the case, the model will not reach convergence.
2. Don't have a high number of categories within any variable. The maximum number of categories you should have is five. If you have more categories, you will need to do some clustering before the regression.
3. Make sure you don't have large proportions of empty cells. If you do, you need to remove them or replace them with constants.
Alright, I will now discuss a step-by-step process on how to do a logistic regression in R with the Titanic data set.
*Popping fingers*
I have taken this data from Kaggle. I am taking the train.csv one.
Step 1 : Understand the data
First I created an object Titanic and read "train.csv" into it. Then I summarized it to get an overview of its composition. Here, things like PassengerId, Name, Ticket, Fare and Cabin have too many levels as factors. Moreover, there is no point clustering them.
I have an added reason to remove Cabin, because out of 891 values, 687 are blank.
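Here is a minimal sketch of this step, assuming train.csv sits in the working directory (the column names follow the Kaggle file):

# Read the Kaggle training file and get an overview of its composition
Titanic <- read.csv("train.csv", stringsAsFactors = TRUE)
summary(Titanic)

# Drop the high-cardinality and mostly-blank columns discussed above
drop.cols <- c("PassengerId", "Name", "Ticket", "Fare", "Cabin")
Titanic <- Titanic[, !(names(Titanic) %in% drop.cols)]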
Step 2 : Convert the variables into proper data types and handle missing values
After summarizing, I found that R was treating variables like Survived and Pclass as integers. I converted them to factors.
Also, there were plenty of missing values for the Age variable. I replaced these missing values with the mean. The reason for using the mean is that it does not change the overall mean of the variable.
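A sketch of those conversions:

# Treat the coded categoricals as factors rather than integers
Titanic$Survived <- as.factor(Titanic$Survived)
Titanic$Pclass <- as.factor(Titanic$Pclass)

# Replace missing ages with the mean age; the variable's overall mean is unchanged
Titanic$Age[is.na(Titanic$Age)] <- mean(Titanic$Age, na.rm = TRUE)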
Step 3 : Check for multicollinearity
- The regression model will not work well if the data has high multicollinearity.
Next, I took a subset of the data with the variables having non-factor data types; you can't run a correlation on variables with factor data types. Then I checked for multicollinearity, and this is what I get -
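One way to do this, as a sketch:

# Correlate only the non-factor columns; cor() cannot handle factors
num.cols <- sapply(Titanic, is.numeric)
cor(Titanic[, num.cols])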
We can clearly see that there is no high collinearity between any of the variables. Moving on.
Step 4 : Using the 70:30 rule of thumb, generating training and testing data sets
The sample() function generates a random sample. set.seed() is used to ensure that while debugging we get the same sample; otherwise sample() will generate a different sample every time.
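A sketch of the split (the seed value 777 is arbitrary, and dev/test mirror the object names used below):

# Fix the seed so the same random split is drawn on every run
set.seed(777)
idx <- sample(1:nrow(Titanic), size = floor(0.7 * nrow(Titanic)))
dev <- Titanic[idx, ]     # 70% training set
test <- Titanic[-idx, ]   # 30% testing set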
Step 5 : Create a model on the training data set, the dev object in our case
We are using the glm() function here, which stands for generalized linear model. Since we are doing binary logistic regression, set the family to binomial(). The link is logit, which is nothing but the log of odds. ~ . indicates that we are considering all the variables of the data set to predict the Survived variable. You can check the formulae table for more clarity on this. Let's interpret the results -
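A sketch of the call, with dev as the training set created above:

# Binomial logistic regression of Survived on all remaining variables
Model <- glm(Survived ~ ., family = binomial(link = "logit"), data = dev)
summary(Model)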
We can ignore the residuals in most cases, since the model is not based on OLS but on maximum likelihood.
Pay attention to the last column and the asterisks. The variables which have asterisks have a statistically significant relationship with the dependent variable. So here class, age, sex and siblings have a statistically significant relationship with survival.
As you can observe, not all the factor levels of the variables have been taken. One level from each variable has been taken as the base. The base case is always taken alphabetically, or in the case of numeric factors, in ascending order. So here Pclass1 has been taken as the base case for class and female has been taken as the base case for sex.
If you look at the first column, you will observe that all the significant variables are negatively related with survival.
The AIC is the Akaike Information Criterion. You use this while comparing multiple models; the model with the lower AIC value is better.
Null deviance relates to a model which has only the intercept, and residual deviance relates to our model with the independent variables. These are actually chi-squared statistics with the corresponding degrees of freedom.
We basically use these to test the null hypothesis that the model with only the intercept and the model with the independent variables added fit the data equally well.
If the resulting p-value is significantly small, we can reject the null hypothesis, which is what we do in our case.
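One way to compute the p-value for this test, as a sketch:

# The drop in deviance is chi-squared distributed with the drop in df
pchisq(Model$null.deviance - Model$deviance,
       df = Model$df.null - Model$df.residual,
       lower.tail = FALSE)   # a tiny p-value rejects the null hypothesis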
Next we are going to test our model on the testing data set. Before we do that, I request you to learn some basics about the ROC (Receiver Operating Characteristic) curve. This video presents it in a very clear manner -
Step 6 : Predict survival for the testing data set
> test$Prid <- predict(Model, test, type = "response")
We have created a new variable in the test data set and used the predict() function to predict the probability of survival based on our model.
Step 7 : Let's look at the ROC curve
We need the ROCR package for this. All evaluations with ROCR start by first creating a predictor object; for this we have used the prediction() function. As you can see, the inputs are the predictions of our model and the actual results, so we are scoring the performance of the predicted values against the actual values.
Next we have used the performance() function, which is used to extract various performance characteristics from the predictor object, which in our case is score. With this we first fetch the AUC, which comes to around 0.89, which is a good value.
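A sketch of those two calls (Prid is the prediction column created in Step 6):

library(ROCR)

# Predictor object: model scores paired with the actual outcomes
score <- prediction(test$Prid, test$Survived)

# Fetch the area under the ROC curve from the predictor object
auc <- performance(score, "auc")
auc@y.values[[1]]   # around 0.89 for this model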
Let's have a look at the confusion matrix before proceeding ahead.
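As a sketch, at a default 0.5 cut-off the confusion matrix can be tabulated like this:

# Rows: actual outcome; columns: predicted outcome at a 0.5 cut-off
table(Actual = test$Survived, Predicted = as.integer(test$Prid > 0.5))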
Next, we have plotted the curve between the TPR and FPR at various cut-offs, which is essentially your ROC.
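A sketch of the plot:

# TPR vs. FPR across all cut-offs, with selected cut-off values printed on the curve
perf <- performance(score, "tpr", "fpr")
plot(perf, colorize = TRUE, print.cutoffs.at = seq(0.1, 0.9, by = 0.1))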
As you can see, we are getting a good cut-off at around 0.4.
You can also visualize the model by the following simple logic -
The output of this will be -
I leave it up to you to assimilate this :)
Step 8 : Choosing the cut off. (Almost there) *Phew*
Next, I have taken a cut-off of 0.4. The floor() function rounds its input down to the nearest integer. And then I have created the confusion matrix.
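The exact expression from the original screenshot isn't shown here, but one version of the floor() trick looks like this (Pred.0.4 is my own name):

# floor(p + 0.6) equals 1 exactly when p >= 0.4, i.e. a 0.4 cut-off
test$Pred.0.4 <- floor(test$Prid + 0.6)
table(Actual = test$Survived, Predicted = test$Pred.0.4)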
Output -
So, the TPR at a cut-off of 0.4 is 78.2% and the FPR is 21.5%.
Again, TPR is how often you predict survival among the actual survivors. FPR is how often you predict non-survivors as survivors.
Let's take another cut-off.
So, at a cut-off of 0.52 we get -
The TPR now is 73% and the FPR is 12.5%.
So, you get the point: at each cut-off the TPR and FPR will vary. So how do you choose the cut-off? It is you who decides that, based on what you are looking for. Do you want a high TPR at the expense of a high FPR, or a low FPR at the expense of a low TPR? You gotta take the call, bro.
If you have stuck around to the bottom, thank you. I hope I did not waste your time. I will sincerely appreciate it if you share your thoughts on the post.
Till then,
Stay Awesome!