Zero-Inflated Poisson: Regression Models
Krystal Cooper
Product & Program Manager | Creative Practitioner | Consultant | Producer | Futurist | Innovator | Educator
A few weeks ago, I promised I would start chronicling my adventures in statistics. To be fair, I have only taken one statistics course as an undergraduate. My tech-speak is usually on point, but my stat-speak needs some work. That introductory course may have given me a bit of the lingo and concepts, but nothing like what I'm learning in graduate school.
One of the statistical techniques I have been learning to use is regression analysis. If you are like me, you had no prior reason to use statistical models but had a basic idea of how they worked in theory. Now I'm actually getting some practical experience with when you can use these kinds of models in forecasting.
My first step was getting a handle on dependent and independent variables. Using statistics (at least in my mind) has to be a daily thing until you get the hang of it, and the only way to do that is to layer each step one by one. Lately we have been looking at data sets with a high number of zero values, also called excess zeros. I'm still learning how to select the best model for these kinds of data, specifically data that follow a zero-inflated probability distribution. I think once I get the hang of cross-validation I will be able to choose and eliminate models faster. "The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model."
We were asked to try Poisson regression (also known as a log-linear model), which is a type of generalized linear model (GLM). One of my challenges so far has been determining the correct offset through trial and error, as well as figuring out whether the data are overdispersed, meaning the variance is larger than the mean. I'm not ashamed to say I keep a little notebook with all of these terms and various models. Learning this in one semester can make your head spin.
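For anyone keeping their own notebook of terms, here is a rough sketch of what "overdispersed" and "offset" mean in practice. It is plain Python rather than R, the counts are made up for illustration, and the function names are hypothetical, so treat it as a quick sanity check and not as a formal test.

```python
import math
from statistics import mean, variance

# Hypothetical count data (made up for illustration), with plenty of zeros.
counts = [0, 0, 0, 1, 0, 3, 0, 0, 7, 0, 2, 0, 0, 5, 0]


def dispersion_ratio(y):
    """Sample variance divided by sample mean.

    A Poisson distribution has variance equal to its mean, so a ratio
    well above 1 hints at overdispersion.
    """
    return variance(y) / mean(y)


def expected_count(exposure, intercept, slope, x):
    """Mean of a Poisson log-linear model with an offset.

    log(mu) = log(exposure) + intercept + slope * x
    The offset log(exposure) has its coefficient fixed at 1, so the
    expected count scales in proportion to the exposure.
    """
    return math.exp(math.log(exposure) + intercept + slope * x)


ratio = dispersion_ratio(counts)
print(f"dispersion ratio: {ratio:.2f}")

# Doubling the exposure doubles the expected count; the coefficients
# here (0.5 and 0.1) are arbitrary illustration values.
print(expected_count(10, 0.5, 0.1, 2), expected_count(20, 0.5, 0.1, 2))
```

A ratio far above 1 does not say why the data are overdispersed. Excess zeros are one possible cause, and that is the case a zero-inflated model targets.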
If you are still reading, and I hope you are, I can tell you that this time last year I had never heard of a Poisson distribution. It was named after Siméon Denis Poisson, a French mathematician. I knew that le poisson meant fish in French and that was about it. Along the way I've also been using RStudio, which is an adventure all by itself. So now I have my buddy Monsieur Poisson, his distribution model, and a zero-inflated data set, and with a wave of my magical statistics wand you get Zero-Inflated Poisson (ZIP).
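To make the wand-waving a little more concrete, here is a minimal sketch of the ZIP probability mass function using only the Python standard library. It assumes the usual two-part setup: with probability pi an observation is a "structural" zero, and otherwise it is drawn from an ordinary Poisson with rate lam. The names zip_pmf, lam, and pi are my own labels for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability of seeing exactly k events."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson probability of seeing exactly k events.

    P(Y = 0) = pi + (1 - pi) * exp(-lam)
    P(Y = k) = (1 - pi) * Poisson(k; lam)   for k >= 1
    """
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


lam, pi = 2.0, 0.3  # arbitrary illustration values

# The zero-inflated model puts more probability on zero than plain Poisson.
print(poisson_pmf(0, lam), zip_pmf(0, lam, pi))

# The probabilities still sum to (almost exactly) 1 over a long range of k.
total = sum(zip_pmf(k, lam, pi) for k in range(50))
print(total)
```

Setting pi to 0 turns the ZIP back into a plain Poisson, which is a handy way to check that the extra zeros are the only thing the first part of the model adds.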
Thanks to the magic of super awesome data science mentors and Google, I have found dozens of tutorials on how to use this in R. The folks over at UCLA have great examples. In my next blog post I will talk about p-values and goodness of fit, for two reasons: 1) it's good practice, because I typically don't go on about statistical models while I'm having an In-N-Out burger, and 2) I'm learning that the more I blog and speak about statistics, the more confident I feel trying new machine learning techniques.