Applying Linear Regression to paid marketing channels to predict leads

There are innumerable ways to solve any problem; it is down to us to decide which approach to take. We choose how to tackle the problem so that we deliver the best results, while continuously improving old processes and bringing in newer, more accurate ones.

I found myself in a similar situation when one of my clients asked me, "Mayank, can you predict how many leads I will receive on my website next month?" My initial approach was to simply average the number of leads per day to provide an answer; however, deep down I knew there was a better method that would allow me to calculate a more accurate answer.


·        Scenario: 3 Paid Channels (Paid Search, Paid Social, Partners)

·        Historic Data: Available since January 2017. (Key Metrics - Clicks, Impressions, Cost, Leads etc.)

·        Goal: To forecast the number of leads for next month, for each channel

·        Methods: The forecasting methods I looked into initially were Moving Average, Exponential Smoothing, Linear Regression and Multivariate Regression. I eliminated the methods that were not relevant and went ahead to test the remaining ones

Initial Checkpoints - I took up one channel at a time (Paid Search in this case) and pulled out the lead data at daily granularity since January 2017. It looked somewhat like this:

[Image: daily leads for Paid Search since January 2017]

I analysed the data to check for trends and seasonality, but did not find anything significant. Then came Linear Regression, the most basic supervised learning algorithm in Machine Learning. The first thing to be clear about while implementing Linear Regression is what you want to predict (leads, in this case, the response variable) and which features (or factors) may affect it. Selecting features is perhaps the most important task in this model; here, Spend was one of the most important features. I investigated additional features (such as Clicks, Impressions etc.) using a multivariate model; however, it started exhibiting multicollinearity, so I had to drop the multivariate approach.
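Both points, the single-feature fit and the multicollinearity problem, can be sketched in a few lines. The data below is made up purely for illustration (the real channel data is not shown in the article), but the pattern is the same: Clicks track Spend almost perfectly, which is exactly what makes a multivariate model unstable.

```python
import numpy as np

# Made-up daily Paid Search data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
spend = rng.uniform(100, 1000, 90)             # daily spend
clicks = spend * 2.5 + rng.normal(0, 30, 90)   # clicks move almost in lockstep with spend
leads = 0.05 * spend + rng.normal(0, 3, 90)    # leads driven mainly by spend

# Simple linear regression: leads ≈ slope * spend + intercept (least squares)
slope, intercept = np.polyfit(spend, leads, 1)
print(f"leads ≈ {slope:.3f} * spend + {intercept:.2f}")

# Multicollinearity check: Spend and Clicks are almost perfectly correlated,
# so including both in a multivariate model adds instability, not information
corr = np.corrcoef(spend, clicks)[0, 1]
print(f"corr(spend, clicks) = {corr:.3f}")
```

A pairwise correlation close to 1 between two candidate features is the simplest warning sign; variance inflation factors are the more formal diagnostic.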


Just for the sake of example, imagine that you have to predict the price of a house; the factors that first come to mind are area, number of rooms, age etc. Similarly, for leads (or any other response variable), there can be multiple predictor variables affecting it. For paid channels, you can apply the same method to each channel to predict its outcome individually, then add the results up to predict the outcome as a whole.

Fitting Historic Data - Fitting historic data to a learning algorithm can show different outcomes for different combinations of date range and granularity (as happened in this case). Here, quarterly data at daily granularity gave the best result, because the data values changed with each combination, which changed the scatter plot and hence the equation of the best-fit line.

Removing Outliers: Outliers can sometimes be fatal in regression problems. They tend to skew your best-fit line, which in turn skews the prediction: an outlier in your data pulls the line towards itself, distorting both the equation and the forecast.

An example of an outlier before and after removal is shown below. You can see that when we removed the outlier, the R-squared value increased from 0.96 to 0.98 and the equation of the line changed slightly. This increase can be larger depending on the variability of the outliers.

In my dataset, there were many instances of such outliers. Such variability is common in paid marketing channels, where you can sometimes see an unusual number of leads/events at a specific spend. After removing outliers at 1, 2 and 3 standard deviations from the mean, I found an interesting insight: in multiple scenarios, and for every month, the best prediction came from removing outliers at 2 standard deviations from the mean. So a checkpoint was made: whenever we pull historic data, it will be filtered and outliers removed at 2 SD from the mean, and only then will we use the final equation to predict leads. Here are a few of the results I got:
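The 2-SD filtering step can be sketched as follows, again on made-up data; the single injected outlier and the R-squared values here are illustrative, not the article's actual figures:

```python
import numpy as np

# Made-up spend/leads data with one unusual day (hypothetical numbers)
rng = np.random.default_rng(0)
spend = rng.uniform(100, 1000, 60)
leads = 0.05 * spend + rng.normal(0, 2, 60)
leads[10] = 120  # outlier: an unusual burst of leads on one day

def r_squared(x, y):
    """R-squared of a straight-line least-squares fit."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()

r2_before = r_squared(spend, leads)

# Keep only the days whose lead count lies within 2 SD of the mean
mean, sd = leads.mean(), leads.std()
mask = np.abs(leads - mean) <= 2 * sd
r2_after = r_squared(spend[mask], leads[mask])

print(f"R-squared with outlier: {r2_before:.3f}, after 2-SD filter: {r2_after:.3f}")
```

Refitting on the filtered points yields a noticeably higher R-squared, which is the before/after effect described above.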

[Image: prediction results after removing outliers at 2 SD from the mean]

 

I decided to go ahead with the historic data for the same quarter last year, removing outliers beyond 2 standard deviations. For example, to predict leads for July 2018 (July lies in Q3), take Q3 2017 as the historic data set. The idea was to use the corresponding quarter because it gave the best outcome across all cases combined.
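The quarter-selection rule can be expressed in a few lines of pandas; the DataFrame here is only a stand-in for the real channel history:

```python
import pandas as pd

# Stand-in for the real daily channel history (contents are placeholders)
dates = pd.date_range("2017-01-01", "2018-06-30", freq="D")
history = pd.DataFrame({"date": dates, "leads": range(len(dates))})

# To predict a month, train on the same quarter of the previous year
target_month = pd.Timestamp("2018-07-01")      # predicting July 2018 (Q3)
train = history[(history["date"].dt.quarter == target_month.quarter)
                & (history["date"].dt.year == target_month.year - 1)]

# prints: 2017-07-01 to 2017-09-30 (Q3 2017, 92 days of training data)
print(train["date"].min().date(), "to", train["date"].max().date())
```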


Results: The prediction gave me an average of 95% accuracy, rising to 98% for February and September 2018. That means if actual leads were 1,000, our prediction was on average 950. Not perfect, but better than the averaging method we were applying before. Seeing how much the choice of historic data mattered was important in this case; the following are prediction accuracies for different date ranges –


Time Period (Daily granularity) -

·        Last 3 months - 78% accuracy

·        2017 YTD - 84% accuracy.

·        Last 3 months + Same Quarter Last Year (If you are predicting for January 2018, take Q1 2017) - 87.5%

·        Same quarter last year only - 95% accuracy.
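The article does not spell out the accuracy formula, but the 1,000-versus-950 example above is consistent with one minus the absolute percentage error. Here is that reading as a sketch (an assumption, not the author's stated definition):

```python
def prediction_accuracy(actual, predicted):
    # Assumed definition: accuracy = 1 - absolute percentage error
    return 1 - abs(actual - predicted) / actual

print(f"{prediction_accuracy(1000, 950):.0%}")  # → 95%
```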


 

Model Diagnosis - Checking the Model's Assumptions - Many times we build a model without knowing whether we have built it correctly. This is where assumptions come into play: model checking means verifying that the assumptions of the specific model hold. This matters because if your model satisfies all the assumptions, it might be correct; however, this is a necessary but not a sufficient condition. For convenience, here are the assumptions of Linear Regression -

·        There should be no heteroscedasticity (error terms should have constant variance).

·        The mean of the residuals (error terms) should be zero.

·        The relationship between the response and predictor variables should be linear and additive.

·        The error terms should follow a Normal Distribution.
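These assumptions can be checked from the residuals. Below is a rough sketch on simulated, well-behaved data using numpy and scipy; formal tests such as Breusch-Pagan (available in statsmodels) would be the next step for heteroscedasticity:

```python
import numpy as np
from scipy import stats

# Simulated, well-behaved regression data (illustrative only)
rng = np.random.default_rng(1)
spend = rng.uniform(100, 1000, 200)
leads = 0.05 * spend + rng.normal(0, 2, 200)

slope, intercept = np.polyfit(spend, leads, 1)
fitted = slope * spend + intercept
residuals = leads - fitted

# 1) Mean of residuals should be ~0 (least squares with an intercept guarantees this)
print("mean residual:", round(residuals.mean(), 6))

# 2) Normality of error terms: Shapiro-Wilk test (large p = no evidence against normality)
_, p_normal = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", round(p_normal, 3))

# 3) Rough heteroscedasticity check: |residuals| should not trend with fitted values
r, _ = stats.pearsonr(fitted, np.abs(residuals))
print("corr(fitted, |residual|):", round(r, 3))
```

Plotting residuals against fitted values is the visual counterpart of these checks and usually reveals problems faster than any single statistic.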

 

 

Importance of Visualizing Data - There are many tools that give you instant answers to the regression problems above (Excel, R, Tableau or any other BI tool). You just feed in the data, call a function and run a few commands to get the answer. However, we really need to dive deep and understand what we are trying to do. Even in a simple regression problem, visualizing the data will tell you what you should fit: a straight line, a curve, an exponential line or something else. Sometimes a tool (Excel, in this case) will report a good R-squared value and you might get a gut feeling that you should go ahead with linear regression, but is it really so? This can be deceptive: a high R-squared value does not always mean the model is a good fit.

 

For example, the graph below shows an R-squared value of 84%, while from the plot we can clearly see that a straight line is not a good fit for this data (an exponential or polynomial curve would fit better).


[Image: scatter plot where a straight line gives an R-squared of 84% but is visibly a poor fit]
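The same effect is easy to reproduce: fit a straight line to clearly quadratic data and the R-squared still comes out high, even though the line is visibly the wrong shape. The data below is synthetic, chosen only to illustrate the point:

```python
import numpy as np

x = np.linspace(1, 10, 50)
y = x ** 2  # clearly curved data, no noise

def r_squared(x, y, degree):
    """R-squared of a polynomial least-squares fit of the given degree."""
    pred = np.polyval(np.polyfit(x, y, degree), x)
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

r2_linear = r_squared(x, y, 1)  # high R-squared despite being the wrong model
r2_quad = r_squared(x, y, 2)    # near-perfect fit with the right shape
print(f"linear R2 = {r2_linear:.3f}, quadratic R2 = {r2_quad:.3f}")
```

The straight line scores above 0.9 here, yet one glance at the plotted points would show it is the wrong model, which is exactly why visualization matters.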


Quick Recap:

·        Linear Regression - Trying to understand the application.

·        How to actually build a model from scratch - fitting historic data, identifying features, deciding how many features, finding the best fit for the data.

·        Finding points 1, 2 and 3 standard deviations from the mean, removing outliers, getting the new equation.

·        Results

·        Checking for Assumptions.



Sometimes while solving such scenarios (and certainly the more advanced ones), we get stuck and feel that there is no solution. Or we feel we have found one, only to discover there are many more cases we have to look into.

However, the key is to be persistent. Machine Learning models are like swimming: you have to get in and apply them to really know them; theoretical knowledge and tools alone will not take you far.

