Article 1 - The Predicament of Predictors
"Minimizing error is the holy grail, acknowledging the fact that there will always be some amount of irreducible error."
I want to begin with this simple but profound statement, which forms the foundation of my article series titled "The Predicament of Predictors."
Before we continue: these views are personal. Just like in statistics, please allow for a margin of error, though efforts have been made to minimize it. These articles are intended to spark the interest of general readers in statistics, machine learning, and AI.
In this series of articles, we will delve into some interesting stories related to statistics, machine learning, and their applications in AI that will hopefully pique the reader's interest in these fields. We will touch upon core concepts just enough for the reader to become familiar with the common lingo used in ML, going a level deeper where needed while keeping the content light enough to be enjoyable and understandable.
I will use a minimal number of mathematical equations, avoiding the extensive textbook formulas found in any standard statistics or ML course, and focus on explanations that should be sufficient to grasp each concept.
We will start with an example to illustrate the context of the error in question and how it can be minimized in the scenarios covered. Once we have built up the essential vocabulary, we will explore one of the most exciting competitions in Machine Learning, which featured a million-dollar prize.
Let's start with a simple, relatable real-life example.
Imagine you’ve opened an ice cream shop in a large multi-story mall with high foot traffic (the reader can substitute the business with one of their choice under similar conditions). The challenge is that there are massive food courts and many other ice cream shops, so people are spoiled for choice. You want to attract visitors to your shop, break even on your investment, and eventually turn it into a profitable business. You consider investing in advertising and notice that there are prime digital advertising screens at various entrances of the mall.
Impact of a single predictor
You want to understand the impact of advertising spend on sales. The shopping mall's pre-sales commercial and advertising team, which manages the digital ad screens, shares with you the sales data from similar businesses that advertised with them. The blue dots in the graph below show the data: Advertisement Spend (we will call this a predictor, or independent variable; it is also often referred to as a "feature") and Sales (we will refer to this as the response variable, or dependent variable, since it depends on the predictors).
[Figure: scatter plot of Advertisement Spend (X) vs. Sales (Y)]
Linear Regression
The pattern appears to be linear: sales increase as advertising spend increases. This known data can be used as "training data" to train the model to make predictions, and the model can then be evaluated on "test data" (data that the model has not yet seen and has not been trained on) to assess its accuracy.
We need to select a model that can estimate sales with minimal variation or error (i.e., the difference between the actual value and the estimated value of sales should be as small as possible). If we want to draw a line (known as the regression line) through this seemingly linear relationship to help us predict sales from advertising spend, then we are looking for the slope and intercept that minimize that error.
The section below delves deeper into some math, but feel free to just go through the explanations ignoring the equations.
Since we are talking about a regression line, let's go back to our geometry classes, where we learned that the slope-intercept equation for a non-vertical line is: Y = mX + b.
Where b is the y-axis intercept and m is the slope. m and b are known as coefficients or parameters.
For the scenario under discussion, we will use a linear regression model to find the regression line, represented as follows:

Y = β0 + β1X + ε

Here β0 is the intercept, β1 is the slope, and ε is an error term that captures what the line alone cannot explain.
In our example, Y (the dependent variable) represents the actual sales value (known to us) based on X, the ad spend (the independent variable). These are the blue dots in the graph above. The values chosen for the coefficients determine how close we get to the actual value, and the error term accounts for the remaining gap to the actual value Y.
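To make this concrete, here is a minimal sketch of fitting such a line, using Python with NumPy and scikit-learn; the numbers are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: weekly ad spend and sales, both in thousands of dollars
ad_spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
sales = np.array([4.1, 5.9, 8.2, 9.8, 12.1, 13.9])

model = LinearRegression()
model.fit(ad_spend, sales)  # ordinary least squares under the hood

print("slope (estimate of the coefficient on X):", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted sales at ad spend of 7:", model.predict([[7.0]])[0])
```

The fitted slope and intercept are exactly the coefficients m and b (or β1 and β0) discussed above.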
Understanding Errors - RSS, MSE and RMSE
In the simplest of words, an error is the difference between the actual value and the estimated value. To indicate that a value is an estimate, we mark the variable with an accent called the hat symbol. The equation for estimated values is represented as:

ŷ = β̂0 + β̂1x

(where β̂0 and β̂1 are our estimates of the coefficients β0 and β1)
For the ith estimate, the equation would be:

ŷi = β̂0 + β̂1xi
The residual error e for the ith value is the difference between the ith actual value, yi, and the ith estimated value, ŷi, as shown below:

ei = yi − ŷi
In the Ordinary Least Squares (OLS) method, which we will use to determine the best fit of the regression line, we define the residual sum of squares (RSS) as:

RSS = e1² + e2² + … + en² = Σ (yi − ŷi)²
What does the above equation mean? Across all our estimates, we need to find the coefficient values for which the differences between actual and estimated values (the residual errors) are collectively smallest. We square the residual errors and add them up; the combination of coefficients that yields the smallest total is what we are after in the OLS method.
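To see the idea at work, here is a small sketch (hypothetical numbers) comparing the RSS of two candidate lines; OLS is simply the procedure that finds the coefficients with the smallest such total:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.1, 5.9, 8.2, 9.8, 12.1])  # actual values

def rss(b0, b1):
    residuals = y - (b0 + b1 * x)  # e_i = y_i - y_hat_i
    return np.sum(residuals ** 2)

print("RSS for the line y = 2 + 2.0x:", rss(2.0, 2.0))  # close fit, small RSS
print("RSS for the line y = 2 + 1.5x:", rss(2.0, 1.5))  # worse fit, larger RSS
```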
Can you guess why we square the errors and then add them up, rather than simply adding the residual errors? We will find out in a while.
Note that we called out earlier that the accuracy of the coefficients or parameters determines how close we get to the actual value. We also need to consider the standard error of the coefficients, as results will vary with repeated sampling when different data samples are chosen. Once we know the standard error, we can compute a confidence interval at a chosen confidence level (say 95%): a range that would contain the true coefficient value in that percentage of repeated samples. The details are a discussion for another day, but be aware that we need to understand and tweak several aspects of the predictors, including the coefficients. (Additional reading: readers who wish to explore further can look into the equations for finding the values that minimize RSS for the parameters, in our example β0 and β1, and the equations for determining confidence intervals.)
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
We want to minimize the overall error, which can be represented as the Mean Squared Error (MSE) or the Root Mean Squared Error (RMSE).
The Mean Squared Error is the average of the squared differences between the observed actual values (the dots in the graphs above) and the predicted values (the regression line or the plane in the graphs above). It gives an idea of how far the predictions are from the actual data points on average.
The Root Mean Squared Error is the square root of the MSE. It provides a measure of the average magnitude of the errors in the same units as the dependent variable, making it easier to interpret.
Why do we take the sum of the squared errors, and not simply the sum of the errors? Because positive errors (points above the line or plane) and negative errors (points below it) could cancel each other out in a simple sum or average. Squaring solves this problem.
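A short sketch, again with made-up numbers and using scikit-learn's metrics, shows how RSS, MSE, and RMSE relate to each other:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([4.1, 5.9, 8.2, 9.8, 12.1])
predicted = np.array([4.0, 6.0, 8.0, 10.0, 12.0])  # values from a fitted line

residuals = actual - predicted
rss = np.sum(residuals ** 2)                  # residual sum of squares
mse = mean_squared_error(actual, predicted)   # RSS divided by n
rmse = np.sqrt(mse)                           # back in the units of the response

print(f"RSS = {rss:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```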
Impact of multiple predictors
Now, as you gain confidence in the impact of advertising through the in-mall digital screens, you consider advertising on social media as well. You gather data on your ad spend across both channels and monitor sales. Running this data through the OLS model, you can determine the impact of ad spend across multiple channels on sales.
In the graph above we consider two predictors: digital-screen ad spend and social media ad spend. The least squares model now gives us a plane, fitted through the points so that the residual sum of squares is smallest.
Now let's increase the complexity. Continuing with our earlier model, we add more predictors that could possibly have an effect on the response, the dependent variable Y (for example, ice cream sales), giving us the model represented by:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε
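The earlier fitting sketch extends directly to this case (data again hypothetical); scikit-learn estimates one coefficient per channel:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical weekly spend per channel: [digital screens, social media]
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 0.8], [4.0, 2.0], [5.0, 1.5]])
sales = np.array([5.0, 7.2, 8.9, 12.0, 13.1])

model = LinearRegression().fit(X, sales)
print("coefficients (one per channel):", model.coef_)
print("intercept:", model.intercept_)
```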
In our example, we also want to decide whether there could be a revenue uplift from advertising across additional channels, such as the local newspaper most widely read in the region or suburb where your shop is located. The analysis becomes more complex because we now have multiple predictors to consider. For instance, is advertising on in-mall digital screens enough? Do social media and newspaper ads make a significant impact, and if so, which combination of ad spend is optimal? Should you choose in-mall screens and social media, or in-mall screens and newspaper? What about the timing of the ads (during summer or winter, on weekends or weekdays)? How do outliers like a nearby rock concert or the festive season affect the results? What is the correlation between these channels and their impact on sales? How can you avoid wasting money on ineffective media spend? In all of these scenarios, you want to minimize prediction error and get the most value for your money.
As we see, the challenge of dealing with multiple predictors quickly becomes complex. There are many questions to address: Are the predictors uncorrelated (which would make them easier to manage)? If they are correlated, how much complexity does that add? The variance of the coefficient estimates could increase dramatically, and a change in any one predictor could quickly alter the results, making it hard to rely on a model that worked earlier with fewer predictors (independent variables, or features). And how do you handle non-linearity in the data and the impact of outliers?
Look at the sample output of a simulation of sales performance considering three channels, two online and one offline (newspaper), and the impact of each. Considered all together, newspaper shows the least impact on sales for the dataset in question. In a statistics or ML class we would be asked to analyze the results; of particular interest to us are the coefficients and the p-values. Do you notice anything different for newspaper spend versus the others?
Machine learning models that analyze multiple businesses, factors impacting them, and effectively map more predictors can identify growth trends, provide risk analysis, and uncover new opportunities across various fields, from business to governance to town planning.
In digital marketing, particularly for businesses with an online presence and significant traffic, ML algorithms are commonly used for personalization, automated targeting, recommendations, reducing customer churn, deflecting calls from call centers by pre-emptively proposing solutions, lowering operational costs, improving NPS, optimizing advertising, reducing ad spend wastage, and effective bidding in ad marketplaces, among other applications. Several well-known algorithms can be leveraged, including logistic regression, random forests, support vector machines, clustering, bagging, boosting, and more.
Multiple algorithms and models
A series of algorithms and models come into play to determine the best fit for the data. A model is built from algorithms, and when the model is applied to a specific problem, it fits the data to reach the outcome.
Various measures and methods can be used to reach the desired outcome: F-statistics, subset regression, p-values and variable significance, forward or backward stepwise selection, and model selection criteria such as AIC, BIC, adjusted R², and cross-validation.
Fortunately, we don't need to calculate these metrics by hand or solve complex equations. Many well-known models and algorithms are now available in easy-to-use software libraries that simplify the process of making predictions.
An analysis of ad spend across multiple channels and its impact on sales might look something like this (using Python, R, GRETL, or your preferred software). Don't worry about the various statistics in the results below; just understand that many measures are available to analyze or describe the predictions made and to test the model.
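As a sketch of what that could look like in Python, here is a simulation with synthetic data using the statsmodels library; its summary output includes coefficients, standard errors, p-values, R², adjusted R², the F-statistic, AIC, and BIC. The data-generating numbers below are invented so that newspaper spend contributes nothing, mirroring the scenario described above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100

# Synthetic weekly ad spend per channel, in thousands of dollars
df = pd.DataFrame({
    "screens": rng.uniform(1, 10, n),
    "social": rng.uniform(1, 10, n),
    "newspaper": rng.uniform(1, 10, n),
})
# Sales driven by screens and social only; newspaper has no real effect
df["sales"] = 3 + 2.0 * df["screens"] + 1.2 * df["social"] + rng.normal(0, 1.5, n)

X = sm.add_constant(df[["screens", "social", "newspaper"]])
results = sm.OLS(df["sales"], X).fit()
print(results.summary())  # note the large p-value on the newspaper coefficient
```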
A pairwise scatter plot for our example, measuring the impact of spend on the various ad channels on sales, may look like the following diagram, helping us make a decision:
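One way to produce such a plot is with the seaborn library, reusing the hypothetical DataFrame df from the previous sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of every pair of variables, with distributions on the diagonal
sns.pairplot(df[["screens", "social", "newspaper", "sales"]])
plt.show()
```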
Once we have a model that we consider the best fit for the training data, we apply it to the test data to see how close the results are to the expected or actual values. This helps ensure that the predictions fall within our expectations or within the error tolerance limits we’ve set, allowing us to select the model that best fits the data.
Model selection, goodness of fit, and prediction
Before we move on to the next section, it's important to note that the methods we used to measure error, like MSE and RMSE, may not apply to all problems. We considered quantitative or numerical data in our examples, but data can take many forms that are not numerical. For instance, in classification tasks we may have discrete values, such as a set of labels under which we categorize information based on qualification criteria that follow no numerical sequence or continuous order. The evaluation may then be based on the probability of correctly classifying information under a certain category. In handwriting recognition, satellite image analysis, genetic sequencing, or survival analysis, for example, different models may best fit the data, each with different methods to evaluate model errors. This calls for a deeper discussion and will be the subject of the next article in the series, which focuses on logistic regression and where we will see the importance of new methods such as the likelihood function.
To determine the right model, the best fit for the problem at hand, we need to carry out hypothesis testing, run experiments on training data using different algorithms that help build candidate models, and then perform model selection by measuring goodness of fit on test data.
There are techniques to evaluate a model's accuracy beyond testing on an initial split of training and test data. Cross-validation splits the data into many folds, which are used interchangeably and iteratively as training and test data, exposing the model to more diverse data sets (as it would see in real life) and helping it make accurate predictions.
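A minimal cross-validation sketch with scikit-learn, again reusing the hypothetical df from earlier: five folds, each taking a turn as the test set, scored by RMSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = df[["screens", "social", "newspaper"]]
y = df["sales"]

# scikit-learn reports errors as negative scores, so we flip the sign
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("mean RMSE:", -scores.mean())
```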
With this, we’ve covered several core concepts in a straightforward manner. These concepts (training data set, test data set, errors, MSE, RMSE, Cross Validation) should help us understand the next story better. At the end of the article I am also sharing a summary of the tools of the trade I used for linear regression and graphs.
The Million Dollar Prize for Improving Predictions (Lowering RMSE from 0.9525 to 0.8572 or Below)
With the fundamentals covered, we are now ready to dive into the story of an exciting competition that gained immense popularity and is often cited in machine learning courses as a prime example of how technological innovation, collaboration, and management excellence can come together.
The Challenge
This competition was launched by Netflix and started on October 2, 2006, with a planned end date no later than October 2, 2011. The challenge? Improve Netflix’s Cinematch algorithm (used for making movie recommendations) by 10%—specifically, reduce the RMSE (Root Mean Square Error) from 0.9525 to 0.8572 or lower. The challenge of predictors and making accurate predictions is evident here, as Netflix estimated that it could take years to achieve an RMSE of 0.8572.
In simpler terms, what was the goal? It was to create more accurate movie recommendations personalized for each individual based on their movie preferences, considering user reviews and details of other movies similar to what the user has liked. Solving a problem like this in a few years, even with some of the best minds worldwide, highlights the challenges and excitement inherent in statistics, machine learning, and AI!
One of the most compelling messages from the Netflix competition was:
“So if you know (or want to learn) something about machine learning and recommendation systems, give it a shot. We could make it really worth your while.”
The grand prize was a million dollars, with an annual progress prize of $50,000 for the team that made the best improvement to the algorithm's accuracy that year.
The Dataset
So what were the dataset and its accompanying instructions like? From the competition description:
Let's fast-forward to the run-up to the results.
The Teams
Of the 41,305 teams from 186 countries that initially registered, 5,169 different teams went on to make 44,014 valid submissions. Universities, private corporations, and individual teams participated with great enthusiasm.
Within the first two months, many teams got to the halfway mark, achieving about a 5% improvement in RMSE (against the 10% target), but beyond that it was an uphill climb.
As months turned into years, teams not only competed but also formed alliances and learned from each other, highlighting the collaborative spirit in the field of machine learning. It's impressive to see how these partnerships and shared knowledge contributed to advancements beyond the competition itself, with diverse teams working together.
While most teams kept their algorithms and models secret, on December 11, 2006, a participant writing under the pseudonym Simon Funk openly published an algorithm to reduce RMSE, under a witty title: Netflix Update: Try This at Home. The singular value decomposition (SVD) based algorithm worked well in the early days for quick gains, and many participants built upon it to further their own work. Take a look at the article here: Netflix Update: Try This at Home
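The idea behind Funk's approach was to factorize the huge, mostly empty user-by-movie rating matrix into low-dimensional user and movie factor vectors, learned by stochastic gradient descent on only the observed ratings. Below is a heavily simplified sketch of that idea; the toy ratings, factor count, learning rate, and regularization are all arbitrary choices for illustration, not Funk's actual settings (his version also included refinements such as bias terms):

```python
import numpy as np

# Toy observed ratings: (user_id, movie_id, rating on a 1-5 scale)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 4.0), (2, 2, 5.0), (3, 0, 2.0), (3, 2, 4.0)]

n_users, n_movies, k = 4, 3, 2            # k latent factors per user and movie
rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (n_users, k))      # user factor matrix
Q = rng.normal(0, 0.1, (n_movies, k))     # movie factor matrix
lr, reg = 0.05, 0.02                      # learning rate and regularization

for epoch in range(500):
    for u, m, r in ratings:
        err = r - P[u] @ Q[m]             # error on this observed rating
        pu = P[u].copy()                  # keep the pre-update user factors
        P[u] += lr * (err * Q[m] - reg * P[u])
        Q[m] += lr * (err * pu - reg * Q[m])

predicted = np.array([P[u] @ Q[m] for u, m, _ in ratings])
actual = np.array([r for _, _, r in ratings])
print("training RMSE:", round(float(np.sqrt(np.mean((actual - predicted) ** 2))), 4))
```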
The Winning Team
It took nearly 3 years to reach the winning RMSE.
The competition was won by BellKor's Pragmatic Chaos, an alliance that had evolved to comprise the teams BellKor, Big Chaos, and Pragmatic Theory. Hence the team name, formed from parts of each constituent team's name. The team was handed a check for one million US dollars on 21 September 2009.
They used collaborative filtering, singular value decomposition, and a blend of a mind-boggling nearly 800 different algorithms to reach a 10.06% improvement over the Cinematch RMSE!
The team that came second (and did not win any cash prize) was The Ensemble, which also reached the winning RMSE but submitted its entry 20 minutes after BellKor's Pragmatic Chaos.
I think all the teams on this leaderboard were winners as they could apply the fruits of their research and knowledge across so many real life scenarios.
A New (Unexpected) Predicament: De-anonymization of an Anonymized Dataset
Interestingly, a second round of the competition was initially planned, and the teams were excited about it. However, although the data had been shared in an anonymized form, the predicament of predictors surfaced again when a pair of researchers took the Netflix anonymized dataset and de-anonymized part of it, accurately mapping records back to some of the people! Seemingly on the basis of this finding and its impact on user privacy, the next round of the competition was not held. This predicament of predictors was not expected as part of the competition, but was a real-world problem that the research surfaced. If you are interested, read the paper here: Arvind Narayanan and Vitaly Shmatikov, Robust De-anonymization of Large Sparse Datasets, IEEE Symposium on Security and Privacy, 2008.
This competition was a game changer, sparking a greater passion for machine learning and leading to the development of algorithms and models with many impactful applications.
The Ground We Covered
I hope that with this article the reader has picked up some essential lingo used in the world of statistics and some simple methods used to make the right predictions, whether it's to sharpen your digital marketing strategy by optimizing your advertising spend for the highest ROI, or to make better personalized recommendations for your products or movies, as we saw in the Netflix example.
We covered a high-level introduction to several core concepts so that they register in your mind as essential vocabulary to build upon in the world of statistics: response (dependent) variable; independent variables (aka predictors or features); estimated values marked with a hat symbol; linear regression models with one or more predictors; residual sum of squares (RSS); mean squared error (MSE); root mean squared error (RMSE); regression lines and planes; standard errors of coefficients; confidence intervals; and the various models and methods used to solve different types of problems, especially as the number of predictors increases or the type of data changes.
In the next set of articles we will delve deeper into the exciting world of statistics, machine learning, and AI.
Acknowledgement:
Views are personal and any errors are mine.