Predict your future e-commerce sales and understand why
In a nutshell:
Introduction
In the dynamic landscape of e-commerce, accurately predicting your sales is crucial for managing inventory well and, ultimately, maximizing your revenue. But that's not all: understanding the impact of marketing expenditure, especially during campaign periods, is essential for crafting effective strategies.
In other words, imagine you're trying to predict the future sales of your online store. You know that past sales can give you hints about what's going to happen, but the sales data might be influenced by many things at once, like marketing campaigns, seasonal trends, or even random events. For example, can you measure and understand all the effects that made your promotional campaign a total success, or did you just get lucky because a famous influencer randomly promoted your product? Or could you become more efficient with your marketing budget if you knew that most of the people who buy your products during Black Friday are already loyal customers? Ugh, that seems like a lot to take in. But no worries, there are ways to get closer to the perfect scenario.
I personally like three libraries that offer tools for this purpose. In this article, I will compare three forecasting models: VARIMA from the Darts library, Prophet by Meta, and Random Forest from SKForecast. Why? These models offer sophisticated tools to analyze historical sales data alongside external factors such as marketing expenditure, helping you forecast sales with greater precision and anticipate the effects of campaigns or web traffic trends on future sales.
If you want to access the full technical documentation and models, you can find everything here: https://github.com/algerza/ecommerce_time_series_forecast/tree/main
Loading the imperfect data
The data provided is synthetic and entirely fictitious. Consequently, it may not accurately reflect situations encountered in real e-commerce data. To simulate realistic scenarios, deliberate gaps have been introduced to mimic instances when servers were off or bugs occurred. While this article does not delve deeply into cleaning time-series datasets, I thought it may be useful to include these "effects" to make the fake data a bit closer to reality :)
The dataset provides several columns that we can use to make the predictions more accurate: 'date', 'users', 'sessions', 'marketing_cost', 'clicks', 'impressions', 'click_through_rate', 'avg_time_on_page', 'bouce_rate', 'units_sold', 'is_campaign_period', 'day_of_week', 'month_number'. Beware that adding all the data does not always bring better results. In fact, you may just be adding more noise!
Just to have an overview, this is what my units sold look like over time:
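To give an idea of how the deliberate gaps mentioned above could be handled, here is a minimal sketch using pandas. The tiny data frame and the helper name `fill_daily_gaps` are hypothetical and only for illustration. Time-based interpolation is one possible choice, not necessarily what the linked repository does.

```python
import pandas as pd

# Hypothetical daily data with two missing days (2024-01-03 and 2024-01-04),
# standing in for the "server was off" gaps in the synthetic dataset.
raw = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]
        ),
        "units_sold": [10.0, 12.0, 18.0, 20.0],
    }
)


def fill_daily_gaps(df: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Reindex to a complete daily calendar and interpolate missing values."""
    indexed = df.set_index("date").sort_index()
    full_range = pd.date_range(indexed.index.min(), indexed.index.max(), freq="D")
    out = indexed.reindex(full_range)
    # Keep a flag so the filled days can still be told apart later.
    out["was_missing"] = out[value_col].isna()
    out[value_col] = out[value_col].interpolate(method="time")
    out.index.name = "date"
    return out


clean = fill_daily_gaps(raw, "units_sold")
print(clean)
```

Keeping the `was_missing` flag makes it possible to check later whether the filled days distort the forecast.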
Darts: Forecasting with VARIMA
We will use VARIMA, which stands for Vector Autoregressive Integrated Moving Average. It is a statistical model that makes predictions by taking into account the relationships between multiple variables over time. In simpler terms, it looks at how different factors interacted and influenced each other in the past to forecast what might happen in the future.
Unlike the other models we will use later on, VARIMA lets us model several time series together in a single model (e.g. users and units sold) and also include other external factors (exogenous variables or regressors). How does it work? It is composed of two methods:
Combining these two methods, VARIMA can capture complex relationships and patterns in the data, allowing for more accurate forecasts. In the context of e-commerce sales forecasting, this is crucial because online sales can be affected by various factors like promotions, seasonal trends, or even external events like holidays or economic changes.
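To make the "vector autoregressive" idea more tangible, here is a minimal sketch of a VAR(1) model fitted by ordinary least squares with NumPy. It is a deliberately simplified stand-in: it has no moving-average or differencing component and no exogenous variables, and it is not how Darts fits VARIMA internally. The simulated series, the matrix `A_true`, and the helper names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two interacting series (think: users and units sold) from a
# known VAR(1) process: y_t = A @ y_{t-1} + noise.
A_true = np.array([[0.6, 0.2],
                   [0.3, 0.5]])
n_steps = 2000
y = np.zeros((n_steps, 2))
for t in range(1, n_steps):
    y[t] = A_true @ y[t - 1] + rng.normal(scale=0.1, size=2)


def fit_var1(series: np.ndarray) -> np.ndarray:
    """Estimate A in y_t = A @ y_{t-1} by ordinary least squares."""
    past, present = series[:-1], series[1:]
    # lstsq solves past @ B = present, where B is the transpose of A.
    coef, *_ = np.linalg.lstsq(past, present, rcond=None)
    return coef.T


def forecast_var1(A: np.ndarray, last: np.ndarray, steps: int) -> np.ndarray:
    """Roll the fitted model forward, feeding each prediction back in."""
    preds = []
    current = last
    for _ in range(steps):
        current = A @ current
        preds.append(current)
    return np.array(preds)


A_hat = fit_var1(y)
forecast = forecast_var1(A_hat, y[-1], steps=7)
print("Estimated coefficient matrix:\n", A_hat.round(2))
print("7-step forecast shape:", forecast.shape)
```

The off-diagonal entries of the estimated matrix are what capture one series influencing the other, which is the part a single-series model cannot express.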
And now on to the practical part: the model learns from these variables and predicts the number of units sold based on future inputs like marketing cost and the campaign period label (i.e. is it a date when we plan to run a marketing campaign?). Adding these exogenous variables or regressors is advisable, both to improve the model's accuracy and to make it easier to explain. This is the output of VARIMA:
It performed reasonably well at forecasting sales during a campaign period, which is always challenging. Let's see how the other models do!
Machine Learning approach: Random Forest
Machine Learning methods like Random Forest offer a flexible and data-driven approach to e-commerce sales forecasting, leveraging algorithms to learn patterns from historical data and incorporate various factors that influence sales.
Some of the main benefits and why I like this library:
In this case, we will use a different approach from VARIMA: Random Forest. Why? Time series data often exhibit non-linear relationships and complex interactions between variables. Because Random Forest is built from decision trees, it can handle these non-linearities naturally, without requiring explicit feature engineering or transformation of the data.
However, Random Forest is generally not well suited to extrapolating outside the range of the training data, so it may struggle to predict future values that differ significantly from the patterns seen in the training set. A related caveat comes from classification problems: with imbalanced datasets (where one class is much more frequent than the others), Random Forest may favor the majority class and perform worse on minority classes, and techniques like class weighting or resampling may be required. In a forecasting setting, the analogous risk is that rare situations, such as the few campaign days in the data, are underrepresented during training.
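The extrapolation limit is easy to see in a small experiment. The sketch below uses scikit-learn's `RandomForestRegressor` directly (not SKForecast) on hypothetical data where sales are exactly twice the marketing cost. Because each tree predicts an average of training targets, the forest cannot output a value above the largest one it has seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: sales grow linearly with marketing cost (0..99).
marketing_cost = np.arange(100, dtype=float).reshape(-1, 1)
units_sold = 2.0 * marketing_cost.ravel()

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(marketing_cost, units_sold)

# One point inside the training range and one far outside it.
inside = model.predict(np.array([[50.0]]))[0]
outside = model.predict(np.array([[200.0]]))[0]

print(f"cost=50  -> predicted {inside:.1f} (true value 100)")
print(f"cost=200 -> predicted {outside:.1f} (true value 400)")
print(f"largest target seen in training: {units_sold.max():.1f}")
```

If you plan a campaign budget far above anything in your history, a forest-based forecast will most likely underestimate the effect for this reason.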
This time, I will input only the marketing expenditure, units sold and campaign period label, and the output of SKForecast is quite nice for this timeframe!
Additionally, I really like that SKForecast provides the function get_feature_importances() so you can obtain the main parameters that influenced the prediction. This is veeeery useful to discuss with non-technical stakeholders.
What we can see is that marketing expenditure has the greatest impact on predicting sales (i.e. the higher the cost, the higher the sales). Beware that this may not represent a real-world scenario. Continuing with the chart, the next parameter is the campaign period label, which accounts for 20% of the impact on the forecast. Next come the lags. What are those? lag_1 means that the model predicts the units sold based on the units sold on the previous day, lag_2 on the units sold 2 days ago, and so on.
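To show what lag features and feature importances look like in practice, here is a minimal sketch that builds the lags by hand with pandas and reads `feature_importances_` from a scikit-learn Random Forest. It does not use SKForecast, and the synthetic data (where marketing cost is made the main driver on purpose) is a hypothetical stand-in for the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_days = 200

# Hypothetical data: units sold are driven mostly by marketing cost,
# with a smaller boost on campaign days (first 5 days of every 30).
df = pd.DataFrame(
    {
        "marketing_cost": rng.uniform(50, 150, n_days),
        "is_campaign_period": (np.arange(n_days) % 30 < 5).astype(int),
    }
)
df["units_sold"] = (
    0.8 * df["marketing_cost"]
    + 20 * df["is_campaign_period"]
    + rng.normal(0, 2, n_days)
)

# lag_k holds the units sold k days earlier.
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["units_sold"].shift(lag)

# The first rows have no history for the lags, so drop them.
train = df.dropna()
features = ["marketing_cost", "is_campaign_period", "lag_1", "lag_2", "lag_3"]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["units_sold"])

importances = pd.Series(
    model.feature_importances_, index=features
).sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the total, which is what makes a chart like this easy to discuss with stakeholders.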
Finally, Prophet from Meta
Prophet offers a robust and flexible framework for time-series forecasting, leveraging Bayesian methods to capture trends, seasonality, and holiday effects while providing measures of uncertainty. Its intuitive interface and automatic model fitting make it accessible to users with varying levels of expertise, making it a popular choice for forecasting applications across different industries. How does it work?
Prophet did alright, but it did not beat SKForecast when forecasting campaign periods:
Compare the 3 models
As you could see, a model may sometimes predict too high and sometimes too low. To determine which model performs best for our data and the selected time frame, we will use two important measures: the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE).
By looking at both MAE and RMSE, we can figure out which method does the best job of predicting sales. Ideally, we want a model with both low MAE and RMSE because that means it's making the most accurate predictions overall.
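For reference, both metrics take only a few lines of NumPy. The sketch below uses made-up numbers (not the article's results) to show why it helps to look at both: the two hypothetical forecasts have the same MAE, but RMSE penalizes the one with a single large miss.

```python
import numpy as np


def mae(actual, predicted) -> float:
    """Mean Absolute Error: the average size of the errors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


def rmse(actual, predicted) -> float:
    """Root Mean Squared Error: like MAE, but large errors weigh more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up example values, for illustration only.
actual = [100, 120, 90, 150]
forecast_a = [110, 110, 100, 140]  # off by 10 units every day
forecast_b = [100, 120, 90, 110]   # perfect except for one 40-unit miss

print("A:", mae(actual, forecast_a), rmse(actual, forecast_a))  # MAE 10.0, RMSE 10.0
print("B:", mae(actual, forecast_b), rmse(actual, forecast_b))  # MAE 10.0, RMSE 20.0
```

A large miss on a campaign day can be very costly in terms of stock, so the gap between MAE and RMSE is worth checking for each model.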
Conclusions
As always in data, the choice between models and libraries is not obvious. Each method has its strengths and weaknesses, so it's essential to consider your specific forecasting needs and the characteristics of your data when selecting the most suitable tool for your application.
Hope you enjoyed the article! You can also find me at: