Predict your future e-commerce sales and understand why

In a nutshell:

  • You can feed external factors such as marketing cost, web traffic or whether a date is a promotional day into the model so it can forecast more accurately. However, more is not always better: extra variables can also introduce noise.
  • Each model performs differently depending on the variables you input, the size of your dataset and the type of dates you're forecasting over (e.g. a regular week or a promotional period such as Black Friday). For example, extrapolation outside the range of the training data and imbalanced datasets are key factors to keep in mind when using Random Forest for time series, because they can lower performance.
  • Depending on the model you pick, SKForecast offers useful tools for model explainability, which helps a lot when communicating with non-technical stakeholders. For example: which factors contributed the most to my predictions?


Introduction

In the dynamic landscape of e-commerce, accurately predicting your sales is crucial for managing inventory well and, ultimately, maximizing your revenue. But that's not all: understanding the impact of marketing expenditure, especially during campaign periods, is essential for crafting effective strategies.

In other words, imagine you're trying to predict the future sales of your online store. You know that past sales can give you hints about what's going to happen, but the sales data might be influenced by many things at once, like marketing campaigns, seasonal trends, or even random events. For example, can you measure and understand all the effects that made your promotional campaign a total success, or did you just get lucky because a famous influencer randomly promoted your product? Or could you be more efficient with your marketing budget if you knew that most of the people who buy your products during Black Friday are already loyal customers? Ugh, that seems like a lot to take in. But no worries, there are ways to get closer to the perfect scenario.

I personally like three libraries that offer tools for this purpose. In this article, I will compare the forecasting models VARIMA from the Darts library, Prophet by Meta, and Random Forest from SKForecast. Why? These models offer sophisticated tools to analyze historical sales data alongside external factors such as marketing expenditure, helping to forecast sales with greater precision and to anticipate the effects of campaigns or web traffic trends on future sales.

If you want to access the full technical documentation and models, you can find everything here: https://github.com/algerza/ecommerce_time_series_forecast/tree/main



Loading the imperfect data

The data provided is synthetic and entirely fictitious. Consequently, it may not accurately reflect situations encountered in real e-commerce data. To simulate realistic scenarios, deliberate gaps have been introduced to mimic instances when servers were off or bugs occurred. While this article does not delve deeply into cleaning time-series datasets, I thought it may be useful to include these "effects" to make the fake data a bit closer to reality :)

The dataset provides several columns that we can use to make the predictions more accurate: 'date', 'users', 'sessions', 'marketing_cost', 'clicks', 'impressions', 'click_through_rate', 'avg_time_on_page', 'bouce_rate', 'units_sold', 'is_campaign_period', 'day_of_week', 'month_number'. Beware that adding all the data does not always bring better results; in fact, you may be adding more noise!
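Although cleaning is out of scope here, a minimal sketch of how gaps like the ones described above could be surfaced and filled with pandas might look like this. The tiny DataFrame and the choice of time-based linear interpolation are assumptions for illustration only, not necessarily what the repository does.

```python
import pandas as pd

# Hypothetical mini-dataset: 2024-01-03 and 2024-01-04 are missing,
# as if the tracking server had been down on those days.
raw = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]
        ),
        "units_sold": [100.0, 110.0, 140.0, 150.0],
    }
)

# Reindex to a complete daily calendar so the gaps become explicit NaNs.
daily = raw.set_index("date").asfreq("D")
missing_days = daily.index[daily["units_sold"].isna()]

# Fill the gaps by linear interpolation along the time axis
# (one option among many; forward-fill or dropping rows are alternatives).
filled = daily.interpolate(method="time")

print(list(missing_days.strftime("%Y-%m-%d")))
print(filled["units_sold"].tolist())
```

Making the gaps explicit first matters because most forecasting libraries expect a regular frequency and will otherwise silently treat the days around the gap as consecutive.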

Just to have an overview, this is what my units sold look like over time:


Darts: Forecasting with VARIMA

We will use VARIMA, which stands for Vector Autoregressive Moving Average. It is a mathematical tool that helps in making predictions by taking into account the relationships between multiple variables over time. In simpler terms, it's like looking at how different factors interacted and influenced each other in the past to forecast what might happen in the future.

Unlike the other models we will use later on, VARIMA lets us model several time series together in a single model (e.g. users and units sold), and also include other external factors (exogenous variables or regressors). How does it work? It is composed of two parts:

  • Vector Autoregression (VAR): This part looks at how each variable, like sales, marketing spending, or website traffic, depends on its own past values and the past values of other variables. For example, it might find that sales today are influenced not only by sales yesterday but also by the amount of money spent on advertising two days ago.
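  • Moving Average (MA): This part models each variable as depending on past forecast errors, i.e. the gap between what the model predicted and what actually happened, which helps it absorb short-lived shocks. Despite the name, it is not the same as smoothing the data with a rolling average.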

Combining these two methods, VARIMA can capture complex relationships and patterns in the data, allowing for more accurate forecasts. In the context of e-commerce sales forecasting, this is crucial because online sales can be affected by various factors like promotions, seasonal trends, or even external events like holidays or economic changes.
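To make the autoregressive part more concrete, here is a minimal sketch that fits a VAR(1) model by ordinary least squares using only numpy. This is not how Darts implements VARIMA (which also covers the moving-average part and exogenous variables); the two simulated series and the coefficient matrix are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two interacting series (think: users and units sold)
# from a known VAR(1) process: y_t = A @ y_{t-1} + noise
A_true = np.array([[0.5, 0.2],
                   [0.3, 0.4]])
n_steps = 2000
y = np.zeros((n_steps, 2))
for t in range(1, n_steps):
    y[t] = A_true @ y[t - 1] + rng.normal(scale=1.0, size=2)

# Fit by least squares: regress today's values on yesterday's values.
X_prev, Y_next = y[:-1], y[1:]
coef, *_ = np.linalg.lstsq(X_prev, Y_next, rcond=None)
A_hat = coef.T  # lstsq solves X_prev @ coef = Y_next, so A is the transpose

# One-step-ahead forecast for both series at once.
next_step = A_hat @ y[-1]

print(np.round(A_hat, 2))
print(next_step)
```

The off-diagonal entries of the estimated matrix are what capture "series 2 yesterday influences series 1 today", which is exactly the cross-variable dependence a single-series model cannot see.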

And now on to the practical part: the model learns from these variables and predicts the number of units sold based on future inputs like marketing cost and the campaign period label (i.e. is it a date when we plan to run a marketing campaign?). It is advisable to add these exogenous variables or regressors to improve both the accuracy and the explainability of the model. This is the output of VARIMA:




It performed well at forecasting sales during a campaign period, which is always challenging. Let's see how other models do!


Machine Learning approach: Random Forest

Machine Learning methods like Random Forest offer a flexible and data-driven approach to e-commerce sales forecasting, leveraging algorithms to learn patterns from historical data and incorporate various factors that influence sales.

Here I use SKForecast. Some of its main benefits, and why I like this library:

  • ForecasterAutoreg incorporates an autoregressive component, which means it considers how past values of the target variable (i.e. units sold) influence its future values. So it learns that sales today are related to sales in the previous days or weeks (lags). To explain the predictions, you can obtain the feature importance and/or calculate the SHAP values, which is extremely useful when communicating with non-technical stakeholders.
  • It is very simple to input exogenous variables with SKForecast. Variables such as marketing spending, website traffic, or external factors like holidays can provide valuable context and improve the accuracy of forecasts by capturing additional information.

In this case, we will use a different approach compared to VARIMA - Random Forest. Why? Time series data often exhibit non-linear relationships and complex interactions between variables. By using decision trees, it can naturally handle these non-linearities without requiring explicit feature engineering or transformation of the data.

However, Random Forest is generally not well-suited for extrapolation outside the range of the training data. This means it may struggle to predict future values that differ significantly from the patterns observed in the training set. Moreover, when dealing with imbalanced datasets (where one class is much more frequent than the others, for instance regular days versus campaign days), Random Forest may favor the majority class and perform worse on the minority ones. Techniques like class weighting or resampling may be required to address this.
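The extrapolation limit is easy to see in a small experiment. The sketch below uses scikit-learn's RandomForestRegressor directly (not SKForecast) on simulated data where sales grow linearly with marketing cost; the numbers and the linear relationship are assumptions chosen purely to illustrate the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: sales grow linearly with marketing cost,
# but we only ever observed costs between 0 and 100.
rng = np.random.default_rng(42)
cost_train = rng.uniform(0, 100, size=500).reshape(-1, 1)
sales_train = 3.0 * cost_train.ravel() + rng.normal(scale=5.0, size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(cost_train, sales_train)

# Ask for a prediction far outside the training range.
cost_new = np.array([[200.0]])
prediction = model.predict(cost_new)[0]

print(f"Highest sales seen in training: {sales_train.max():.1f}")
print(f"Prediction for cost=200:        {prediction:.1f}")
print(f"What the linear trend implies:  {3.0 * 200:.1f}")
```

Because each tree predicts an average of training targets, the forest can never output more than the highest value it saw during training. If your next campaign budget is much larger than anything in your history, expect a tree-based model to underestimate.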

This time, I will input only the marketing expenditure, units sold and campaign period label, and the output of SKForecast is quite nice for this timeframe!




Additionally, I really like that SKForecast provides the function get_feature_importances() so you can see which features influenced the prediction the most. This is veeeery useful to discuss with non-technical stakeholders.

What we can see is that marketing expenditure has the greatest impact on predicting sales (i.e. the higher the cost, the higher the sales). Beware that this may not represent a real-world scenario. Continuing with the chart, the next feature is the campaign period label, representing 20% of the impact on the forecast. Next are the lags. What are those? lag_1 means the model uses the units sold on the previous day to make its prediction, lag_2 the units sold 2 days ago, and so on.
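For readers who want to see what lags and feature importance look like under the hood, here is a minimal sketch that builds lag columns by hand with pandas and reads the importances from a scikit-learn Random Forest. It does not use SKForecast (which creates the lags for you), and the simulated data and coefficients are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical daily data; the numbers are simulated purely for illustration.
rng = np.random.default_rng(1)
n_days = 300
df = pd.DataFrame(
    {
        "marketing_cost": rng.uniform(50, 150, size=n_days),
        "is_campaign_period": rng.integers(0, 2, size=n_days),
    },
    index=pd.date_range("2023-01-01", periods=n_days, freq="D"),
)
df["units_sold"] = (
    2.0 * df["marketing_cost"]
    + 80.0 * df["is_campaign_period"]
    + rng.normal(scale=10.0, size=n_days)
)

# lag_k is simply the target shifted k days into the past.
for k in (1, 2, 3):
    df[f"lag_{k}"] = df["units_sold"].shift(k)
df = df.dropna()  # the first rows have no history yet

features = ["marketing_cost", "is_campaign_period", "lag_1", "lag_2", "lag_3"]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(df[features], df["units_sold"])

# Impurity-based importances: they sum to 1 across all features.
importance = pd.Series(model.feature_importances_, index=features)
print(importance.sort_values(ascending=False))
```

Note that these importances say how much a feature was used to split the data, not the direction of its effect; for the direction, SHAP values or a partial dependence plot are a better fit.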


Finally, Prophet from Meta

Prophet offers a robust and flexible framework for time-series forecasting, leveraging Bayesian methods to capture trends, seasonality, and holiday effects while providing measures of uncertainty. Its intuitive interface and automatic model fitting make it accessible to users with varying levels of expertise, making it a popular choice for forecasting applications across different industries. How does it work?

  • Trend Modeling: Prophet captures the overall trend in the time-series data using a piecewise linear or logistic function. Instead of assuming a single global trend, Prophet allows for changes in trend direction at specific points in time, known as changepoints. These changepoints are automatically detected based on the data or can be manually specified by the user.
  • Seasonality Modeling: Prophet accounts for periodic fluctuations or seasonal patterns in the data. By default it can fit daily, weekly, and yearly seasonalities using Fourier series expansions, and other periods (for example monthly) can be added manually. This allows Prophet to capture the repetitive patterns that occur at fixed intervals within the time series.
  • Holiday Effects: Prophet allows users to specify custom holidays and events that may impact the time series. These holidays can have both positive and negative effects on the data, and Prophet includes them as additional components in the forecasting model.
  • Uncertainty Estimation: Prophet provides uncertainty intervals around the forecasted values, allowing users to assess the reliability of the forecasts. By default these intervals reflect the uncertainty in the future trend and the noise in the observations; uncertainty in the seasonality parameters is only included when full Bayesian sampling is enabled.
  • Automatic Model Fitting: Prophet automatically fits the forecasting model to the data using a Bayesian framework. This involves estimating the parameters of the trend, seasonality, and holiday components. By default Prophet uses an optimization algorithm to find the most likely parameter values; optionally, it can use Markov Chain Monte Carlo (MCMC) methods to sample from the full posterior distribution of the parameters.
  • Forecasting: Once the model is fitted to the data, Prophet generates forecasts for future time periods based on the learned patterns and trends. These forecasts include point estimates as well as uncertainty intervals, providing users with a range of possible outcomes.
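To give a feel for the trend-plus-Fourier-seasonality idea listed above, here is a heavily simplified sketch in plain numpy. It fits a single straight-line trend and a weekly Fourier series by ordinary least squares; Prophet itself fits a piecewise trend with changepoints inside a Bayesian model, so treat this only as an illustration of the decomposition. The helper name fourier_features and the simulated series are hypothetical.

```python
import numpy as np

def fourier_features(t, period, order):
    """Return sin/cos columns for the first `order` harmonics of `period`."""
    columns = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

# Hypothetical daily series: linear trend + weekly pattern + noise.
rng = np.random.default_rng(0)
t = np.arange(364, dtype=float)  # day index, 52 full weeks
weekly_true = 20.0 * np.sin(2.0 * np.pi * t / 7.0)
y = 100.0 + 0.5 * t + weekly_true + rng.normal(scale=2.0, size=t.size)

# Design matrix: intercept, linear trend, and weekly Fourier terms.
X = np.column_stack(
    [np.ones_like(t), t, fourier_features(t, period=7.0, order=3)]
)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Split the fitted values back into their components.
trend_fit = X[:, :2] @ beta[:2]
weekly_fit = X[:, 2:] @ beta[2:]

print("Estimated slope:", float(beta[1]))
print("Max abs error on weekly component:",
      float(np.abs(weekly_fit - weekly_true).max()))
```

Being able to read the trend and the weekly pattern as separate curves is a big part of why this kind of additive model is easy to explain to stakeholders.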

Prophet did alright, but it did not beat SKForecast when forecasting during campaign periods:



Compare the 3 models

As you could see, sometimes the model may predict too high, and sometimes too low. In order to determine which is the best performing model for our data and the selected time frame, we will use two important measures:

  • Mean Absolute Error (MAE): This tells us how far off, on average, our predictions were from the actual sales numbers. Lower MAE means our guesses were closer to the real sales numbers, which is what we want.
  • Root Mean Squared Error (RMSE): This is similar to MAE, but it emphasizes bigger mistakes more. It helps us understand how much our predictions varied from the actual sales, considering both small and large mistakes.

By looking at both MAE and RMSE, we can figure out which method does the best job of predicting sales. Ideally, we want a model with both low MAE and RMSE because that means it's making the most accurate predictions overall.
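As a quick illustration of how the two metrics differ, here is a minimal numpy sketch. The sales figures and the two "models" are invented: both have the same MAE, but the one with a single large miss gets a worse RMSE.

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average size of the miss."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root Mean Squared Error: like MAE, but large misses weigh more."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical units sold over four days; the last day is a campaign peak.
actual = [100, 120, 130, 400]
model_a = [110, 110, 140, 390]  # off by 10 units every day
model_b = [100, 120, 130, 360]  # perfect except for one 40-unit miss

print(mae(actual, model_a), rmse(actual, model_a))  # 10.0 10.0
print(mae(actual, model_b), rmse(actual, model_b))  # 10.0 20.0
```

If missing a campaign peak is much more costly for you than being slightly off every day, RMSE is the metric that will surface that difference.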


Conclusions

  • Results: For our particular case, during this specific range of time that includes marketing campaign periods, Random Forest performed better than the other models, with a Mean Absolute Error (MAE) of approximately 78.51 and a Root Mean Squared Error (RMSE) of about 92.57. This means that, on average, its predictions were off by around 78.51 units.
  • Considerations: Extrapolation outside the range of the training data and imbalanced datasets are key factors to keep in mind while using Random Forest for time-series.
  • Test different time frames: Some models may perform better on outlier days or weeks (e.g. Black Friday) but may underperform in 'regular' seasons. It may be worth comparing how the different approaches perform over time to get the best results. For example, I tested other time frames without marketing campaign periods and VARIMA from Darts seemed to perform better.

As always in data, the choice between models and libraries is not obvious. Each method has its strengths and weaknesses, so it's essential to consider your specific forecasting needs and the characteristics of your data when selecting the most suitable tool for your application.


Hope you enjoyed the article! You can also find me at:

