Forecasting wine demand through time series analysis


Click here to read this article in Portuguese.


Time series analysis aims to identify patterns and forecast trends to support decision-making. It can be applied in many fields to understand how a quantity behaves over time.

In this article, I present the main results obtained in building the best model to predict the demand for wines in a company specialized in this product.


* Note

This is a summarized article that shows the main results.

To check the full study, including the codes and methodology used, click here.


Summary

  1. About the Project
  2. General Objective
  3. Dataset
  4. Exploratory Data Analysis
  5. Feature Engineering
  6. Business Analysis
  7. Demand Forecasting: 7.1. Transformation of the Series into Stationary; 7.2. Model Creation
  8. Model Evaluation
  9. Conclusion


1. About the Project

Being able to predict demand for a product or service is a strategic technique that can assist in decision-making, planning, inventory management, resource optimization, and even customer satisfaction. There are many applications where this type of tool can be applied, across companies in various sectors.

This type of problem is typically addressed through time series analysis, which involves data distributed at sequential and regular intervals over time. This means that data is collected at time intervals such as hourly, daily, monthly, etc., and each new data point depends on the previous ones. Without the temporal connection of the data, the problem could be solved using linear regression (Analytics Vidhya, 2018).


2. General Objective

To develop a machine learning-based algorithm for predicting the demand for wines in a company specializing in this product.



3. Dataset

The data used in this project was provided by Rafael Duarte and consists of two files: one containing historical sales data and the other containing information about the wines. It includes daily sales data from three stores, with 219 products in stock over a period of three years (from January 2018 to December 2020).


4. Exploratory Data Analysis

This is an essential step in data science projects, aiming to gain a better understanding of the data by identifying patterns, outliers, potential relationships between variables, and more.

Among the most important findings are:

  • The products originate from six different countries, with France representing 70%, followed by Italy with 10%, and Spain with 8%.
  • Red wine accounts for 60% of the dataset, followed by white wine at 31.5%, and sparkling wine at 4.6%.
  • There are a total of 58 different producers, with Domaine Ponsot being the most prominent at 5.5%, followed by La Chablisienne at 4.6%, and Domaine Matrot at 4.1%.
  • The cheapest wine is the Spanish sparkling wine "Cava Juvé & Camps Cinta Purpura Reserva Brut," priced at approximately 9 dollars.
  • The most expensive wine is the French red "Domaine Ponsot Clos de La Roche Grand Cru Cuvee Vieilles Vignes - Magnum" from the 2017 vintage, with an approximate cost of 1,900 dollars.


5. Feature Engineering

To improve the performance of the machine learning models, feature engineering was applied. The temporal information was broken down into 8 new attributes. Additionally, the total revenue for each product per day was calculated, producing 2 further attributes: one for the value in Brazilian reais and another for the value in US dollars.
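As an illustration of this step, here is a minimal sketch in pandas. The column names (`units_sold`, `price_brl`, `brl_to_usd`) and the exchange-rate column are assumptions, not the dataset's actual schema, and the eight calendar attributes shown are one plausible breakdown:

```python
import pandas as pd

# Hypothetical sales rows; the real dataset has daily sales per store/product.
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2019-07-20"]),
    "units_sold": [3, 5],
    "price_brl": [120.0, 80.0],
    "brl_to_usd": [0.31, 0.27],  # assumed exchange-rate column
})

# Break the timestamp into separate calendar attributes.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["quarter"] = df["date"].dt.quarter
df["day_of_year"] = df["date"].dt.dayofyear
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

# Daily revenue per product in both currencies.
df["revenue_brl"] = df["units_sold"] * df["price_brl"]
df["revenue_usd"] = df["revenue_brl"] * df["brl_to_usd"]
```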

A statistical analysis of these new attributes showed an average daily revenue of 12,138 dollars. The standard deviation is high, however, and the median (5,051 dollars) is well below the mean, which indicates the presence of outliers. Since extreme values are common in this market, these data points were retained so the data remains representative.


6. Business Analysis

After merging the two datasets into one and performing the necessary data cleaning and preprocessing, a more in-depth analysis of the data was conducted. The goal was to discover relevant insights that would allow the company to better understand its demand and use this information to improve its sales, offerings, inventory management, and ultimately increase its profits.


  • Wines: Best-Selling (by units) vs. Highest Revenues. The wine with the highest number of units sold during the period also generated the highest revenue: the Domaine Ponsot Chapelle-Chambertin Grand Cru. At 656 dollars per bottle, it is also one of the most expensive wines sold by the company.
  • Producers: Median Prices (per bottle) vs. Number of Offered Products (per producer). The producer with the highest median price per bottle, at 656 dollars, also offers the highest number of products: Domaine Ponsot.
  • Producers: Best-Selling (per bottle) vs. Highest Revenues. Once again, Domaine Ponsot takes the lead.
  • Region: Number of Offered Products vs. Highest Revenues. The French region of Burgundy generates the highest revenue and also has the most products offered by the company, with a total of 70 different labels. It's worth noting that the Ribera del Duero region in Spain, with only 5 products, ranked third in revenue generated.


7. Demand Forecasting

7.1. Transformation of the Time Series into Stationary

After the data was properly prepared, the series was checked for stationarity using the Augmented Dickey-Fuller (ADF) test. With a p-value of 0.1533, well above the usual 0.05 significance level, the series was deemed non-stationary. Therefore, the following treatments were applied to transform it into a stationary series:

1. Log transformation to reduce the magnitude of the values in the series.

2. Subtraction of the 30-day moving average from the log-transformed series.

3. Finally, differencing.

In summary, the treatment applied to make the series stationary involved removing the trend and seasonality from the time series data. After these transformations, the new ADF test yielded a p-value of approximately 1.2 × 10⁻¹⁹.

Below, you can see the series before and after the treatments.

Comparison of the series after data treatments (from left to right): original series, series with log transformation and subtraction of the 30-day moving average, and finally, the series with differencing.


It's worth noting that a stationary time series means that the dataset exhibits constant statistical characteristics over time. In other words, it has constant mean, variance, and covariance within the time interval. Working with stationary series is essential because most statistical methods assume the premise of dealing with a stationary series for their calculations.

After this, the dataset was split into training and validation data to allow for the proper evaluation of models later on. The chosen period was 120 days, and it's important to remember that the longer the forecast horizon, the higher the chances of the model making errors. Therefore, shorter forecasting periods should result in smaller errors.
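Assuming the series is held in a pandas Series with a daily index, the 120-day holdout described above is a simple slice, with the last 120 observations reserved for validation:

```python
import numpy as np
import pandas as pd

# Placeholder daily series standing in for the treated sales data.
idx = pd.date_range("2018-01-01", periods=1000, freq="D")
series = pd.Series(np.random.default_rng(0).normal(size=1000), index=idx)

# Train on everything except the last 120 days; validate on those 120 days.
horizon = 120
train, valid = series.iloc[:-horizon], series.iloc[-horizon:]
print(len(train), len(valid))
```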


7.2. Model Creation

Several models were developed to evaluate their performance:

  • Naive Approach
  • 3-Day Moving Average
  • Holt’s Linear Trend Model
  • ARIMA
  • Prophet
  • PyCaret

It's important to highlight that PyCaret is an automated machine learning (AutoML) library, which in this case built and compared 27 different models. The best-performing model it selected was the Huber w/ Cond. Deseasonalize & Detrending.


8. Model Evaluation

The final step is performance evaluation, which involves assessing how well a model's predictions match the actual outcomes. To facilitate comparison, the techniques used to make the series stationary were reversed.

First, let's present the results of the following models: Naive Approach (in pink), 3-day Moving Average (in blue), and Holt (in purple). The gray area represents the training period for the models, and the light blue represents the actual data for the period the models predicted. In this way, you can see that the Holt algorithm came closest to reality.


Next, we have a graph with the models ARIMA (in pink), Prophet (in purple), and PyCaret (in orange). The three track each other closely until around 01-11-2020, when Prophet starts to diverge from the other two. From that point on, Prophet appears to follow the actual data more closely.



In a second step, the models were evaluated using two error metrics to obtain a statistically sounder assessment of which model performed best. The Mean Absolute Error (MAE) is the average of the absolute differences between the forecasts and the actual series; the lower its value, the better the model. The Mean Absolute Percentage Error (MAPE) expresses the same deviation in percentage terms, making it the percentage equivalent of MAE.
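These two metrics can be computed directly; this is a minimal sketch with made-up numbers, not the study's exact implementation:

```python
import numpy as np

def mae(actual, forecast):
    # Mean Absolute Error: average magnitude of the forecast errors.
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs(actual - forecast))

def mape(actual, forecast):
    # Mean Absolute Percentage Error: errors relative to the actual values.
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs((actual - forecast) / actual))

actual = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 315.0]
print(mae(actual, forecast))   # ≈ 11.67
print(mape(actual, forecast))  # ≈ 0.0667
```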

The results obtained were:

Assessment Metrics

Model                   MAE       MAPE
Naive Approach          1631.18   0.0357
3-Day Moving Average    1428.87   0.0309
Holt                    1370.60   0.0290
ARIMA                   1089.11   0.0233
Prophet                 1346.82   0.0287
PyCaret                 1756.81   0.0383

Therefore, the best model was generated by the ARIMA algorithm, with an error rate of only 2.33%, although Prophet was not far behind at 2.87%. It's worth noting that Prophet executes much faster than ARIMA, which can be an important consideration depending on the use case.

It's also important to highlight the result obtained by Holt, with a 2.90% error rate. Considering that Holt is a simpler algorithm and achieved results very close to Prophet, it could be an excellent alternative for practical applications.


9. Conclusion

The central objective of this study was to develop an algorithm capable of predicting the demand for wines in a store specializing in this product.

After proper data preprocessing and attribute engineering, various analyses were conducted to understand the business. Subsequently, predictive demand algorithms were developed.


Infographic with the main points of this study


Since we were dealing with a non-stationary time series, according to the Augmented Dickey-Fuller (ADF) test, a transformation of the series into stationary was performed to achieve better results in some algorithms.

The forecasting had a parameter of 120 days, and the following methods were used: Naive approach, Moving Average, Holt’s Linear Trend Model, ARIMA, Prophet, and PyCaret. As a result, based on MAPE (Mean Absolute Percentage Error), the ARIMA model performed the best with an error rate of only 2.33%, followed by Prophet with 2.87%, and Holt with 2.90%.

When applying the model, it's essential to consider the specific objective it will serve, as there are algorithms among these options that run more efficiently than others.


Get to know more about this study

This study is available on Google Colab and on GitHub. Just click on the images below to be redirected.


[LoffredoDS] Demand forecasting with Time Series.ipynb


raffaloffredo/demand_forecasting_with_time_series



Let's Connect!
