Time Series Analysis: The Case of Corporation Favorita
“Forecasting Sales in Retail: A Machine Learning Approach for Corporation Favorita”
Introduction
Accurate sales forecasting is crucial in retail for optimizing inventory, staffing, and business strategy. This project develops a machine learning model to predict sales at Favorita stores by analyzing trends, customer behavior, and external factors like holidays and promotions.
Our insights will enable our client to optimize inventory, refine marketing strategies, and enhance profitability. We’ve created both statistical and machine learning models to ensure accuracy and flexibility.
Following the CRISP-DM framework, we leveraged data from marketing and sales teams to develop and validate predictive models.
For technical details, visit my GitHub repository. For a project summary and insights, explore my Power BI dashboard.
CRISP-DM FRAMEWORK
Objective of the Project
The primary goal was to design and deploy machine learning models that accurately predict sales across Corporation Favorita locations, enabling the company to optimize inventory management and enhance profitability.
Goal of the Project
The ultimate goal was to build models that more accurately predict the unit sales for thousands of items sold at different Favorita stores.
Note:
- Transferred holidays: Officially fall on one day but are celebrated on another (e.g., Independencia de Guayaquil moved from Oct 9 to Oct 12).
- Bridge days: Extra days added to holidays to extend breaks.
- Work Days: Days not normally scheduled for work (e.g., Saturdays) that are worked to compensate for Bridge days.
- Additional holidays: Extra days added to regular holidays (e.g., Christmas Eve).
Important context:
- Public sector wages are paid twice a month, on the 15th and on the last day of the month, potentially impacting supermarket sales.
- The 2016 Ecuador earthquake (magnitude 7.8) led to relief efforts and altered supermarket sales for several weeks.
Data Sources
The datasets for this project were extracted from three sources:
2. Second Dataset: Downloaded from OneDrive, this dataset includes two tables:
3. Third Dataset: Downloaded from a GitHub repository, this dataset includes two tables:
Hypothesis Testing: Impact of Promotional Activities on Sales
To understand the effectiveness of promotional activities, we conducted hypothesis testing on the data. Our hypotheses are:
Hypothesis testing revealed a statistically significant relationship between promotional activity and Favorita store sales, so we rejected the null hypothesis. Further examination showed that sales were consistently higher with promotions than without them. These results underline the role of well-planned promotions in driving sales growth and profitability. The accompanying chart reinforces this finding, showing a pronounced rise in sales during promotional periods.
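The article does not say which statistical test was applied. As one possible illustration, the sketch below runs a one-sided permutation test on the difference in mean sales between promoted and non-promoted observations. The function name `permutation_pvalue`, the variable names, and the synthetic numbers are all assumptions made for this example; they are not Favorita data or the project's actual code.

```python
import numpy as np

def permutation_pvalue(promo_sales, regular_sales, n_permutations=5000, seed=0):
    """One-sided permutation test: are mean promo sales higher than mean regular sales?"""
    rng = np.random.default_rng(seed)
    promo = np.asarray(promo_sales, dtype=float)
    regular = np.asarray(regular_sales, dtype=float)
    observed = promo.mean() - regular.mean()

    pooled = np.concatenate([promo, regular])
    n_promo = promo.size
    count = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis, promotion does not matter
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_promo].mean() - shuffled[n_promo:].mean()
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (count + 1) / (n_permutations + 1)

# Synthetic illustration only (hypothetical values)
rng = np.random.default_rng(42)
promo_sales = rng.normal(loc=120, scale=20, size=200)
regular_sales = rng.normal(loc=100, scale=20, size=200)

observed_diff, p_value = permutation_pvalue(promo_sales, regular_sales)
print(f"difference in means: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

A small p-value here would support rejecting the null hypothesis that promotions make no difference to mean sales. A permutation test is used in this sketch because it makes no normality assumption, which suits skewed retail sales data.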
Data Analysis and Model Building
The project involved several steps, including data cleaning, feature engineering, model selection, and validation. The data analysis focused on aligning sales data with promotional periods to evaluate their impact. We then built and validated machine learning models to forecast product sales.
Analytical Questions
We crafted targeted questions to explore how various influences affect the corporation’s performance. By examining specific operational and market factors, these questions aim to uncover insights into the business landscape.
1. Is the train dataset complete (has all the required dates)?
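A completeness check of this kind can be sketched with pandas by comparing the observed dates against a full daily calendar. The helper name `find_missing_dates` and the toy data are assumptions for illustration; the real train table would be passed in instead.

```python
import pandas as pd

def find_missing_dates(dates):
    """Return the calendar days between the first and last date that never appear."""
    observed = pd.DatetimeIndex(pd.to_datetime(dates)).normalize().unique()
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return expected.difference(observed)

# Toy example (hypothetical data): 2 January is absent
train = pd.DataFrame({"date": ["2013-01-01", "2013-01-03", "2013-01-04", "2013-01-03"]})
missing = find_missing_dates(train["date"])
print(list(missing.strftime("%Y-%m-%d")))
```

An empty result would mean the timeline has no gaps between its first and last day.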
2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?
3. Compare the sales for each month across the years and determine which month of which year had the highest sales.
4. Did the earthquake impact sales?
5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)
6. Are sales affected by promotions, oil prices and holidays?
Oil prices and sales show a weak negative correlation (-0.075), indicating a slight decline in sales as oil prices rise. However, the relationship is not robust, and other factors may be involved.
Sales are highest on regular holidays and lowest on Bridge holidays.
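The oil-price figure above is a correlation coefficient. Assuming it is a Pearson correlation between the daily oil price and daily sales (the article does not specify), it could be computed along these lines. The column names `oil_price` and `sales` and the values are hypothetical, and forward-filling missing oil prices is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical daily data; oil prices are often missing on non-trading days
daily = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=6, freq="D"),
    "oil_price": [37.0, 36.5, None, 35.9, 36.8, 38.1],
    "sales": [510.0, 530.0, 525.0, 560.0, 520.0, 495.0],
})

# Carry the last known price forward so every day has a value (an assumption)
daily["oil_price"] = daily["oil_price"].ffill()

# Series.corr uses the Pearson coefficient by default
r = daily["oil_price"].corr(daily["sales"])
print(f"Pearson r = {r:.3f}")
```

A coefficient close to zero, such as the one reported above, indicates that oil price alone explains very little of the day-to-day variation in sales.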
7. What analysis can we get from the date and its extractable features?
December and July have the highest sales.
Saturday and Sunday have the highest sales.
The 31st day of the month has the lowest recorded sales.
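Findings like these come from breaking the date into calendar components and aggregating sales over each one. A minimal sketch with pandas follows; the helper `add_date_features`, the column names, and the sample rows are assumptions for illustration.

```python
import pandas as pd

def add_date_features(df, date_col="date"):
    """Return a copy of df with calendar features derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day_of_month"] = dates.dt.day
    out["day_of_week"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = dates.dt.dayofweek >= 5
    return out

# Hypothetical rows: a Saturday and a Monday
sales = pd.DataFrame({"date": ["2016-12-24", "2016-12-26"], "sales": [900.0, 400.0]})
features = add_date_features(sales)

# Average sales per weekday, the kind of summary behind the observations above
print(features.groupby("day_of_week")["sales"].mean())
```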
8. Which product families and stores did the promotions affect?
Grocery I is the most affected product family. The effect is positive, since most sales came from promoted items.
9. Does the payment of public sector wages on the 15th and last day of the month influence store sales?
Sales spike after month-end pay, peaking on day 2 and stabilizing by day 7, with a smaller mid-month bump.
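To study this effect, each date can be tagged with whether it is a payday and how many days have passed since the last one. The sketch below is one way to do that with pandas. The helper `add_payday_features` and its column names are hypothetical, and it assumes one row per day, sorted by date.

```python
import pandas as pd

def add_payday_features(df, date_col="date"):
    """Flag paydays (15th and month end) and count days since the latest payday."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["is_payday"] = (dates.dt.day == 15) | dates.dt.is_month_end
    # Keep the date only on paydays, then carry it forward to the following days
    last_payday = dates.where(out["is_payday"]).ffill()
    # Rows before the first payday in the data have no value (NaN)
    out["days_since_payday"] = (dates - last_payday).dt.days
    return out

# Hypothetical calendar starting on a month-end payday
calendar = pd.DataFrame({"date": pd.date_range("2017-01-31", periods=4, freq="D")})
calendar = add_payday_features(calendar)
print(calendar)
```

Averaging sales by `days_since_payday` would then show the shape of the post-payday spike described above.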
Data Preprocessing and Feature Engineering
We handled missing values after merging and feature creation, added missing dates to ensure a complete timeline, renamed columns for clarity and consistency, and verified correct data types for each column.
Next, we performed feature engineering to extract relevant predictors, standardized the numeric features so that they share a common scale, and encoded the categorical variables.
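As a rough illustration of the scaling and encoding steps, the sketch below applies z-score standardization to a numeric column and one-hot encoding to a categorical one using pandas only. The article does not state which scaler or encoder was used, so both choices are assumptions, as are the column names `store_type` and `promo_items` and the sample values.

```python
import pandas as pd

# Hypothetical feature table
df = pd.DataFrame({
    "store_type": ["A", "B", "A", "C"],
    "promo_items": [0.0, 5.0, 2.0, 1.0],
    "sales": [10.0, 42.0, 25.0, 13.0],
})

numeric_cols = ["promo_items"]

# Z-score standardization: subtract the mean, divide by the standard deviation
means = df[numeric_cols].mean()
stds = df[numeric_cols].std(ddof=0)
df[numeric_cols] = (df[numeric_cols] - means) / stds

# One-hot encode the categorical column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["store_type"], dtype=int)
print(encoded)
```

In a real pipeline the means and standard deviations would be computed on the training portion only and then reused on the test portion, so that no information leaks from the test period.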
Finally, we split the dataset into training and testing sets, preparing it for modeling.
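For time series data, a common choice is to split chronologically, so the model is always tested on dates later than those it was trained on. The article does not say how the split was made; the sketch below assumes a chronological split, and the helper `time_based_split`, the 15-day test window, and the sample data are hypothetical.

```python
import pandas as pd

def time_based_split(df, date_col="date", test_days=15):
    """Hold out the final test_days calendar days as the test set."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(days=test_days)
    train = df[dates <= cutoff]
    test = df[dates > cutoff]
    return train, test

# Hypothetical daily data covering 45 days
data = pd.DataFrame({
    "date": pd.date_range("2017-07-01", periods=45, freq="D"),
    "sales": list(range(45)),
})
train_df, test_df = time_based_split(data, test_days=15)
print(len(train_df), len(test_df))
```

A random shuffle split would let the model see future days during training, which tends to make forecast accuracy look better than it really is.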
Modeling and Evaluation
We used a range of models for sales prediction: the machine learning models XGBoost, Gradient Boosting, Decision Tree, and Linear Regression, and the statistical models SARIMA and ARIMA. Model performance was evaluated using four metrics: root mean squared logarithmic error (RMSLE), root mean squared error (RMSE), mean squared error (MSE), and mean absolute error (MAE).
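The four metrics can be written out directly with NumPy, which makes their definitions explicit. RMSLE is the square root of the mean squared difference between log(1 + predicted) and log(1 + actual), so it penalizes relative rather than absolute errors. The function name `regression_metrics` and the sample numbers are assumptions for this sketch, as is the choice to clip negative forecasts to zero before taking logarithms.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and RMSLE for non-negative targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    # log1p is undefined below -1, so negative forecasts are clipped to zero (an assumption)
    log_diff = np.log1p(np.clip(y_pred, 0, None)) - np.log1p(y_true)
    return {
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(errors))),
        "RMSLE": float(np.sqrt(np.mean(log_diff ** 2))),
    }

# Hypothetical actual and predicted unit sales
scores = regression_metrics([0, 10, 100], [1, 12, 90])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because of the logarithm, missing a sale of 10 units by 2 weighs more in RMSLE than missing a sale of 100 units by 10, which suits data where item volumes differ widely.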
Lower RMSLE values indicate better performance. The Decision Tree Regressor achieved the lowest RMSLE and therefore outperformed the other models. Given its already strong performance, further hyperparameter tuning may not be necessary or yield significant improvements.
CONCLUSION
The Decision Tree Regressor together with the ARIMA & SARIMA models excel in sales prediction, handling diverse regional, store, and item variations. The best model choice depends on business needs and interpretability. This analysis provides valuable insights for retailers to optimize strategies, enhance decision-making, and boost profitability. As the retail landscape evolves, leveraging advanced analytics and machine learning is crucial for competitiveness.