Time Series Analysis: The Case of Corporation Favorita

Forecasting Sales in Retail: A Machine Learning Approach for Corporation Favorita

Introduction

Accurate sales forecasting is crucial in retail for optimizing inventory, staffing, and business strategy. This project develops a machine learning model to predict sales at Favorita stores by analyzing trends, customer behavior, and external factors like holidays and promotions.

Our insights will enable our client to optimize inventory, refine marketing strategies, and enhance profitability. We’ve created both statistical and machine learning models to ensure accuracy and flexibility.

Following the CRISP-DM framework, we leveraged data from marketing and sales teams to develop and validate predictive models.

For technical details, visit my GitHub repository. For a project summary and insights, explore my Power BI dashboard.

CRISP-DM FRAMEWORK

Objective of the Project

The primary goal was to design and deploy machine learning models that accurately predict sales across Corporation Favorita locations, enabling the company to optimize inventory management and enhance profitability.

Goal of the Project

The ultimate goal was to build models that more accurately predict the unit sales for thousands of items sold at different Favorita stores.

Note:

  • Transferred holidays: Officially fall on one day but are celebrated on another (e.g., Independencia de Guayaquil moved from Oct 9 to Oct 12).
  • Bridge days: Extra days added to holidays to extend breaks.
  • Work days: Normally non-working days (e.g., Saturdays) that are worked to make up for bridge days.
  • Additional holidays: Extra days added to regular holidays (e.g., Christmas Eve).

Important context:

  • Public sector wages are paid twice a month, on the 15th and on the last day of the month, which may affect supermarket sales.
  • The 2016 Ecuador earthquake (magnitude 7.8) led to relief efforts and altered supermarket sales for several weeks.

Data Sources

The datasets for this project were extracted from three sources:

1. First Dataset: Extracted from Microsoft SQL Server, this dataset includes three tables:

  • Oil Prices
  • Holiday Events
  • Stores

2. Second Dataset: Downloaded from OneDrive, this dataset includes two tables:

  • Sample Submission
  • Test

3. Third Dataset: Downloaded from a GitHub repository, this dataset includes two tables:

  • Train
  • Transactions

Hypothesis Testing: Impact of Promotional Activities on Sales

To understand the effectiveness of promotional activities, we conducted a hypothesis test on the data. Our hypotheses were:

  • Null Hypothesis (H0): Promotional activities do not have a significant impact on sales.
  • Alternate Hypothesis (H1): Promotional activities have a significant impact on sales.

The test showed a statistically significant relationship between promotional activity and Favorita store sales, so we rejected the null hypothesis. Sales during promotional periods were consistently higher than sales during non-promotional periods, which underscores the role of well-planned promotions in driving sales growth and profitability. The accompanying chart reinforces this finding: it shows a pronounced rise in sales during promotional periods.
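The article does not name the specific test that was run, so the following is only a minimal sketch of one way to test H0 against H1: a two-sided permutation test on the difference in mean sales between promoted and non-promoted days. The function name, the sample figures, and the 5% threshold are all assumptions made for illustration; the numbers are invented and are not Favorita data.

```python
import random
from statistics import mean


def permutation_test(promo, no_promo, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in mean sales.

    Under H0 the promotion label carries no information, so shuffling
    the labels should produce differences as large as the observed one
    reasonably often. A small p-value says it does not.
    """
    rng = random.Random(seed)
    observed = mean(promo) - mean(no_promo)
    pooled = list(promo) + list(no_promo)
    n_promo = len(promo)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_promo]) - mean(pooled[n_promo:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Invented daily sales figures, for illustration only.
promo_sales = [120.0, 135.0, 150.0, 142.0, 160.0, 155.0, 138.0, 149.0]
regular_sales = [90.0, 85.0, 100.0, 95.0, 88.0, 92.0, 97.0, 91.0]

diff, p = permutation_test(promo_sales, regular_sales)
print(f"mean difference = {diff:.2f}, p-value = {p:.4f}")
if p < 0.05:
    print("Reject H0 at the 5% level")
```

A permutation test makes no normality assumption, which suits skewed retail sales data. A Welch t-test or a Mann-Whitney U test would be reasonable alternatives.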

Data Analysis and Model Building

The project involved several steps: data cleaning, feature engineering, model selection, and validation. The data analysis focused on aligning sales data with promotional periods to evaluate their impact. We then built and validated machine learning models to forecast product sales.

Analytical Questions

We’ve crafted targeted questions to explore how the corporation’s sales respond to various influences. By examining specific operational and market factors, these questions aim to uncover insights into the business landscape.

1. Is the train dataset complete (has all the required dates)?

2. Which dates have the lowest and highest sales for each year (excluding days the store was closed)?

3. Compare the sales for each month across the years and determine which month of which year had the highest sales.

4. Did the earthquake impact sales?

5. Are certain stores or groups of stores selling more products? (Cluster, city, state, type)

6. Are sales affected by promotions, oil prices and holidays?

Oil prices and sales show a weak negative correlation (-0.075), indicating a slight decline in sales as oil prices rise. However, the relationship is not robust, and other factors may be involved.

Sales are highest on regular holidays and lowest on bridge days.

7. What analysis can we get from the date and its extractable features?

December and July have the highest sales.

Saturday and Sunday have the highest sales of any days of the week.

The 31st day of the month has the lowest recorded sales.

8. Which product families and stores did the promotions affect?

Grocery I is the most affected product family. The effect is positive, since most of its sales came from promoted items.

9. Does the payment of public sector wages on the 15th and the last day of the month influence store sales?

Sales spike after month-end pay, peaking on day 2 and stabilizing by day 7, with a smaller mid-month bump.
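The date-based findings in questions 7 and 9 suggest features that can be derived from the calendar alone. The sketch below shows one possible way to build them with Python's standard library. The function and feature names are our own illustrative choices, not taken from the project code.

```python
import calendar
from datetime import date, timedelta


def is_payday(d: date) -> bool:
    """True on the 15th and on the last day of the month."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.day == 15 or d.day == last_day


def days_since_payday(d: date) -> int:
    """Days elapsed since the most recent payday (0 on a payday itself)."""
    n = 0
    while not is_payday(d - timedelta(days=n)):
        n += 1
    return n


def date_features(d: date) -> dict:
    """Calendar features for one date; the names are hypothetical."""
    return {
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.weekday(),  # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,
        "is_payday": is_payday(d),
        "days_since_payday": days_since_payday(d),
    }


print(date_features(date(2017, 1, 2)))
```

A feature such as days_since_payday lets a model pick up the post-payday spike and decay directly, rather than having to infer it from the day of the month.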

Data Preprocessing and Feature Engineering

We handled missing values after merging and feature creation, added missing dates to ensure a complete timeline, renamed columns for clarity and consistency, and verified correct data types for each column.
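As a rough illustration of the completeness check behind the "added missing dates" step, the sketch below lists every calendar day that is absent between the first and last observed date. It assumes the dates are available as datetime.date objects. The helper name and the toy dates are ours, not the project's.

```python
from datetime import date, timedelta


def missing_dates(observed_dates):
    """Return the calendar days absent between the first and last observed date."""
    observed = set(observed_dates)
    start, end = min(observed), max(observed)
    n_days = (end - start).days + 1
    full_range = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in full_range if d not in observed]


# Toy timeline with one gap, for illustration only.
dates = [date(2016, 12, 23), date(2016, 12, 24), date(2016, 12, 26)]
gaps = missing_dates(dates)
print(gaps)
```

Each date returned can then be added back as a row. Whether to fill its sales with zero (store closed) or leave it missing is a modeling decision that depends on why the date was absent.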

Next, we engineered additional features, standardized the numeric features so that they share a common scale, and encoded the categorical variables.
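The article does not say which scaler or encoder was used. As a minimal sketch under that caveat, the code below shows the two ideas in their simplest form: z-score standardization and one-hot encoding. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Map each label to a 0/1 indicator vector with one slot per category."""
    categories = sorted(set(labels))
    encoded = [[int(label == c) for c in categories] for label in labels]
    return categories, encoded


# Invented values, for illustration only.
scaled = standardize([10.0, 20.0, 30.0])
categories, encoded = one_hot(["A", "C", "A", "B"])
print(scaled)
print(categories, encoded)
```

In a real pipeline the mean and standard deviation should be computed on the training set only and then reused on the test set, so that no information leaks from the test period.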

Finally, we split the dataset into training and testing sets, preparing it for modeling.
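The article does not describe how the split was made. For time series, a common choice is a chronological split, so that the model is always evaluated on dates later than those it was trained on. The sketch below assumes that approach. The row layout, cutoff date, and function name are illustrative.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Rows dated before `cutoff` go to train; rows on or after it go to test."""
    ordered = sorted(rows, key=lambda r: r["date"])
    train = [r for r in ordered if r["date"] < cutoff]
    test = [r for r in ordered if r["date"] >= cutoff]
    return train, test


# Invented rows, for illustration only.
rows = [
    {"date": date(2017, 8, 2), "sales": 12.0},
    {"date": date(2017, 7, 31), "sales": 9.5},
    {"date": date(2017, 8, 1), "sales": 11.0},
]
train, test = chronological_split(rows, cutoff=date(2017, 8, 2))
print(len(train), len(test))
```

A random shuffle-and-split would let the model see future observations during training, which tends to make test scores look better than real forecasting performance.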

Modeling and Evaluation

We used a range of models for sales prediction: XGBoost, Gradient Boosting, Decision Tree, and Linear Regression, along with the statistical models ARIMA and SARIMA. We evaluated model performance with four metrics: root mean squared logarithmic error (RMSLE), root mean squared error (RMSE), mean squared error (MSE), and mean absolute error (MAE).

Lower RMSLE values indicate better performance. The Decision Tree Regressor achieved the lowest RMSLE and therefore outperformed the other models. Given this result, further hyperparameter tuning may yield only limited improvement.
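For reference, the four metrics can be written in a few lines each. This is a plain-Python sketch of the standard definitions, not the project's evaluation code. RMSLE here uses log(1 + x), so it assumes non-negative actual and predicted values.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root mean squared logarithmic error; inputs must be non-negative."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(actual))


# Invented values, for illustration only.
actual = [100.0, 0.0, 25.0]
predicted = [110.0, 1.0, 20.0]
print(mse(actual, predicted), rmse(actual, predicted))
print(mae(actual, predicted), rmsle(actual, predicted))
```

Because RMSLE compares logarithms, it measures relative rather than absolute error. That makes it a sensible headline metric when unit sales range from near zero to the thousands across items and stores.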

CONCLUSION

The Decision Tree Regressor, together with the ARIMA and SARIMA models, performed best at sales prediction, handling variation across regions, stores, and items. The best model choice depends on business needs and interpretability. This analysis gives retailers insights they can use to optimize strategy, improve decision-making, and boost profitability. As the retail landscape evolves, advanced analytics and machine learning remain crucial for staying competitive.
