The Machine Learning Approach to Regression Analysis: Stores Sales Time Series

Maanenyi Nyande

Food Safety | ISO 22000, HACCP| Biochemist | Quality Data Analyst | Co-founder at DietRight Clinical Consult | Data-driven Insights

Published: February 5, 2024

In the dynamic realm of retail, accurate sales forecasting is the cornerstone for effective inventory management, ensuring product availability, and ultimately driving revenue growth.

This project delves into the vast dataset encompassing the sales records of 54 Favorita grocery stores in Ecuador, with the objective of constructing a robust machine learning model for demand and sales prediction.

Exploring the Sales Landscape

Assessing Overall Trends

Our journey begins with a comprehensive exploratory analysis aimed at deciphering the intricate sales landscape. We scrutinize overall sales trends across the multitude of products and stores, seeking patterns that can guide our forecasting model.

Impact of Promotions

Promotions play a pivotal role in influencing consumer behavior. Through meticulous analysis, we unravel the impact of promotions on sales, identifying trends that allow us to enhance the accuracy of our forecasting model.

Impact of Earthquake

External events, such as the 2016 earthquake, can significantly impact sales dynamics. We investigate how such events drive fluctuations in demand, providing insights that improve the robustness of our predictive model.

Variations Across Stores

Not all stores are created equal. We examine variations in sales patterns across the 54 Favorita grocery stores, identifying store-specific trends that inform our forecasting model.

Crafting the Foundation: Feature Engineering

To prepare our data for modeling, we delve into feature engineering, a crucial phase in constructing an effective machine learning model.

Lag Features

A lagged version of a variable refers to its value at a previous time step. In time series forecasting, creating lag features involves using past observations of a variable as input features for predicting future values. We created new columns that contain the sales values from previous time steps.

#Lagged versions of the 'sales' variable (1 and 7 time steps back)
data['sales_lag_1'] = data['sales'].shift(1)
data['sales_lag_7'] = data['sales'].shift(7)

#A lagged version of the 'Week' column
data['week_lag_1'] = data['Week'].shift(1)
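When several stores are stacked in one table, a plain shift can pull the previous row from a different store. The following is a minimal sketch of lagging within each store, assuming hypothetical column names (store_nbr, date, sales) and toy values rather than the project's actual data:

```python
import pandas as pd

# Toy data: two stores, three days each (values are illustrative only)
toy = pd.DataFrame({
    "store_nbr": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"] * 2),
    "sales": [10.0, 12.0, 11.0, 100.0, 90.0, 95.0],
})

# Sort so that shift() moves along time within each store
toy = toy.sort_values(["store_nbr", "date"]).reset_index(drop=True)

# Lag within each store, so store 2 never inherits store 1's last value
toy["sales_lag_1"] = toy.groupby("store_nbr")["sales"].shift(1)

print(toy)
```

The first row of each store has no earlier observation, so its lag is NaN; those rows need to be dropped or filled before modeling.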

Rolling Average Features

Rolling averages, also known as moving averages, are a statistical technique used to smooth out fluctuations in data over time. They are commonly used in time series analysis to identify patterns, trends, or changes in data. The rolling average smooths out the daily fluctuations and provides a clearer view of the underlying trend in the data. It is often used to identify long-term patterns or changes in data over time.

We calculated the rolling averages for the 'sales' and 'transactions' columns using a window size of 7. For each row, this averages the current value and the six preceding values, so the first six rows are NaN.

#Calculate rolling averages for 'sales' and 'transactions'
#With a window size of 7
window = 7

data['sales_rolling_avg'] = data['sales'].rolling(window=window).mean()
data['transactions_rolling_avg'] = data['transactions'].rolling(window=window).mean()
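Because a rolling window includes the current row, a rolling average of the target can leak the value being predicted. One common way around this, sketched here on a toy series (not the project's data), is to shift by one step before rolling so that each row only sees earlier observations:

```python
import pandas as pd

# Toy series standing in for daily sales
sales = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

window = 7

# Includes the current observation: row i averages rows i-6..i
rolling_incl = sales.rolling(window=window).mean()

# Shift first, so row i averages rows i-7..i-1 (past values only)
rolling_past = sales.shift(1).rolling(window=window).mean()

print(pd.DataFrame({"sales": sales,
                    "rolling_incl": rolling_incl,
                    "rolling_past": rolling_past}))
```

Whether the shifted variant is appropriate depends on what is known at prediction time; this is an illustration, not necessarily the approach used in the project.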

Extracting Date Attributes

Temporal patterns often hold the key to sales forecasting. We extract relevant date attributes, allowing our model to capture the nuances of daily, monthly, and yearly variations in sales.

#Indexing the date column
data = data.set_index('date')
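Once the date is the index, calendar attributes can be read directly from it. The sketch below assumes the index is a pandas DatetimeIndex; the derived column names (year, month, day, day_of_week, is_weekend) are illustrative choices, not taken from the project:

```python
import pandas as pd

# Toy frame indexed by date (values are illustrative only)
toy = pd.DataFrame(
    {"sales": [5.0, 7.0, 6.0]},
    index=pd.to_datetime(["2016-04-15", "2016-04-16", "2016-04-17"]),
)
toy.index.name = "date"

# Calendar attributes derived from the DatetimeIndex
toy["year"] = toy.index.year
toy["month"] = toy.index.month
toy["day"] = toy.index.day
toy["day_of_week"] = toy.index.dayofweek  # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["day_of_week"] >= 5

print(toy)
```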

Merging External Datasets

Augmenting our sales data with external datasets enriches our feature set. This step enhances the model's ability to grasp the multifaceted factors influencing sales.

#Merging interpolated train and interpolated oil datasets
Train = train_merged_interpolated.merge(oil_merged_interpolated, how='inner', on='date')
Train.head()
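An inner join keeps only dates present in both tables, so sales rows on dates without an oil record are dropped. As a hedged illustration of the alternative, the sketch below uses a left join and then interpolates the gap; the column name oil_price and all values are hypothetical:

```python
import pandas as pd

# Toy sales and oil tables; oil is missing 2017-01-04
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05"]),
    "sales": [20.0, 22.0, 19.0, 25.0],
})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-05"]),
    "oil_price": [50.0, 52.0, 56.0],
})

# Inner join: the 2017-01-04 sales row disappears
inner = sales.merge(oil, how="inner", on="date")

# Left join: every sales row survives, with NaN where oil is missing
left = sales.merge(oil, how="left", on="date")

# Fill the gap by linear interpolation between neighbouring rows
left["oil_price"] = left["oil_price"].interpolate(method="linear")

print(len(inner), len(left))
print(left)
```

Comparing row counts before and after a merge is a quick way to confirm that no sales records were lost unintentionally.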

Handling Missing Values

Clean data is essential for reliable predictions. We employ strategies to handle missing values, ensuring the completeness and accuracy of our dataset.

#Checking for missing values in the merged dataset
Train.isna().sum()

#Changing the data type of 'onpromotion' in the Train DataFrame to float
Train['onpromotion'] = Train['onpromotion'].astype(float)

#Converting 'onpromotion' to integer (this raises an error if any NaN values remain)
Train['onpromotion'] = Train['onpromotion'].astype(int)

Building and Evaluating Models

With our dataset prepared, we transition into the modeling phase, leveraging various techniques to construct a forecasting model.

Checking Sales Data Stationarity

Many classical time series models, including the ARMA family, assume a stationary series: one whose mean and variance do not change over time. We therefore test our sales data for stationarity before modeling.

Checking for Stationarity Using KPSS

Null Hypothesis : Series is Stationary

Alternative Hypothesis : Series is not Stationary

from statsmodels.tsa.stattools import kpss

kpss_test = kpss(sales_df['sales'])
kpss_df = pd.DataFrame({"Metric":["Test Statistic","p-value","No. of lags used"],"Values":[kpss_test[0],kpss_test[1],kpss_test[2]]})
kpss_df

Applying ARIMA for Forecasting

Time series forecasting often involves specialized models. We employ ARIMA (AutoRegressive Integrated Moving Average) to capture time-dependent patterns in our sales data.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error

#Model Evaluation
mse = mean_squared_error(y_eval, AR_model_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_eval, AR_model_pred)

#MSLE is undefined for negative values, so take absolute values of y_eval and the predictions
y_eval_abs = abs(y_eval)
y_pred_abs = abs(AR_model_pred)

#Calculate MSLE, then take its square root to get RMSLE
msle = mean_squared_log_error(y_eval_abs, y_pred_abs)
rmsle = np.sqrt(msle)

#Combining the evaluation metrics
results = pd.DataFrame([['AR_model', mse, msle, rmse, rmsle, mae]],
                       columns=['Model', 'MSE', 'MSLE', 'RMSE', 'RMSLE', 'MAE'])
results
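To make the metric definitions concrete, the sketch below computes them with NumPy alone. The helper name forecast_metrics is hypothetical, and it clips negative predictions to zero instead of taking absolute values; that is a different design choice from the code above, made here on the assumption that sales cannot be negative:

```python
import numpy as np


def forecast_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and RMSLE for non-negative targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    errors = y_true - y_pred

    # Assumption: negative forecasts are treated as zero sales for the log metric
    y_pred_clipped = np.clip(y_pred, 0, None)
    log_errors = np.log1p(y_true) - np.log1p(y_pred_clipped)

    return {
        "MSE": float(np.mean(errors ** 2)),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        "RMSLE": float(np.sqrt(np.mean(log_errors ** 2))),
    }


metrics = forecast_metrics([1.0, 3.0], [1.0, 1.0])
print(metrics)
```

Because RMSLE works on log(1 + y), it penalizes relative errors and is less dominated by a few very large sales values than RMSE.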

Using Regression Algorithms

Regression algorithms provide a broader perspective on sales prediction. By incorporating regression techniques, we refine our forecasting model, enhancing its predictive power.

The Craft of Feature Engineering (Continued)

Our journey through time series analysis delves deeper into the craft of feature engineering, a pivotal aspect that shapes the predictive prowess of our models. Having addressed missing values and established a robust foundation, we now focus on transforming variables to distill essential insights.

Handling Categorical Variables with One-Hot Encoding

Categorical variables are a common feature in datasets, and their effective representation is crucial for model performance. One-Hot Encoding is a powerful technique employed to convert categorical variables into numerical form, ensuring compatibility with machine learning algorithms.

#Apply One-Hot Encoding to categorical variables 
df_encoded = pd.get_dummies(df_train, columns=['item_nbr', 'store_nbr', 'holiday_type'])

The resulting `df_encoded` DataFrame now incorporates binary columns for each category, enhancing the model's ability to discern patterns within these variables.
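As a small illustration of what the encoded frame looks like, the sketch below applies get_dummies to a toy table. The values are made up, and dtype=int is an optional choice that yields 0/1 columns instead of booleans:

```python
import pandas as pd

# Toy data with two categorical columns (values are illustrative only)
toy = pd.DataFrame({
    "store_nbr": [1, 2, 1],
    "holiday_type": ["Holiday", "Work Day", "Holiday"],
    "sales": [10.0, 7.5, 12.0],
})

# One binary column per category; the original categorical columns are dropped
encoded = pd.get_dummies(toy, columns=["store_nbr", "holiday_type"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Columns with many distinct values, such as item identifiers, produce one new column per value, so the encoded frame can become very wide.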

Scaling Features for Model Consistency

Feature scaling is another critical step in preparing the dataset for modeling. It ensures that numerical features are on a comparable scale, preventing certain variables from dominating others.

from sklearn.preprocessing import StandardScaler

#Initialize the StandardScaler
scaler = StandardScaler()

#Scale numerical features in a copy, keeping the result a DataFrame
df_scaled = df_encoded.copy()
df_scaled[['sales', 'transactions']] = scaler.fit_transform(df_encoded[['sales', 'transactions']])

Scaling is particularly beneficial for algorithms sensitive to the magnitude of features, promoting consistent and unbiased model performance.
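One detail worth illustrating is where the scaler is fitted. A common practice, sketched below on toy arrays, is to fit on the training portion only and reuse those statistics for the test portion, so that no information from the test set leaks into preprocessing. This is an illustration of that practice, not a description of the project's pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature arrays (values are illustrative only)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```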

Mitigating Outliers for Robust Modeling

Outliers can significantly impact the performance of predictive models, leading to skewed results. Addressing outliers is essential for creating a robust and reliable model.

#Identify and mitigate outliers in the 'sales' column using the interquartile range (IQR)
Q1 = df_scaled['sales'].quantile(0.25)
Q3 = df_scaled['sales'].quantile(0.75)
IQR = Q3 - Q1

#Remove outliers more than 1.5 * IQR below Q1 or above Q3
df_no_outliers = df_scaled[(df_scaled['sales'] >= Q1 - 1.5 * IQR) & (df_scaled['sales'] <= Q3 + 1.5 * IQR)]

This step ensures that extreme values do not unduly influence the model's learning process, leading to more reliable predictions.
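The IQR rule can be packaged as a small reusable function. In the sketch below, iqr_filter is a hypothetical helper name and the data is a toy example with one obvious outlier:

```python
import pandas as pd


def iqr_filter(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


# Toy data: 100.0 is far outside the bulk of the values
toy = pd.DataFrame({"sales": [1.0, 2.0, 3.0, 4.0, 100.0]})
filtered = iqr_filter(toy, "sales")

print(filtered)
```

Here Q1 is 2 and Q3 is 4, so the accepted range is -1 to 7 and only the value 100 is removed. In retail data, extreme values may reflect real demand spikes, so it is worth inspecting them before discarding.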

Strategic Model Application

With our dataset meticulously preprocessed and enriched through feature engineering, we turn our attention to the strategic application of machine learning models.

Linear Regression: Unveiling Linear Relationships

Linear Regression serves as a foundational model in predictive analytics, especially when exploring linear relationships between features and the target variable.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#Split the dataset into training and testing sets
X = df_no_outliers.drop('sales', axis=1)
y = df_no_outliers['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Initialize and train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

#Predictions
y_pred_linear = linear_model.predict(X_test)

#Evaluate the model (RMSE is the square root of MSE)
linear_rmse = mean_squared_error(y_test, y_pred_linear) ** 0.5

Linear Regression offers a transparent view of each feature's contribution: its coefficients show how the predicted sales change as each feature changes.
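A random split mixes past and future rows, which can flatter a forecasting model. As a hedged alternative, the sketch below splits chronologically, training on the first 80% of days and testing on the last 20%. The data is synthetic and the single feature name onpromotion is borrowed only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily data: sales depend linearly on a promotion count plus noise
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", periods=100, freq="D")
toy = pd.DataFrame({"onpromotion": rng.integers(0, 5, size=100)}, index=dates)
toy["sales"] = 10 + 3 * toy["onpromotion"] + rng.normal(scale=0.5, size=100)

# Chronological split: every training day precedes every test day
cutoff = int(len(toy) * 0.8)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

model = LinearRegression().fit(train[["onpromotion"]], train["sales"])
pred = model.predict(test[["onpromotion"]])

# RMSE on the held-out, most recent period
rmse = float(np.sqrt(np.mean((test["sales"].to_numpy() - pred) ** 2)))
print(model.coef_, rmse)
```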

Random Forests: Harnessing Ensemble Learning

Random Forests introduce an ensemble learning approach, leveraging multiple decision trees to enhance predictive accuracy.

from sklearn.ensemble import RandomForestRegressor

#Initialize and train the Random Forests model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

#Predictions
y_pred_rf = rf_model.predict(X_test)

#Evaluate the model (RMSE is the square root of MSE)
rf_rmse = mean_squared_error(y_test, y_pred_rf) ** 0.5

Random Forests excel in capturing complex relationships within the dataset, offering improved accuracy and resilience against overfitting.
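A fitted forest also exposes impurity-based feature importances, which give a rough view of which inputs the trees rely on. The sketch below uses synthetic data with one informative and one noise feature; the feature names and the choice of 50 trees are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'informative' only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 5 * X["informative"] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; higher means more use in splits
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can overstate high-cardinality features, so they are best read as a first indication, not a definitive ranking.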

Visualizing Model Performance

Visualizing the performance of our models is integral to understanding their strengths and limitations. Utilizing tools like matplotlib or seaborn, we can create visualizations that showcase predicted versus actual sales.

These visualizations provide a tangible representation of how well our models align with actual sales data.
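A predicted-versus-actual line plot might look like the sketch below. It uses matplotlib with synthetic series standing in for the real sales and model output, and selects the Agg backend only so that the script also runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove to use an interactive window

import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for actual and predicted daily sales
rng = np.random.default_rng(1)
actual = 50 + 10 * np.sin(np.linspace(0, 6, 60))
predicted = actual + rng.normal(scale=2.0, size=60)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(actual, label="Actual sales")
ax.plot(predicted, label="Predicted sales", linestyle="--")
ax.set_xlabel("Day")
ax.set_ylabel("Sales")
ax.set_title("Predicted vs. actual sales (synthetic example)")
ax.legend()
fig.tight_layout()
```

In a notebook, the figure renders inline; in a script, fig.savefig or plt.show can be used to view it.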

Technical Analysis and Insights

Sales Trends: Yearly seasonality, no long-term rise/fall

Promotions: Higher average sales for promoted items

2016 Earthquake: Boosted sales likely due to demand surge

Regional Differences: One state dominated sales

Seasonal Patterns: Holiday periods show spikes in sales

Conclusion

As we conclude this exploration into time series analysis and sales prediction, the significance of a methodical and data-driven approach becomes evident. From feature engineering techniques to the strategic application of models, each step contributes to a nuanced understanding of the complex dynamics influencing sales forecasting. This article stands as a comprehensive guide, equipping professionals with the knowledge and insights needed to navigate the intricacies of time series analysis in the evolving landscape of business analytics. In the subsequent sections, we will delve into answering analytical questions, presenting visualizations that unravel insights hidden within the data.

Predictive modeling was effective for demand forecasting of Favorita sales data exhibiting complex seasonal and event-related patterns. The incorporation of external factors can further refine estimates. Accurate forecasting facilitates data-driven decisions for supply chain planning, inventory management and marketing campaigns.

As we navigate the intricacies of sales forecasting, this project stands as a testament to the power of data-driven insights in steering businesses towards informed and strategic decision-making.

To follow through with this project, check out my GitHub repository and get started here.

Appreciation

I highly recommend Azubi Africa for their comprehensive and effective programs. Read more articles about Azubi Africa here, and take a few minutes to visit www.azubiafrica.org to learn more about Azubi Africa's life-changing programs.

