The Machine Learning Approach to Regression Analysis: Stores Sales Time Series
Department of Statistics | https://stats.cusat.ac.in/index.php/Ra/details/4


In the dynamic realm of retail, accurate sales forecasting is the cornerstone of effective inventory management: it helps ensure product availability and ultimately drives revenue growth.

This project delves into the vast dataset encompassing the sales records of 54 Favorita grocery stores in Ecuador, with the objective of constructing a robust machine learning model for demand and sales prediction.

Exploring the Sales Landscape


Assessing Overall Trends

Our journey begins with a comprehensive exploratory analysis aimed at deciphering the intricate sales landscape. We scrutinize overall sales trends across the multitude of products and stores, seeking patterns that can guide our forecasting model.

Impact of Promotions

Promotions play a pivotal role in influencing consumer behavior. Through meticulous analysis, we unravel the impact of promotions on sales, identifying trends that allow us to enhance the accuracy of our forecasting model.


Impact of Earthquake

External events, such as the 2016 earthquake, can significantly shift sales dynamics. We investigate how such events contribute to fluctuations in demand, which helps make our predictive model more robust.

Variations Across Stores

Not all stores are created equal. We examine variations in sales patterns across the 54 Favorita grocery stores, identifying store-specific trends that inform our forecasting model.


Crafting the Foundation: Feature Engineering

To prepare our data for modeling, we delve into feature engineering, a crucial phase in constructing an effective machine learning model.

Lag Features

A lagged version of a variable refers to its value at a previous time step. In time series forecasting, creating lag features involves using past observations of a variable as input features for predicting future values. We created new columns that contain the sales values from previous time steps.

#Lagged versions of the 'sales' variable
data['sales_lag_1'] = data['sales'].shift(1)
data['sales_lag_7'] = data['sales'].shift(7)

#A lagged version of the 'Week' column
data['week_lag_1'] = data['Week'].shift(1)
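A plain `shift` treats the whole frame as a single series. If the frame stacks several stores, the lag for the first row of one store would be taken from the last row of another. A minimal sketch of one way to avoid this, assuming the frame carries `store_nbr` and `date` columns; the values are invented for illustration:

```python
import pandas as pd

# Toy data: two stores, three days each (values are illustrative only).
data = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"] * 2),
    "store_nbr": [1, 1, 1, 2, 2, 2],
    "sales": [10.0, 12.0, 11.0, 100.0, 90.0, 95.0],
})

# Sort so that each store's rows are in chronological order.
data = data.sort_values(["store_nbr", "date"]).reset_index(drop=True)

# Shift within each store so a lag never crosses a store boundary.
data["sales_lag_1"] = data.groupby("store_nbr")["sales"].shift(1)

print(data)
```

The first row of each store has no earlier observation, so its lag is left as NaN rather than borrowed from a different store.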

Rolling Average Features

Rolling averages, also known as moving averages, smooth out short-term fluctuations in a series. In time series analysis they are commonly used to reveal the underlying trend and longer-term patterns that daily noise would otherwise obscure.

We calculated the rolling averages for the 'sales' and 'transactions' columns using a window size of 7. For each day, this gives the average of that day and the six days before it; the first six rows have no complete window and are left as missing.


#Calculate rolling averages for 'sales' and 'transactions'
#with a window size of 7
window = 7

data['sales_rolling_avg'] = data['sales'].rolling(window=window).mean()
data['transactions_rolling_avg'] = data['transactions'].rolling(window=window).mean()
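Because each rolling window ends on the current row, the average includes the very value we may later want to predict. One common precaution is to shift the series by one step before rolling, so the feature only sees earlier days. A small sketch on toy values; the column name `sales_rolling_avg_prev` is illustrative and not part of the project:

```python
import pandas as pd

# Ten days of toy sales: 1, 2, ..., 10.
data = pd.DataFrame(
    {"sales": [float(v) for v in range(1, 11)]},
    index=pd.date_range("2017-01-01", periods=10, freq="D"),
)

window = 7

# Window ends on the current day (current day plus the 6 preceding days).
data["sales_rolling_avg"] = data["sales"].rolling(window=window).mean()

# Shift first, so the window covers only the 7 days before the current day.
data["sales_rolling_avg_prev"] = (
    data["sales"].shift(1).rolling(window=window).mean()
)

print(data.tail(4))
```

The shifted version needs one extra day of history before it produces its first value.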

Extracting Date Attributes

Temporal patterns often hold the key to sales forecasting. We extract relevant date attributes, allowing our model to capture the nuances of daily, monthly, and yearly variations in sales.

#Indexing the date column
data = data.set_index('date')        
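Once `date` is the index, calendar attributes can be read directly off a `DatetimeIndex`. A minimal sketch, assuming the dates are parsed to datetimes first; the new column names are illustrative rather than the ones used in the project:

```python
import pandas as pd

data = pd.DataFrame(
    {"date": ["2016-04-15", "2016-04-16", "2016-04-17"], "sales": [5.0, 7.0, 6.0]}
)

# Parse the strings to datetimes; .year, .month etc. need a DatetimeIndex.
data["date"] = pd.to_datetime(data["date"])
data = data.set_index("date")

data["year"] = data.index.year
data["month"] = data.index.month
data["day_of_week"] = data.index.dayofweek   # Monday=0, Sunday=6
data["is_weekend"] = data.index.dayofweek >= 5

print(data)
```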

Merging External Datasets

Augmenting our sales data with external datasets enriches our feature set. This step enhances the model's ability to grasp the multifaceted factors influencing sales.

#Merging interpolated train and interpolated oil datasets
Train = train_merged_interpolated.merge(oil_merged_interpolated, how='inner', on='date')
Train.head()
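An inner join keeps only the dates present in both tables, so any date missing from the oil data silently drops the matching sales rows. The sketch below shows one way to check for this using the `indicator` option of `merge`. The frames and the `oil_price` column are hypothetical stand-ins, not the project's data:

```python
import pandas as pd

train = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "sales": [10.0, 12.0, 11.0],
})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-02", "2017-01-03"]),
    "oil_price": [52.3, 52.8],   # hypothetical column name and values
})

# Inner join: only dates found in both frames survive.
inner = train.merge(oil, how="inner", on="date")

# Left join with an indicator column shows which sales rows had no oil match.
left = train.merge(oil, how="left", on="date", indicator=True)
unmatched = left[left["_merge"] == "left_only"]

print(len(train), len(inner), len(unmatched))
```

Comparing the row counts before and after the join is a quick way to confirm that no sales records were lost unintentionally.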

Handling Missing Values

Clean data is essential for reliable predictions. We employ strategies to handle missing values, ensuring the completeness and accuracy of our dataset.

#Checking for missing values in the merged dataset
Train.isna().sum()

#Changing the data type of 'onpromotion' in the Train DataFrame to float
Train['onpromotion'] = Train['onpromotion'].astype(float)

#Then converting 'onpromotion' to integer counts
Train['onpromotion'] = Train['onpromotion'].astype(int)

Building and Evaluating Models

With our dataset prepared, we transition into the modeling phase, leveraging various techniques to construct a forecasting model.


Checking Sales Data Stationarity

Many classical time series models, ARIMA among them, assume a stationary series, that is, one whose mean and variance do not change over time. We therefore test our sales data for stationarity before modeling.

Checking for stationarity Using KPSS

Null Hypothesis : Series is Stationary

Alternative Hypothesis : Series is not Stationary

from statsmodels.tsa.stattools import kpss

kpss_test = kpss(sales_df['sales'])
kpss_df = pd.DataFrame({"Metric": ["Test Statistics", "p-value", "No. of lags used"],
                        "Values": [kpss_test[0], kpss_test[1], kpss_test[2]]})
kpss_df
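If the KPSS test rejects the null hypothesis of stationarity, a common next step is to difference the series and test again; differencing is also what the "Integrated" part of ARIMA applies. A minimal sketch on a synthetic trending series with a weekly bump, not the project's data:

```python
import numpy as np
import pandas as pd

# Synthetic series: a linear upward trend plus a bump on one day of each week.
t = np.arange(28)
sales = pd.Series(100.0 + 2.0 * t + 5.0 * (t % 7 == 5))

# First difference removes the linear trend: y_t - y_{t-1}.
first_diff = sales.diff().dropna()

# Seasonal difference (lag 7) also cancels the repeating weekly bump.
seasonal_diff = sales.diff(7).dropna()

print(first_diff.mean(), seasonal_diff.unique())
```

After the lag-7 difference, the toy series is constant, which is the idealized version of what differencing aims for on real data.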

Applying ARIMA for Forecasting

Time series forecasting often calls for specialized models. We employ ARIMA (AutoRegressive Integrated Moving Average) to capture time-dependent patterns in our sales data.

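#Model Evaluation
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error

mse = mean_squared_error(y_eval, AR_model_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_eval, AR_model_pred)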

#MSLE takes logarithms, so make both y_eval and the predictions non-negative first
y_eval_abs = abs(y_eval)
y_pred_abs = abs(AR_model_pred)

#Calculate msle before using it
msle = mean_squared_log_error(y_eval_abs, y_pred_abs)
rmsle = np.sqrt(msle)

#Combining the evaluation metrics
results = pd.DataFrame([['AR_model', mse, mae, msle, rmse, rmsle]],
                       columns=['Model', 'MSE', 'MAE', 'MSLE', 'RMSE', 'RMSLE'])
results
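RMSLE compares the logarithms of predicted and actual values, so it penalizes relative rather than absolute errors. Instead of taking absolute values of the forecasts, another option is to clip negative predictions to zero, since sales cannot be negative. The sketch below writes the metric out in plain NumPy with made-up numbers; `rmsle` is a helper defined here for illustration, not a library function:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negative predictions to zero instead of flipping their sign.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

y_eval = [0.0, 10.0, 100.0]
y_hat = [-2.0, 10.0, 100.0]   # one negative forecast

print(rmsle(y_eval, y_hat))
```

With `abs`, the forecast of -2 would be scored as if it were +2; with clipping it is scored as 0, which is the closest valid sales value.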

Using Regression Algorithms

Regression algorithms provide a broader perspective on sales prediction. By incorporating regression techniques, we refine our forecasting model, enhancing its predictive power.

The Craft of Feature Engineering (Continued)

Our journey through time series analysis delves deeper into the craft of feature engineering, a pivotal aspect that shapes the predictive prowess of our models. Having addressed missing values and established a robust foundation, we now focus on transforming variables to distill essential insights.

Handling Categorical Variables with One-Hot Encoding

Categorical variables are a common feature in datasets, and their effective representation is crucial for model performance. One-Hot Encoding is a powerful technique employed to convert categorical variables into numerical form, ensuring compatibility with machine learning algorithms.

#Apply One-Hot Encoding to categorical variables 
df_encoded = pd.get_dummies(df_train, columns=['item_nbr', 'store_nbr', 'holiday_type'])         

The resulting `df_encoded` DataFrame now incorporates binary columns for each category, enhancing the model's ability to discern patterns within these variables.
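To make the effect concrete, here is a small sketch on a toy frame. The values are invented, and `dtype=int` is passed only so the output shows 0/1 regardless of pandas version:

```python
import pandas as pd

df_train = pd.DataFrame({
    "store_nbr": [1, 2, 1],
    "holiday_type": ["Holiday", "Work Day", "Holiday"],
    "sales": [10.0, 20.0, 15.0],
})

# Each listed column is replaced by one 0/1 column per distinct value.
df_encoded = pd.get_dummies(
    df_train, columns=["store_nbr", "holiday_type"], dtype=int
)

print(df_encoded.columns.tolist())
```

A column with many distinct values, such as `item_nbr`, produces one new column per value, so the encoded frame can become very wide.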

Scaling Features for Model Consistency

Feature scaling is another critical step in preparing the dataset for modeling. It ensures that numerical features are on a comparable scale, preventing certain variables from dominating others.

from sklearn.preprocessing import StandardScaler

#Initialize the StandardScaler
scaler = StandardScaler()

#Scale numerical features, keeping the result as a DataFrame
df_scaled = df_encoded.copy()
df_scaled[['sales', 'transactions']] = scaler.fit_transform(df_encoded[['sales', 'transactions']])

Scaling is particularly beneficial for algorithms sensitive to the magnitude of features, promoting consistent and unbiased model performance.
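Standardization subtracts a column's mean and divides by its standard deviation. To keep information from the evaluation period out of training, these two statistics are usually estimated on the training rows only and then reused for the test rows. The sketch below writes the computation out in pandas on invented numbers; `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` here:

```python
import pandas as pd

train = pd.DataFrame({"transactions": [100.0, 200.0, 300.0]})
test = pd.DataFrame({"transactions": [400.0]})

# Estimate the statistics on the training rows only.
mu = train["transactions"].mean()
sigma = train["transactions"].std(ddof=0)  # population std, as StandardScaler uses

# Apply the same mean and standard deviation to both splits.
train_scaled = (train["transactions"] - mu) / sigma
test_scaled = (test["transactions"] - mu) / sigma

print(train_scaled.round(3).tolist(), test_scaled.round(3).tolist())
```

With scikit-learn the equivalent pattern is calling `fit` (or `fit_transform`) on the training split and `transform` on the test split.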

Mitigating Outliers for Robust Modeling

Outliers can significantly impact the performance of predictive models, leading to skewed results. Addressing outliers is essential for building robust and reliable models.

#Identify and mitigate outliers in the 'sales' column using the interquartile range (IQR)
Q1 = df_scaled['sales'].quantile(0.25)
Q3 = df_scaled['sales'].quantile(0.75)
IQR = Q3 - Q1

#Keep only rows within 1.5 * IQR of the quartiles
df_no_outliers = df_scaled[(df_scaled['sales'] >= Q1 - 1.5 * IQR) & (df_scaled['sales'] <= Q3 + 1.5 * IQR)]

This step ensures that extreme values do not unduly influence the model's learning process, leading to more reliable predictions.

Strategic Model Application

With our dataset meticulously preprocessed and enriched through feature engineering, we turn our attention to the strategic application of machine learning models.

Linear Regression: Unveiling Linear Relationships

Linear Regression serves as a foundational model in predictive analytics, especially when exploring linear relationships between features and the target variable.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#Split the dataset into training and testing sets
X = df_no_outliers.drop('sales', axis=1)
y = df_no_outliers['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Initialize and train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

#Predictions
y_pred_linear = linear_model.predict(X_test)

#Evaluate the model: RMSE is the square root of MSE
linear_rmse = mean_squared_error(y_test, y_pred_linear) ** 0.5

Linear Regression offers a transparent view of the model: each coefficient shows the direction and size of a feature's estimated effect on predicted sales.
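`train_test_split` shuffles rows by default, so with time-ordered data the training set can contain days that come after some of the test days. One alternative, not used in the code above, is a chronological hold-out that trains on the earlier dates and evaluates on the most recent ones. A sketch on a toy frame, where the 80/20 cut mirrors `test_size=0.2`:

```python
import pandas as pd

df = pd.DataFrame(
    {"transactions": range(10), "sales": [float(v * 2) for v in range(10)]},
    index=pd.date_range("2017-01-01", periods=10, freq="D"),
).sort_index()

# Use the first 80% of the dates for training and the last 20% for testing.
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

X_train, y_train = train.drop("sales", axis=1), train["sales"]
X_test, y_test = test.drop("sales", axis=1), test["sales"]

print(train.index.max(), test.index.min())
```

This keeps the evaluation closer to how a forecast is used in practice, where only past data is available at prediction time.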

Random Forests: Harnessing Ensemble Learning

Random Forests introduce an ensemble learning approach, leveraging multiple decision trees to enhance predictive accuracy.

from sklearn.ensemble import RandomForestRegressor

#Initialize and train the Random Forests model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)

#Predictions
y_pred_rf = rf_model.predict(X_test)

#Evaluate the model: RMSE is the square root of MSE
rf_rmse = mean_squared_error(y_test, y_pred_rf) ** 0.5

Random Forests can capture non-linear relationships and feature interactions in the dataset. Averaging over many trees typically improves accuracy and reduces overfitting relative to a single decision tree.

Visualizing Model Performance

Visualizing the performance of our models is integral to understanding their strengths and limitations. Utilizing tools like matplotlib or seaborn, we can create visualizations that showcase predicted versus actual sales.


These visualizations provide a tangible representation of how well our models align with actual sales data.

Technical Analysis and Insights

Sales Trends: Yearly seasonality, no long-term rise/fall

Promotions: Higher average sales for promoted items

2016 Earthquake: Boosted sales likely due to demand surge

Regional Differences: One state dominated sales

Seasonal Patterns: Holiday periods show spikes in sales

Conclusion

As we conclude this exploration into time series analysis and sales prediction, the significance of a methodical and data-driven approach becomes evident. From feature engineering techniques to the strategic application of models, each step contributes to a nuanced understanding of the complex dynamics influencing sales forecasting. This article stands as a comprehensive guide, equipping professionals with the knowledge and insights needed to navigate the intricacies of time series analysis in the evolving landscape of business analytics. In the subsequent sections, we will delve into answering analytical questions, presenting visualizations that unravel insights hidden within the data.

Predictive modeling was effective for demand forecasting of Favorita sales data exhibiting complex seasonal and event-related patterns. The incorporation of external factors can further refine estimates. Accurate forecasting facilitates data-driven decisions for supply chain planning, inventory management and marketing campaigns.


As we navigate the intricacies of sales forecasting, this project stands as a testament to the power of data-driven insights in steering businesses towards informed and strategic decision-making.

To follow along with this project, check out my GitHub repository.

Appreciation

I highly recommend Azubi Africa for their comprehensive and effective programs. Visit www.azubiafrica.org to learn more about Azubi Africa's life-changing programs.


