Time Series Forecasting And Analysis Of Store Sales Of Corporation Favorita Products-Regression Findings.
Jabo Justin
Technical Support Engineer at Micro Focus && at Tek-Experts (||Advanced Authentication ||Secure Login||Network Security Products Team),, ||Data Analyst|| Data Engineer|| (BI) Analyst|| Team Leader Manager At Azubi Africa
DESCRIPTION
This project aims to analyze and forecast the sales of a store based on the time series data from Corporation Favorita, a sizable supermarket retailer with headquarters in Ecuador, this research seeks to assess the store’s sales.
The objective is to create a model that can properly forecast future sales using the thousands of products sold at different Favorita locations, in order to assist the management of the store in formulating inventory and sales plans.
We will use sales forecasting to extrapolate future sales levels for a company using historical sales data because Favorita Corporation’s sales over the past four years have generated a lot of data. Our intention is to support business managers in making forecasts about the future based on this information, which has been kept on file for a set period of time following the event or after it occurred.
As part of this research, we will build models using historical analysis, formulate scientific hypotheses based on time-stamped historical data, and then utilize those models to make observations and direct strategic decision-making in the future. In order to enhance operations and ultimately sales, we would like to assist management at Favorita Corporation in gaining some insights from their data.
Methodology:
According to?IBM,?CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts.
The CRISP-DMThe six stages of the data mining lifecycle are as follows:
Business Understanding:
The goal of this project is to increase sales by learning more about long-term sales trends. understand prior events, how they affected sales, what can be done to remedy them, and, if necessary, the next step. This research attempts to examine several regression approaches in order to produce some predictions.
Hypothesis:?The sales of the store are affected by various factors, such as the day of the week, season, promotions, and other external factors. By analyzing these factors and building a time series forecasting model, we can accurately predict the store’s future sales.
Null Hypothesis:?Some products have sold more than others in terms of revenue.
Alternative Hypothesis:?The retail company received an equal amount of revenue from each product.
Questions:
1. Is the train dataset complete (has all the required dates)?
2. Which dates have the lowest and highest sales for each year?
3. Did the earthquake impact sales?
4. Are certain groups of stores selling more products? (Cluster, city, state, type)
5. Are sales affected by promotions, oil prices and holidays?
6. What analysis can we get from the date and its extractable features?
7. What is the RMSLE, RMSE, MSLE, MSE from the ML model?
Our task is to create a model that can more precisely forecast the unit sales of numerous items.
Data Understanding:
Determining the data’s quality and having a general idea of the types of analyses that the data might be subjected to are essential components of any data science endeavor.
For our project, here are the various data sets and their description;
File descriptions as well as data field details:
train.csv
· Dates, store, product, and promotion information, along with sales figures, are all included in the training data. Additional files contain further data that you might find helpful while creating your models. The training data includes goal sales as well as time series for the features store_nbr, family, and onpromotion.
· store_nbr specifies the location of the retailer where the goods are sold.
· The family describes the category of sold goods.
· sales provides the overall sales for a family of products at a specific retailer on a specific day. Since things can be sold in fractional units (1.5 kg of cheese, as opposed to 1 bag of chips, for example), fractional values are possible.
· Onpromotion provides the total number of products in a family that were promoted in a store on a specific date.
test.csv
· Test data with the same characteristics as training data. For the dates in this file, you will forecast the goal sales.
· The dates in the test data are for the period of time that follows the final date in the training data by 15 days.
transaction.csv
· Consists of the date, store_nbr, and transaction for that particular day.
sample_submission.csv
· A submission file example in the appropriate format.
stores.csv
· Save metadata, such as cluster, type, city, and state.
· Cluster refers to a collection of related stores.
oil.csv
· The current day’s oil price, which takes into account prices from both the train and test data timeframes. (Because it depends so heavily on oil, Ecuador’s economy is extremely susceptible to fluctuations in oil prices.)
holidays_events.csv
·?Events and Holidays, with metadata
NOTE:?Focus in particular on the transferred column. A holiday that was shifted by the government officially falls on that day but is observed on a different day. A transferred day resembles a typical day more so than a holiday. Look for the relevant row where type is Transfer to determine the day when it was seen.
For instance, the Guayaquil holiday Independencia de Guayaquil was moved from 2012–10–09 to 2012–10–12, therefore it was observed on that day. Extra days that are added to a holiday (such as to stretch the break across a long weekend) are referred to as “Bridge Days.” These are commonly made up by the type Work Day, which is a day that is intended to pay back the Bridge but is not typically planned for work (such as a Saturday).
· Additional holidays are days that are added to a regular calendar holiday, such as when Christmas Eve is observed as a holiday.
Data Preparation
Data cleaning, exploratory data analysis (univariate and bivariate), computing missing values, feature engineering, etc. are typically included in the data preparation process.
Data Handling
To get started, you’ll need to import some packages: pandas for data manipulation, numpy and matplotlib for visualization, seaborn for styling, sklearn for feature processing and machine learning (including estimators, catboost, and lightGBM), and other packages such as os. All of these can be imported in one line with the following code:
To get started, you’ll need to import a few packages:
·?PANDAS?for data manipulation.
·?NUMPY?and matplotlib for visualization.
·?SEABORN?for styling.
· SKLEARN for feature processing and machine learning (including estimators, catboost, and lightGBM).
· and other packages such as OS.
With the following code, all of these may be imported in a single line:
Import Libraries
# Library for EDA
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.impute import SimpleImputer
from ydata_profiling import ProfileReport
import warnings
warnings.filterwarnings('ignore')
# Import datasets
df_sample_sub = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/sample_submission.csv')
df_stores = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/stores.csv')
df_trans = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/transactions.csv')
df_holi = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/holidays_events.csv')
df_oil = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/oil.csv')
#Loading train & test dataset
领英推荐
df_train = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/train.csv')
df_test = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/test.csv')
For more details on the imported libraries, click the links below:
·?Numpy:
·?Pandas:
·?Seaborn:
Issues with the Data (Data Issues):
Selecting the Right Model:
- We will start with a simple model such as ARIMA or Prophet and evaluate its performance.
- We will perform cross-validation and hyperparameter tuning to fine-tune the model’s parameters and improve its performance.
- If the model does not meet the business requirements, we will reframe the problem by adding more data samples and features or selecting a different algorithm that can better fit the data.
Merge all datasets for further EDA
# combine the datasets on common columns
merged_data = pd.merge(df_train, df_trans, on=['date', 'store_nbr'])
# Merge Holiday data to previous merged data on date column
merged_data2 = pd.merge(merged_data, df_holi, on='date')
# Merge Oil data to previous merged data on date column
merged_data3 = pd.merge(merged_data2, df_oil, on='date')
# Merge Store data to previous merged data on store_nbr column
merged_data4 = pd.merge(merged_data3, df_stores, on='store_nbr')
# Preview Merged data
merged_data4.head()
# Rename columns using the rename method
new_merged_data = merged_data4.rename(columns={"type_x": "holiday_type", "type_y": "store_type"})
# Preview of new merged data - top 10
new_merged_data.head()
# Preview of new merged data - bottom 10
new_merged_data.tail()
new_merged_data['year'].unique()
array([2013, 2014, 2015, 2016, 2017], dtype=int64)
# Datatypes of new merged data
new_merged_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 322047 entries, 0 to 322046
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 322047 non-null int64
1 date 322047 non-null datetime64[ns]
2 store_nbr 322047 non-null int64
3 family 322047 non-null object
4 sales 322047 non-null float64
5 onpromotion 322047 non-null int64
6 transactions 322047 non-null int64
7 holiday_type 322047 non-null object
8 locale 322047 non-null object
9 locale_name 322047 non-null object
10 description 322047 non-null object
11 transferred 322047 non-null bool
12 dcoilwtico 300003 non-null float64
13 city 322047 non-null object
14 state 322047 non-null object
15 store_type 322047 non-null object
16 cluster 322047 non-null int64
dtypes: bool(1), datetime64[ns](1), float64(2), int64(5), object(8)
memory usage: 42.1+ MB
# Inspect data for null values
new_merged_data.isnull().sum()
id 0
date 0
store_nbr 0
family 0
sales 0
onpromotion 0
transactions 0
holiday_type 0
locale 0
locale_name 0
description 0
transferred 0
dcoilwtico 22044
city 0
state 0
store_type 0
cluster 0
dtype: int64
# Preview of shape of new merged data
new_merged_data.shape
(322047, 17)
#change date datatype as datetime to create new features
new_merged_data.date = pd.to_datetime(new_merged_data.date)
new_merged_data['year'] = new_merged_data.date.dt.year
new_merged_data['month'] = new_merged_data.date.dt.month
new_merged_data['dayofmonth'] = new_merged_data.date.dt.day
new_merged_data['dayofweek'] = new_merged_data.date.dt.dayofweek
new_merged_data['dayname'] = new_merged_data.date.dt.strftime('%A')
Answering Questions
1. Is the train dataset complete (has all the required dates)?
# Check for missing values
if df_train.isnull().values.any():
print("The dataset is not complete. There are missing values.")
# Check for missing dates in a time-series dataset
if not df_train.index.is_unique:
print("The dataset is not complete. There are duplicate dates.")
else:
print("The dataset is complete.")
The dataset is complete.
2. Which dates have the lowest and highest sales for each year?
# Group the data by year and get the minimum and maximum sales for each year
grouped_by_year = new_merged_data.groupby("year")["sales"].agg(["min", "max"])
# Get the dates corresponding to the minimum and maximum sales for each year
result = pd.concat([new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "min"]][["year", "date"]].rename(columns={"date": "date_min"}) for year in grouped_by_year.index] +
[new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "max"]][["year", "date"]].rename(columns={"date": "date_max"}) for year in grouped_by_year.index])
# Set the index to be the year
result = result.set_index("year")
# Group the data by year to get the minimum and maximum sales on separate rows
result = result.groupby(level=0).agg({"date_min": "first", "date_max": "first"})
# Reset the index to get a regular dataframe
result = result.reset_index()
print(result)
year date_min date_max
0 2013 2013-01-01 2013-11-12
1 2014 2014-01-01 2014-12-08
2 2015 2015-01-01 2015-11-11
3 2016 2016-02-08 2016-05-02
4 2017 2017-01-02 2017-01-02
# Group the data by year and get the minimum and maximum sales for each year
grouped_by_year = new_merged_data.groupby("year")["sales"].agg(["min", "max"])
# Get the dates corresponding to the minimum and maximum sales for each year
result = pd.concat([new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "min"]][["year", "date"]].rename(columns={"date": "date_min"}) for year in grouped_by_year.index] +
[new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "max"]][["year", "date"]].rename(columns={"date": "date_max"}) for year in grouped_by_year.index])
# Set the index to be the year
result = result.set_index("year")
# Group the data by year to get the minimum and maximum sales on separate rows
result = result.groupby(level=0).agg({"date_min": "first", "date_max": "first"})
# Reset the index to get a regular dataframe
result = result.reset_index()
# Plot the minimum and maximum sales for each year
plt.plot(result["year"], grouped_by_year["min"], label="Minimum Sales")
plt.plot(result["year"], grouped_by_year["max"], label="Maximum Sales")
# Add a legend
plt.legend()
# Add axis labels
plt.xlabel("Year")
plt.ylabel("Sales")
# Show the plot
plt.show()
3. Are certain groups of stores selling more products? (Cluster, city, state, type)
#display random sample of 5 rows
df_stores.sample(5, random_state = 0)
# Plot the number of stores by city
plt.figure(figsize=(10, 5))
sns.countplot(x='city', data=df_stores)
# Add title and labels
plt.title("Number of Stores by City")
plt.xlabel("City")
plt.xticks(rotation=45)
plt.ylabel("Number of Stores")
# Show the plot
plt.show()
# Plot the number of stores by state
plt.figure(figsize=(10, 5))
sns.countplot(x='state', data=df_stores)
# Add title and labels
plt.title("Number of Stores by State")
plt.xlabel("State")
plt.xticks(rotation=45)
plt.ylabel("Number of Stores")
# Show the plot
plt.show()
# Plot the number of stores by type
plt.figure(figsize=(10, 5))
sns.countplot(x='type', data=df_stores)
# Add title and labels
plt.title("Number of Stores by Type")
plt.xlabel("Type")
plt.ylabel("Number of Stores")
# Show the plot
plt.show()
# Plot the number of stores by cluster
plt.figure(figsize=(10, 5))
sns.countplot(x='cluster', data=df_stores)
# Add title and labels
plt.title("Number of Stores by Cluster")
plt.xlabel("Cluster")
plt.ylabel("Number of Stores")
# Show the plot
plt.show()
5. What analysis can we get from the date and its extractable features?
# create a copy of the dataframe
df_train_copy = df_train.copy()
# extract year, quarter, month, day, and weekday information from the date column
df_train_copy['year'] = df_train_copy['date'].dt.year
df_train_copy['quarter'] = df_train_copy['date'].dt.quarter
df_train_copy['month'] = df_train_copy['date'].dt.month
df_train_copy['day'] = df_train_copy['date'].dt.day
df_train_copy['weekday'] = df_train_copy['date'].dt.weekday
# group sales data by year
grouped_by_year = df_train_copy.groupby('year').sum()
# plot the aggregated sales data by year
plt.plot(grouped_by_year.index, grouped_by_year['sales'])
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Sales by Year")
plt.show()
# group sales data by month
grouped_by_month = df_train_copy.groupby('month').sum()
# plot the aggregated sales data by month
plt.bar(grouped_by_month.index, grouped_by_month['sales'])
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Sales by Month")
plt.show()
# group sales data by year
grouped_by_quarter = df_train_copy.groupby('quarter').sum()
# plot the aggregated sales data by quarter
plt.plot(grouped_by_quarter.index, grouped_by_quarter['sales'])
plt.xlabel("quarter")
plt.ylabel("Sales")
plt.title("Sales by Quarter")
plt.show()
7. What is the relationship between oil prices and sales?
# Plot a scatter plot to visualize the relationship between oil prices and sales
plt.scatter(new_merged_data['dcoilwtico'], new_merged_data['sales'])
plt.xlabel('Oil Price')
plt.ylabel('Sales')
plt.title('Relationship between Oil Prices and Sales')
plt.show()
8. What is the relationship between product and sales?
# Group data by product family and sum the sales
grouped_data_1 = new_merged_data.groupby('family').sum()['sales']
# Sort the data by sales
grouped_data_1 = grouped_data_1.sort_values(ascending=False)
# Plot the top 10 product families
sns.barplot(x=grouped_data_1.index[:10], y=grouped_data_1.values[:10])
# Add labels and title
plt.xlabel('Product Family')
plt.ylabel('Sales')
plt.title('Relationship between Product Family and Sales (Top 10)')
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)
# Show the plot
plt.show()
9. What is the trend of sales overtime
# Group data by date and sum the sales
date_group = new_merged_data.groupby("date").sum()
# Plot the sales over time
plt.figure(figsize=(12,5))
plt.plot(date_group.index, date_group["sales"])
plt.title("Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Feature Processing & Engineering
This section is to?clean,?process?the dataset and?create new features.
Drop Duplicates
#checking duplicates in the train data
new_merged_data.duplicated().sum()
0
# Drop the specified columns
new_merged_data = new_merged_data.drop(columns=["year", "month", "dayofmonth", "dayofweek", "dayname"])
new_merged_data
New Features Creation
#change date datatype as datetime to create new features
new_merged_data.date = pd.to_datetime(new_merged_data.date)
new_merged_data['year'] = new_merged_data.date.dt.year
new_merged_data['month'] = new_merged_data.date.dt.month
new_merged_data['dayofmonth'] = new_merged_data.date.dt.day
new_merged_data['dayofweek'] = new_merged_data.date.dt.dayofweek
new_merged_data['dayname'] = new_merged_data.date.dt.strftime('%A')
# Preview data with new features
new_merged_data.head()
Impute Missing Values
from sklearn.impute import SimpleImputer
# create an instance of the SimpleImputer class with mean strategy
imputer = SimpleImputer(strategy='mean')
# fit the imputer to the dcoilwtico column of new_merged_data
imputer.fit(new_merged_data[['dcoilwtico']])
# use the imputer to transform the dcoilwtico column of new_merged_data, replacing missing values with the mean value
new_merged_data['dcoilwtico'] = imputer.transform(new_merged_data[['dcoilwtico']])
# Preview data columns after imputing
new_merged_data.isnull().sum()
id 0
date 0
store_nbr 0
family 0
sales 0
onpromotion 0
transactions 0
holiday_type 0
locale 0
locale_name 0
description 0
transferred 0
dcoilwtico 0
city 0
state 0
store_type 0
cluster 0
year 0
month 0
dayofmonth 0
dayofweek 0
dayname 0
dtype: int64
# Write the DataFrame to a CSV file
new_merged_data.to_csv('new_merged_data.csv', index=False)
#drop unnecessary columns
final_data = new_merged_data.drop(columns=['id','locale', 'locale_name', 'description', 'transferred'], inplace=True)
new_merged_data.head()
# set the date column as the index
new_merged_data.set_index('date', inplace=True)
new_merged_data.head()
# drop more columns
final_data = new_merged_data.drop(columns=['state', 'store_type', 'dayname'], inplace=True)
final_data = new_merged_data.copy()
final_data.head()
# categorizing the products
food_families = ['BEVERAGES', 'BREAD/BAKERY', 'FROZEN FOODS', 'MEATS', 'PREPARED FOODS', 'DELI','PRODUCE', 'DAIRY','POULTRY','EGGS','SEAFOOD']
final_data['family'] = np.where(final_data['family'].isin(food_families), 'FOODS', final_data['family'])
home_families = ['HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES']
final_data['family'] = np.where(final_data['family'].isin(home_families), 'HOME', final_data['family'])
clothing_families = ['LINGERIE', 'LADYSWARE']
final_data['family'] = np.where(final_data['family'].isin(clothing_families), 'CLOTHING', final_data['family'])
grocery_families = ['GROCERY I', 'GROCERY II']
final_data['family'] = np.where(final_data['family'].isin(grocery_families), 'GROCERY', final_data['family'])
stationery_families = ['BOOKS', 'MAGAZINES','SCHOOL AND OFFICE SUPPLIES']
final_data['family'] = np.where(final_data['family'].isin(stationery_families), 'STATIONERY', final_data['family'])
cleaning_families = ['HOME CARE', 'BABY CARE','PERSONAL CARE']
final_data['family'] = np.where(final_data['family'].isin(cleaning_families), 'CLEANING', final_data['family'])
hardware_families = ['PLAYERS AND ELECTRONICS','HARDWARE']
final_data['family'] = np.where(final_data['family'].isin(hardware_families), 'HARDWARE', final_data['family'])
from sklearn.preprocessing import StandardScaler
# create an instance of StandardScaler
scaler = StandardScaler()
# select numerical columns
num_cols = ['sales', 'transactions', 'dcoilwtico', 'year', 'month', 'dayofmonth', 'dayofweek']
# fit and transform the numerical columns
final_data[num_cols] = scaler.fit_transform(final_data[num_cols])
Features Encoding
from sklearn.preprocessing import OneHotEncoder
# Select the categorical columns
categorical_columns = ["family", "city", "holiday_type"]
categorical_data = final_data[categorical_columns]
# Initialize the OneHotEncoder
encoder = OneHotEncoder()
# Fit and transform the data to one hot encoding
one_hot_encoded_data = encoder.fit_transform(categorical_data)
# Get the categories for each column
categories = [encoder.categories_[i] for i in range(len(encoder.categories_))]
# Create the column names for the one hot encoded data
column_names = []
for i in range(len(categories)):
for j in range(len(categories[i])):
column_names.append(f'{categorical_columns[i]}_{categories[i][j]}')
# Convert the one hot encoding data to a DataFrame
one_hot_encoded_data = pd.DataFrame(one_hot_encoded_data.toarray(), columns=column_names)
# Reset the index of both dataframes
final_data = final_data.reset_index(drop=True)
one_hot_encoded_data = one_hot_encoded_data.reset_index(drop=True)
# Concatenate the original dataframe with the one hot encoded data
final_data_encoded = pd.concat([final_data, one_hot_encoded_data], axis=1)
# Drop the original categorical columns
final_data_encoded.drop(categorical_columns, axis=1, inplace=True)
final_data_encoded.head()
#Rename dcoilwtico column to oil price
final_data_encoded.rename(columns={'dcoilwtico':'oil_price'}, inplace=True)
final_data_encoded.head()
# Make a copy of the final_data_encoded as data
data = final_data_encoded.copy()
data.head()
fig, ax = plt.subplots(figsize=(16, 11))
ax.plot(new_merged_data['sales'])
ax.set_xlabel('Time')
ax.set_ylabel('Sales')
fig.autofmt_xdate()
plt.tight_layout()
# Write the DataFrame to a CSV file
data.to_csv('encoded_data.csv', index=False)
Machine Learning Modeling
Here is the section to build, train, evaluate and compare the models to each others.
Evaluating the Model:
- We will use metrics such as mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE) to evaluate the model’s performance.
- We will compare the model’s predictions with the actual sales data to analyze its accuracy and identify any discrepancies.
- If the model’s metrics do not meet the business requirements, we will reframe the problem and try to improve the model’s performance by adding more data samples and features or selecting a different algorithm.
Simple Model #001
Please, keep the following structure to try all the model you want.
Create and Train the Model
# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error
# Split Data to train and Test
from sklearn.model_selection import train_test_split
# Create the feature dataframe using the selected columns
X = data.drop(["sales"], axis=1)
# Get the target variable
y = data.sales
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Linear Regression Model
# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Make prediction on X_test
lr_predictions = lr.predict(X_test)
plt.scatter(y_test, lr_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Linear Regression")
plt.show()
# Evaluation Metrics for Linear Regression
lr_mse = mean_squared_error(y_test, lr_predictions).round(2)
lr_rmse = np.sqrt(lr_mse).round(2)
# apply the absolute value function to y_test to remove negative signs
y_test_abs = abs(y_test)
lr_predictions_abs = abs(lr_predictions)
# calculate the mean squared logarithmic error using the new y_test_abs and lr_predictions_abs array
lr_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, lr_predictions_abs)).round(2)
# Print the evaluation results for Linear Regression model
print("\nEvaluation Results for Linear Regression:")
print("MSE:", lr_mse)
print("RMSE:", lr_rmse)
print("RMSLE:", lr_rmsle)
Evaluation Results for Linear Regression:
MSE: 0.72
RMSE: 0.85
RMSLE: 0.26
Decision Tree Regression Model
# Decision Tree Regression Model
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
# Make prediction on X_test
dt_predictions = dt.predict(X_test)
plt.scatter(y_test, dt_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Decision Tree Regression")
plt.show()
# Evaluation Metrics for Decision Tree Regression
dt_mse = mean_squared_error(y_test, dt_predictions).round(2)
dt_rmse = np.sqrt(dt_mse).round(2)
# apply the absolute value function to y_test to remove negative signs
#y_test_abs = abs(y_test)
dt_predictions_abs = abs(dt_predictions)
# calculate the mean squared logarithmic error using the new y_test_abs and dt_predictions_abs array
dt_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, dt_predictions_abs)).round(2)
# Print the evaluation results for Decision Tree Regression model
print("\nEvaluation Results for Decision Tree Regression:")
print("MSE:", dt_mse)
print("RMSE:", dt_rmse)
print("RMLSE:", dt_rmsle)
Conclusion:
A brief summary of the research findings:
· A technique for examining data that evolves over time is time series analysis. In this project, we examined patterns and trends in the data across time rather than individual data points or values. This was helpful in understanding how sales varied over time and what influences on sales there were.
· We noticed that Pichincha had the most retailers, which led to a rise in sales for Quito, the state with the highest sales.
· We also observed that the earthquake had such a significant impact on sales that there was a sharp increase in sales right before the disaster. Another pattern we observed was that Saturday — and Sunday were the busiest days for sales.
· Time series analysis can be used to look for trends in data across days, weeks, or months in order to understand how the number of visitors to a website varies over time. Using historical data, we could utilize this information to forecast the number of visitors we can anticipate in the future.
· In general, time series analysis is a great tool for comprehending how things change over time and can aid us in improving our future projections.
Note:?By analyzing the store sales time series data and building a forecasting model, we can accurately predict future sales and help the store management plan their inventory and sales strategies. The model’s performance should be evaluated based on its ability to meet the business requirements, and any issues with the data should be addressed appropriately to improve the model’s accuracy.