Time Series Forecasting And Analysis Of Store Sales Of Corporation Favorita Products-Regression Findings.

Jabo Justin

Technical Support Engineer at Micro Focus && at Tek-Experts (||Advanced Authentication ||Secure Login||Network Security Products Team),, ||Data Analyst|| Data Engineer|| (BI) Analyst|| Team Leader Manager At Azubi Africa

发布日期: 2023年5月15日

DESCRIPTION

This project aims to analyze and forecast the sales of a store based on the time series data from Corporation Favorita, a sizable supermarket retailer with headquarters in Ecuador, this research seeks to assess the store’s sales.

The objective is to create a model that can properly forecast future sales using the thousands of products sold at different Favorita locations, in order to assist the management of the store in formulating inventory and sales plans.

We will use sales forecasting to extrapolate future sales levels for a company using historical sales data because Favorita Corporation’s sales over the past four years have generated a lot of data. Our intention is to support business managers in making forecasts about the future based on this information, which has been kept on file for a set period of time following the event or after it occurred.

As part of this research, we will build models using historical analysis, formulate scientific hypotheses based on time-stamped historical data, and then utilize those models to make observations and direct strategic decision-making in the future. In order to enhance operations and ultimately sales, we would like to assist management at Favorita Corporation in gaining some insights from their data.

Methodology:

According to?IBM,?CRISP-DM, which stands for Cross-Industry Standard Process for Data Mining, is an industry-proven way to guide your data mining efforts.

It offers explanations of the common project phases, the tasks related to each phase, and the relationships between these tasks as a methodology.
As a process model,?CRISP-DM?provides a summary of the life cycle of data mining.

The CRISP-DMThe six stages of the data mining lifecycle are as follows:

Business Understanding:

The goal of this project is to increase sales by learning more about long-term sales trends. understand prior events, how they affected sales, what can be done to remedy them, and, if necessary, the next step. This research attempts to examine several regression approaches in order to produce some predictions.

Hypothesis:?The sales of the store are affected by various factors, such as the day of the week, season, promotions, and other external factors. By analyzing these factors and building a time series forecasting model, we can accurately predict the store’s future sales.

Null Hypothesis:?Some products have sold more than others in terms of revenue.

Alternative Hypothesis:?The retail company received an equal amount of revenue from each product.

Questions:

1. Is the train dataset complete (has all the required dates)?

2. Which dates have the lowest and highest sales for each year?

3. Did the earthquake impact sales?

4. Are certain groups of stores selling more products? (Cluster, city, state, type)

5. Are sales affected by promotions, oil prices and holidays?

6. What analysis can we get from the date and its extractable features?

7. What is the RMSLE, RMSE, MSLE, MSE from the ML model?

Our task is to create a model that can more precisely forecast the unit sales of numerous items.

Data Understanding:

Determining the data’s quality and having a general idea of the types of analyses that the data might be subjected to are essential components of any data science endeavor.

For our project, here are the various data sets and their description;

File descriptions as well as data field details:

train.csv

· Dates, store, product, and promotion information, along with sales figures, are all included in the training data. Additional files contain further data that you might find helpful while creating your models. The training data includes goal sales as well as time series for the features store_nbr, family, and onpromotion.

· store_nbr specifies the location of the retailer where the goods are sold.

· The family describes the category of sold goods.

· sales provides the overall sales for a family of products at a specific retailer on a specific day. Since things can be sold in fractional units (1.5 kg of cheese, as opposed to 1 bag of chips, for example), fractional values are possible.

· Onpromotion provides the total number of products in a family that were promoted in a store on a specific date.

test.csv

· Test data with the same characteristics as training data. For the dates in this file, you will forecast the goal sales.

· The dates in the test data are for the period of time that follows the final date in the training data by 15 days.

transaction.csv

· Consists of the date, store_nbr, and transaction for that particular day.

sample_submission.csv

· A submission file example in the appropriate format.

stores.csv

· Save metadata, such as cluster, type, city, and state.

· Cluster refers to a collection of related stores.

oil.csv

· The current day’s oil price, which takes into account prices from both the train and test data timeframes. (Because it depends so heavily on oil, Ecuador’s economy is extremely susceptible to fluctuations in oil prices.)

holidays_events.csv

·?Events and Holidays, with metadata

NOTE:?Focus in particular on the transferred column. A holiday that was shifted by the government officially falls on that day but is observed on a different day. A transferred day resembles a typical day more so than a holiday. Look for the relevant row where type is Transfer to determine the day when it was seen.

For instance, the Guayaquil holiday Independencia de Guayaquil was moved from 2012–10–09 to 2012–10–12, therefore it was observed on that day. Extra days that are added to a holiday (such as to stretch the break across a long weekend) are referred to as “Bridge Days.” These are commonly made up by the type Work Day, which is a day that is intended to pay back the Bridge but is not typically planned for work (such as a Saturday).

· Additional holidays are days that are added to a regular calendar holiday, such as when Christmas Eve is observed as a holiday.

Data Preparation

Data cleaning, exploratory data analysis (univariate and bivariate), computing missing values, feature engineering, etc. are typically included in the data preparation process.

Data Handling

To get started, you’ll need to import some packages: pandas for data manipulation, numpy and matplotlib for visualization, seaborn for styling, sklearn for feature processing and machine learning (including estimators, catboost, and lightGBM), and other packages such as os. All of these can be imported in one line with the following code:

To get started, you’ll need to import a few packages:

·?PANDAS?for data manipulation.

·?NUMPY?and matplotlib for visualization.

·?SEABORN?for styling.

· SKLEARN for feature processing and machine learning (including estimators, catboost, and lightGBM).

· and other packages such as OS.

With the following code, all of these may be imported in a single line:

Import Libraries

# Library for EDA
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from sklearn.impute import SimpleImputer
from ydata_profiling import ProfileReport
import warnings
warnings.filterwarnings('ignore')
# Import datasets

df_sample_sub = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/sample_submission.csv')
df_stores = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/stores.csv')
df_trans = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/transactions.csv')
df_holi = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/holidays_events.csv')
df_oil = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/oil.csv')

#Loading train & test dataset

领英推荐

Analyzing Supply Chain Trends with TexPro: A…

Fibre2Fashion 1 个月前

Demand Forecasting with Predictive Analytics – A…

SDIntent Analytics Pvt. Ltd.(SalesDemand) 6 个月前

Transforming SKU-Level Forecasting with Predictive…

Aegis Graham Bell Award 2 个月前

df_train = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/train.csv')
df_test = pd.read_csv('C:/Users/Dell/Documents/PROJECT AZUBI/Career_Accelerator_LP2-Regression/test.csv')

For more details on the imported libraries, click the links below:

·?Numpy:

·?Pandas:

·?Seaborn:

·?Matplotlib.pyplot:

Issues with the Data (Data Issues):

Missing values: (43 null values in oil data before merge) We will fill the missing values using appropriate methods such as forward fill or backward fill.
More missing values after merging
Outliers: We will identify and remove the outliers by analyzing the data distribution and using appropriate statistical techniques.

Selecting the Right Model:

- We will start with a simple model such as ARIMA or Prophet and evaluate its performance.

- We will perform cross-validation and hyperparameter tuning to fine-tune the model’s parameters and improve its performance.

- If the model does not meet the business requirements, we will reframe the problem by adding more data samples and features or selecting a different algorithm that can better fit the data.

Merge all datasets for further EDA

# combine the datasets on common columns

merged_data = pd.merge(df_train, df_trans, on=['date', 'store_nbr'])
# Merge Holiday data to previous merged data on date column
merged_data2 = pd.merge(merged_data, df_holi, on='date')
# Merge Oil data to previous merged data on date column
merged_data3 = pd.merge(merged_data2, df_oil, on='date')
# Merge Store data to previous merged data on store_nbr column

merged_data4 = pd.merge(merged_data3, df_stores, on='store_nbr')
# Preview Merged data
merged_data4.head()
# Rename columns using the rename method
new_merged_data = merged_data4.rename(columns={"type_x": "holiday_type", "type_y": "store_type"})
# Preview of new merged data - top 10
new_merged_data.head()
# Preview of new merged data - bottom 10
new_merged_data.tail()
new_merged_data['year'].unique()

array([2013, 2014, 2015, 2016, 2017], dtype=int64)
# Datatypes of new merged data
new_merged_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 322047 entries, 0 to 322046
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   id            322047 non-null  int64         
 1   date          322047 non-null  datetime64[ns]
 2   store_nbr     322047 non-null  int64         
 3   family        322047 non-null  object        
 4   sales         322047 non-null  float64       
 5   onpromotion   322047 non-null  int64         
 6   transactions  322047 non-null  int64         
 7   holiday_type  322047 non-null  object        
 8   locale        322047 non-null  object        
 9   locale_name   322047 non-null  object        
 10  description   322047 non-null  object        
 11  transferred   322047 non-null  bool          
 12  dcoilwtico    300003 non-null  float64       
 13  city          322047 non-null  object        
 14  state         322047 non-null  object        
 15  store_type    322047 non-null  object        
 16  cluster       322047 non-null  int64         
dtypes: bool(1), datetime64[ns](1), float64(2), int64(5), object(8)
memory usage: 42.1+ MB
# Inspect data for null values
new_merged_data.isnull().sum()
id                  0
date                0
store_nbr           0
family              0
sales               0
onpromotion         0
transactions        0
holiday_type        0
locale              0
locale_name         0
description         0
transferred         0
dcoilwtico      22044
city                0
state               0
store_type          0
cluster             0
dtype: int64
# Preview of shape of new merged data
new_merged_data.shape
(322047, 17)

#change date datatype as datetime to create new features

new_merged_data.date = pd.to_datetime(new_merged_data.date)


new_merged_data['year'] = new_merged_data.date.dt.year

new_merged_data['month'] = new_merged_data.date.dt.month


new_merged_data['dayofmonth'] = new_merged_data.date.dt.day


new_merged_data['dayofweek'] = new_merged_data.date.dt.dayofweek


new_merged_data['dayname'] = new_merged_data.date.dt.strftime('%A')

Answering Questions

1. Is the train dataset complete (has all the required dates)?

# Check for missing values
if df_train.isnull().values.any():
  print("The dataset is not complete. There are missing values.")

# Check for missing dates in a time-series dataset
if not df_train.index.is_unique:
  print("The dataset is not complete. There are duplicate dates.")
else:
  print("The dataset is complete.")
The dataset is complete.

2. Which dates have the lowest and highest sales for each year?

# Group the data by year and get the minimum and maximum sales for each year
grouped_by_year = new_merged_data.groupby("year")["sales"].agg(["min", "max"])

# Get the dates corresponding to the minimum and maximum sales for each year
result = pd.concat([new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "min"]][["year", "date"]].rename(columns={"date": "date_min"}) for year in grouped_by_year.index] +
                  [new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "max"]][["year", "date"]].rename(columns={"date": "date_max"}) for year in grouped_by_year.index])

# Set the index to be the year
result = result.set_index("year")

# Group the data by year to get the minimum and maximum sales on separate rows
result = result.groupby(level=0).agg({"date_min": "first", "date_max": "first"})

# Reset the index to get a regular dataframe
result = result.reset_index()

print(result)
year   date_min   date_max
0  2013 2013-01-01 2013-11-12
1  2014 2014-01-01 2014-12-08
2  2015 2015-01-01 2015-11-11
3  2016 2016-02-08 2016-05-02
4  2017 2017-01-02 2017-01-02
# Group the data by year and get the minimum and maximum sales for each year
grouped_by_year = new_merged_data.groupby("year")["sales"].agg(["min", "max"])

# Get the dates corresponding to the minimum and maximum sales for each year
result = pd.concat([new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "min"]][["year", "date"]].rename(columns={"date": "date_min"}) for year in grouped_by_year.index] +
                  [new_merged_data[new_merged_data["sales"] == grouped_by_year.loc[year, "max"]][["year", "date"]].rename(columns={"date": "date_max"}) for year in grouped_by_year.index])

# Set the index to be the year
result = result.set_index("year")

# Group the data by year to get the minimum and maximum sales on separate rows
result = result.groupby(level=0).agg({"date_min": "first", "date_max": "first"})

# Reset the index to get a regular dataframe
result = result.reset_index()

# Plot the minimum and maximum sales for each year
plt.plot(result["year"], grouped_by_year["min"], label="Minimum Sales")
plt.plot(result["year"], grouped_by_year["max"], label="Maximum Sales")

# Add a legend
plt.legend()

# Add axis labels
plt.xlabel("Year")
plt.ylabel("Sales")

# Show the plot
plt.show()

3. Are certain groups of stores selling more products? (Cluster, city, state, type)

#display random sample of 5 rows

df_stores.sample(5, random_state = 0)
# Plot the number of stores by city
plt.figure(figsize=(10, 5))
sns.countplot(x='city', data=df_stores)

# Add title and labels
plt.title("Number of Stores by City")
plt.xlabel("City")
plt.xticks(rotation=45)
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

# Plot the number of stores by state
plt.figure(figsize=(10, 5))
sns.countplot(x='state', data=df_stores)

# Add title and labels
plt.title("Number of Stores by State")
plt.xlabel("State")
plt.xticks(rotation=45)
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

# Plot the number of stores by type
plt.figure(figsize=(10, 5))
sns.countplot(x='type', data=df_stores)

# Add title and labels
plt.title("Number of Stores by Type")
plt.xlabel("Type")
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

# Plot the number of stores by cluster
plt.figure(figsize=(10, 5))
sns.countplot(x='cluster', data=df_stores)

# Add title and labels
plt.title("Number of Stores by Cluster")
plt.xlabel("Cluster")
plt.ylabel("Number of Stores")

# Show the plot
plt.show()

5. What analysis can we get from the date and its extractable features?

# create a copy of the dataframe
df_train_copy = df_train.copy()


# extract year, quarter, month, day, and weekday information from the date column
df_train_copy['year'] = df_train_copy['date'].dt.year
df_train_copy['quarter'] = df_train_copy['date'].dt.quarter
df_train_copy['month'] = df_train_copy['date'].dt.month
df_train_copy['day'] = df_train_copy['date'].dt.day
df_train_copy['weekday'] = df_train_copy['date'].dt.weekday

# group sales data by year
grouped_by_year = df_train_copy.groupby('year').sum()

# plot the aggregated sales data by year
plt.plot(grouped_by_year.index, grouped_by_year['sales'])
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Sales by Year")
plt.show()

# group sales data by month
grouped_by_month = df_train_copy.groupby('month').sum()

# plot the aggregated sales data by month
plt.bar(grouped_by_month.index, grouped_by_month['sales'])
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Sales by Month")
plt.show()


# group sales data by year
grouped_by_quarter = df_train_copy.groupby('quarter').sum()

# plot the aggregated sales data by quarter
plt.plot(grouped_by_quarter.index, grouped_by_quarter['sales'])
plt.xlabel("quarter")
plt.ylabel("Sales")
plt.title("Sales by Quarter")
plt.show()

7. What is the relationship between oil prices and sales?

# Plot a scatter plot to visualize the relationship between oil prices and sales
plt.scatter(new_merged_data['dcoilwtico'], new_merged_data['sales'])
plt.xlabel('Oil Price')
plt.ylabel('Sales')
plt.title('Relationship between Oil Prices and Sales')
plt.show()

8. What is the relationship between product and sales?

# Group data by product family and sum the sales
grouped_data_1 = new_merged_data.groupby('family').sum()['sales']

# Sort the data by sales
grouped_data_1 = grouped_data_1.sort_values(ascending=False)

# Plot the top 10 product families
sns.barplot(x=grouped_data_1.index[:10], y=grouped_data_1.values[:10])

# Add labels and title
plt.xlabel('Product Family')
plt.ylabel('Sales')
plt.title('Relationship between Product Family and Sales (Top 10)')
# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.show()

9. What is the trend of sales overtime

# Group data by date and sum the sales
date_group = new_merged_data.groupby("date").sum()

# Plot the sales over time
plt.figure(figsize=(12,5))
plt.plot(date_group.index, date_group["sales"])
plt.title("Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Feature Processing & Engineering

This section is to?clean,?process?the dataset and?create new features.

Drop Duplicates

#checking duplicates in the train data

new_merged_data.duplicated().sum()
0
# Drop the specified columns
new_merged_data = new_merged_data.drop(columns=["year", "month", "dayofmonth", "dayofweek", "dayname"])
new_merged_data

New Features Creation

#change date datatype as datetime to create new features

new_merged_data.date = pd.to_datetime(new_merged_data.date)


new_merged_data['year'] = new_merged_data.date.dt.year

new_merged_data['month'] = new_merged_data.date.dt.month


new_merged_data['dayofmonth'] = new_merged_data.date.dt.day


new_merged_data['dayofweek'] = new_merged_data.date.dt.dayofweek


new_merged_data['dayname'] = new_merged_data.date.dt.strftime('%A')
# Preview data with new features
new_merged_data.head()

Impute Missing Values

from sklearn.impute import SimpleImputer

# create an instance of the SimpleImputer class with mean strategy
imputer = SimpleImputer(strategy='mean')

# fit the imputer to the dcoilwtico column of new_merged_data
imputer.fit(new_merged_data[['dcoilwtico']])

# use the imputer to transform the dcoilwtico column of new_merged_data, replacing missing values with the mean value
new_merged_data['dcoilwtico'] = imputer.transform(new_merged_data[['dcoilwtico']])
# Preview data columns after imputing
new_merged_data.isnull().sum()
id              0
date            0
store_nbr       0
family          0
sales           0
onpromotion     0
transactions    0
holiday_type    0
locale          0
locale_name     0
description     0
transferred     0
dcoilwtico      0
city            0
state           0
store_type      0
cluster         0
year            0
month           0
dayofmonth      0
dayofweek       0
dayname         0
dtype: int64
# Write the DataFrame to a CSV file
new_merged_data.to_csv('new_merged_data.csv', index=False)

#drop unnecessary columns

final_data = new_merged_data.drop(columns=['id','locale', 'locale_name', 'description', 'transferred'], inplace=True)
new_merged_data.head()
# set the date column as the index
new_merged_data.set_index('date', inplace=True)
new_merged_data.head()
# drop more columns

final_data = new_merged_data.drop(columns=['state',  'store_type', 'dayname'], inplace=True)
final_data = new_merged_data.copy()
final_data.head()
# categorizing the products
food_families = ['BEVERAGES', 'BREAD/BAKERY', 'FROZEN FOODS', 'MEATS', 'PREPARED FOODS', 'DELI','PRODUCE', 'DAIRY','POULTRY','EGGS','SEAFOOD']
final_data['family'] = np.where(final_data['family'].isin(food_families), 'FOODS', final_data['family'])
home_families = ['HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES']
final_data['family'] = np.where(final_data['family'].isin(home_families), 'HOME', final_data['family'])
clothing_families = ['LINGERIE', 'LADYSWARE']
final_data['family'] = np.where(final_data['family'].isin(clothing_families), 'CLOTHING', final_data['family'])
grocery_families = ['GROCERY I', 'GROCERY II']
final_data['family'] = np.where(final_data['family'].isin(grocery_families), 'GROCERY', final_data['family'])
stationery_families = ['BOOKS', 'MAGAZINES','SCHOOL AND OFFICE SUPPLIES']
final_data['family'] = np.where(final_data['family'].isin(stationery_families), 'STATIONERY', final_data['family'])
cleaning_families = ['HOME CARE', 'BABY CARE','PERSONAL CARE']
final_data['family'] = np.where(final_data['family'].isin(cleaning_families), 'CLEANING', final_data['family'])
hardware_families = ['PLAYERS AND ELECTRONICS','HARDWARE']
final_data['family'] = np.where(final_data['family'].isin(hardware_families), 'HARDWARE', final_data['family'])
from sklearn.preprocessing import StandardScaler

# create an instance of StandardScaler
scaler = StandardScaler()

# select numerical columns
num_cols = ['sales', 'transactions', 'dcoilwtico', 'year', 'month', 'dayofmonth', 'dayofweek']

# fit and transform the numerical columns
final_data[num_cols] = scaler.fit_transform(final_data[num_cols])

Features Encoding

from sklearn.preprocessing import OneHotEncoder

# Select the categorical columns
categorical_columns = ["family", "city", "holiday_type"]
categorical_data = final_data[categorical_columns]

# Initialize the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the data to one hot encoding
one_hot_encoded_data = encoder.fit_transform(categorical_data)

# Get the categories for each column
categories = [encoder.categories_[i] for i in range(len(encoder.categories_))]

# Create the column names for the one hot encoded data
column_names = []
for i in range(len(categories)):
    for j in range(len(categories[i])):
        column_names.append(f'{categorical_columns[i]}_{categories[i][j]}')

# Convert the one hot encoding data to a DataFrame
one_hot_encoded_data = pd.DataFrame(one_hot_encoded_data.toarray(), columns=column_names)


# Reset the index of both dataframes
final_data = final_data.reset_index(drop=True)
one_hot_encoded_data = one_hot_encoded_data.reset_index(drop=True)

# Concatenate the original dataframe with the one hot encoded data
final_data_encoded = pd.concat([final_data, one_hot_encoded_data], axis=1)

# Drop the original categorical columns
final_data_encoded.drop(categorical_columns, axis=1, inplace=True)
final_data_encoded.head()

#Rename dcoilwtico column to oil price

final_data_encoded.rename(columns={'dcoilwtico':'oil_price'}, inplace=True)
final_data_encoded.head()
# Make a copy of the final_data_encoded as data
data = final_data_encoded.copy()
data.head()
fig, ax = plt.subplots(figsize=(16, 11))
ax.plot(new_merged_data['sales'])
ax.set_xlabel('Time')
ax.set_ylabel('Sales')
fig.autofmt_xdate()
plt.tight_layout()
# Write the DataFrame to a CSV file
data.to_csv('encoded_data.csv', index=False)

Machine Learning Modeling

Here is the section to build, train, evaluate and compare the models to each others.

Evaluating the Model:

- We will use metrics such as mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE) to evaluate the model’s performance.

- We will compare the model’s predictions with the actual sales data to analyze its accuracy and identify any discrepancies.

- If the model’s metrics do not meet the business requirements, we will reframe the problem and try to improve the model’s performance by adding more data samples and features or selecting a different algorithm.

Simple Model #001

Please, keep the following structure to try all the model you want.

Create and Train the Model

# Import libraries
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error
# Split Data to train and Test
from sklearn.model_selection import train_test_split

# Create the feature dataframe using the selected columns
X = data.drop(["sales"], axis=1)

# Get the target variable
y = data.sales

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Linear Regression Model

# Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Make prediction on X_test
lr_predictions = lr.predict(X_test)
plt.scatter(y_test, lr_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Linear Regression")
plt.show()
# Evaluation Metrics for Linear Regression
lr_mse = mean_squared_error(y_test, lr_predictions).round(2)
lr_rmse = np.sqrt(lr_mse).round(2)
# apply the absolute value function to y_test to remove negative signs
y_test_abs = abs(y_test)
lr_predictions_abs = abs(lr_predictions)
# calculate the mean squared logarithmic error using the new y_test_abs and lr_predictions_abs array
lr_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, lr_predictions_abs)).round(2)
# Print the evaluation results for Linear Regression model
print("\nEvaluation Results for Linear Regression:")
print("MSE:", lr_mse)
print("RMSE:", lr_rmse)
print("RMSLE:", lr_rmsle)
Evaluation Results for Linear Regression:
MSE: 0.72
RMSE: 0.85
RMSLE: 0.26

Decision Tree Regression Model

# Decision Tree Regression Model
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

# Make prediction on X_test
dt_predictions = dt.predict(X_test)
plt.scatter(y_test, dt_predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("Decision Tree Regression")
plt.show()
# Evaluation Metrics for Decision Tree Regression
dt_mse = mean_squared_error(y_test, dt_predictions).round(2)
dt_rmse = np.sqrt(dt_mse).round(2)
# apply the absolute value function to y_test to remove negative signs
#y_test_abs = abs(y_test)
dt_predictions_abs = abs(dt_predictions)
# calculate the mean squared logarithmic error using the new y_test_abs and dt_predictions_abs array

dt_rmsle = np.sqrt(mean_squared_log_error(y_test_abs, dt_predictions_abs)).round(2)
# Print the evaluation results for Decision Tree Regression model
print("\nEvaluation Results for Decision Tree Regression:")
print("MSE:", dt_mse)
print("RMSE:", dt_rmse)

print("RMLSE:", dt_rmsle)

Conclusion:

A brief summary of the research findings:

· A technique for examining data that evolves over time is time series analysis. In this project, we examined patterns and trends in the data across time rather than individual data points or values. This was helpful in understanding how sales varied over time and what influences on sales there were.

· We noticed that Pichincha had the most retailers, which led to a rise in sales for Quito, the state with the highest sales.

· We also observed that the earthquake had such a significant impact on sales that there was a sharp increase in sales right before the disaster. Another pattern we observed was that Saturday — and Sunday were the busiest days for sales.

· Time series analysis can be used to look for trends in data across days, weeks, or months in order to understand how the number of visitors to a website varies over time. Using historical data, we could utilize this information to forecast the number of visitors we can anticipate in the future.

· In general, time series analysis is a great tool for comprehending how things change over time and can aid us in improving our future projections.

Note:?By analyzing the store sales time series data and building a forecasting model, we can accurately predict future sales and help the store management plan their inventory and sales strategies. The model’s performance should be evaluated based on its ability to meet the business requirements, and any issues with the data should be addressed appropriately to improve the model’s accuracy.

You can find more details about this project on my?GitHub?repository or visit my medium account if you're interested in doing so.

要查看或添加评论，请登录

Jabo Justin的更多文章

DEVELOPING SEPSIS PREDICTION WEB APPLICATION, WITH THE HELP OF MACHINE LEARNING AND FASTAPI PROJECT. CATEGORIZATION.

2023年12月21日

DEVELOPING SEPSIS PREDICTION WEB APPLICATION, WITH THE HELP OF MACHINE LEARNING AND FASTAPI PROJECT. CATEGORIZATION.

1.0 What is Sepsis? The body’s reaction to an infection can harm its own tissues and organs, leading to the potentially…
Using the Streamlit application and machine learning project, forecast and analyze customer Churn, to lower the rate of leaving the organization.

2023年11月5日

Using the Streamlit application and machine learning project, forecast and analyze customer Churn, to lower the rate of leaving the organization.

1.0 Introduction (What is Customer Churn?) Customer churn: When clients or subscribers stop using a company or service,…
Natural language processing NLP implementation using the BERT Sentiment Analysis App

2023年10月29日

Natural language processing NLP implementation using the BERT Sentiment Analysis App

What is Sentiment Analysis? Sentiment analysis A natural language processing technique called sentiment analysis can be…
Sales Analysis and Forecasting Of The Grocery Stores Using the Gradio, Streamlit application and machine learning project.

2023年10月23日

Sales Analysis and Forecasting Of The Grocery Stores Using the Gradio, Streamlit application and machine learning project.

Sales Analysis and Forecasting Of The Grocery Stores Azubian, a firm with grocery stores all over Africa, has recently…
BERT Sentiment Analysis App based Natural Language Processing (NLP) implementation.

2023年7月30日

BERT Sentiment Analysis App based Natural Language Processing (NLP) implementation.

BERT Sentiment Analysis App BERT Sentiment Analysis App based Natural Language Processing (NLP) implementation…
Streamlit Machine Leaning app

2023年7月23日

Streamlit Machine Leaning app

Streamlit Machine Leaning app Description Streamlit: is an open-source Python library. with the aid of "Streamlit" it…
GRADIO APP FOR PREDICTING TELCO CUSTOMER CHURN

2023年7月10日

GRADIO APP FOR PREDICTING TELCO CUSTOMER CHURN

GRADIO APP FOR PREDICTING TELCO CUSTOMER CHURN Customer churn is a significant issue for many companies, particularly…
Predicting Telco Customer Churn and findings from data analysis: Using machine learning.

2023年6月9日

Predicting Telco Customer Churn and findings from data analysis: Using machine learning.

Introduction: For the purpose of analyzing customer attrition at a telecoms firm (Telco), this repository contains code…
Analysis of Funding Trends in the Indian Startup Ecosystem

2023年4月8日

Analysis of Funding Trends in the Indian Startup Ecosystem

Analysis of Funding Trends in the Indian Startup Ecosystem Project Description The goal of this initiative is to…

4 条评论

See all articles

DESCRIPTION

Methodology:

The CRISP-DMThe six stages of the data mining lifecycle are as follows:

Business Understanding:

Questions:

Data Understanding:

For our project, here are the various data sets and their description;

train.csv

test.csv

transaction.csv

sample_submission.csv

· A submission file example in the appropriate format.

stores.csv

oil.csv

holidays_events.csv

Data Preparation

Data Handling

Import Libraries

领英推荐

For more details on the imported libraries, click the links below:

Issues with the Data (Data Issues):

Selecting the Right Model:

Merge all datasets for further EDA

Answering Questions

1. Is the train dataset complete (has all the required dates)?

2. Which dates have the lowest and highest sales for each year?

3. Are certain groups of stores selling more products? (Cluster, city, state, type)

5. What analysis can we get from the date and its extractable features?

7. What is the relationship between oil prices and sales?

8. What is the relationship between product and sales?

9. What is the trend of sales overtime

Feature Processing & Engineering

Drop Duplicates

New Features Creation

Impute Missing Values

Features Encoding

Machine Learning Modeling

Evaluating the Model:

Simple Model #001

Create and Train the Model

Linear Regression Model

Decision Tree Regression Model

Conclusion:

A brief summary of the research findings:

Jabo Justin的更多文章

DEVELOPING SEPSIS PREDICTION WEB APPLICATION, WITH THE HELP OF MACHINE LEARNING AND FASTAPI PROJECT. CATEGORIZATION.

Using the Streamlit application and machine learning project, forecast and analyze customer Churn, to lower the rate of leaving the organization.

Natural language processing NLP implementation using the BERT Sentiment Analysis App

Sales Analysis and Forecasting Of The Grocery Stores Using the Gradio, Streamlit application and machine learning project.

BERT Sentiment Analysis App based Natural Language Processing (NLP) implementation.

Streamlit Machine Leaning app

GRADIO APP FOR PREDICTING TELCO CUSTOMER CHURN

Predicting Telco Customer Churn and findings from data analysis: Using machine learning.

Analysis of Funding Trends in the Indian Startup Ecosystem

社区洞察

其他会员也浏览了

Deep Diving into Retail Data Analytics: How Can Businesses Benefit from It?

Data Integration in Action: Real-World Use Cases Across Industries

Solving Retail problem using AI based master data management

Importance Of Advanced Data Analytics In The CPG Industry

Retail Analytics Market expanding at an eloquent rate of 19.4% with an anticipated USD 16,350.8 Billion by 2027

Retail Analytics Market Dynamics Explored, Consumer Behavior and Technologies Trends

Frontline Engage!

The Future of Retail Analytics: Predictive Modeling in MS Access Web Apps

Use Case: Real-Time Supply Chain Visibility and Optimization with Snowflake and Nexgensis

Key Insights CEOs Seek from a Retail Business Dashboard