Descriptive Data Analysis of Sales Data
Ahmad Nawaz
Data | Python | SQL | R | Power BI | Data Visualisation | Data Storytelling | Data Analysis
This project/case study is part of my data analytics portfolio. I conducted descriptive data analysis of sales data. I conducted this analysis using R programming language. I also did an analysis of the same data using SQL but i will write a separate article for that, this article contains analysis only using R. Data consisted of 3 CSV files. Entire data analysis was done in the RStudio.
Setting up Environment
Installing Packages
install.packages("tidyverse")
install.packages("dplyr")
install.packages("lubridate")
Loading Libraries
library(tidyverse)
library(dplyr)
library(lubridate)
library(ggplot2)
Loading Data Files
Orders_List <- read.csv("ListOfOrders.csv")
Order_Details <- read.csv("OrderDetails.csv")
Sales_Target <- read.csv("SalesTarget.csv")
Data Cleaning
Checking Duplicates in all data files
sum(duplicated(Orders_List))
sum(duplicated(Order_Details))
sum(duplicated(Sales_Target))
When i was exploring data in the excel file i noticed that total number of rows in "List of Orders" are 500 but when i checked total number of rows in RStudio it showed 560 rows this means there were 60 blank rows in "List of Orders" file.
So i removed blank rows from "List of Orders" data file.
blankrows <- apply(Orders_List == "", 1, all)
Orders_List <- Orders_List[!blankrows, ]
Then i checked duplicate values in each data file.
sum(duplicated(Orders_List))
sum(duplicated(Order_Details))
sum(duplicated(Sales_Target))
No duplicate values were found in any of the files.
Then i converted "Order.Date" column of the List of Orders table to the default data format of R language because later on while analyzing and digging down the data it will create problems.
Orders_List$Order.Date <- as.Date(Orders_List$Order.Date, "%d-%m-%Y")
Analysis
I created a new data frame "Combined" by performing inner join on Orders_List and Order_Details as none of the files have all the required columns needed for data analysis in one single file.
Combined <- merge(Orders_List, Order_Details, by = "Order.ID", all = FALSE)
1 - Finding top 10 cities with highest profit along with total products sold and total customers in those cities.
A <- Combined %>% group_by(City) %>%
summarize(Profit = sum(Profit),
Products_Sold = sum(Quantity),
Total_Customers = n_distinct(CustomerName)) %>%
arrange(desc(Profit)) %>%
head(n=10)
Output
Plotting
Then i created visualization in R using ggplot showing top 10 cities with most profit generated.
ggplot(data = B %>% slice(1:5), aes(x=reorder(City, -Profit), y=Profit, fill=City)) +
geom_col() +
geom_line(aes(group=1), color = "white") +
geom_text(aes(label = Profit), position = position_stack(vjust = 0.5)) +
labs(title = "Top 10 Cities With Most Profit", x = "Cities") +
theme(
? ? panel.background = element_rect(fill = "black"),
? ? plot.title = element_text(hjust = 0.5, face = "bold"),
? ? panel.grid = element_blank(),
? ? plot.background = element_rect(),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold"),
? ? legend.position = "none")
Data Viz
2 - Top 10 Cities with most revenue generated
Top10CitiesWithMostRevenueGenerated <- Combined %>% group_by(City) %>%
? summarise(Revenue = sum(Amount)) %>%?
? arrange(desc(Revenue)) %>% head(n=10)?
Output
Plotting
ggplot(data = Top10CitiesWithMostRevenueGenerated, aes(x=reorder(City, -Revenue), y=Revenue, fill = City)) +
? geom_col() + labs(title = "Top 10 Cities With Most Revenue Generated", x = "Cities", y = "Revenue") +
? theme(
? ? legend.position = "none",
? ? panel.grid = element_blank(),
? ? panel.background = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )
Data Viz
3 - Total Sales and Total Orders by day of the week
C <- Combined %>%
mutate(Day = weekdays(Order.Date)) %>%
group_by(Day) %>%
summarize(Total_Orders = n(),
Total_Sales = sum(Amount))?
Output
Plotting
For plotting first of all i gave a proper order to the days of the week so they appear in an order in visualization.
C$Day <- factor(C$Day, levels=c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
Now plotting days of the week against total orders using ggplot.
ggplot(data = C, aes(x= Day, y=Total_Orders, fill = Total_Orders)) +
geom_col() +
geom_text(aes(label = Total_Orders), position = position_stack(vjust = 0.5), color = "white") +
labs(title = "Total Orders by Day of the Week", x = "Days", y = "Total Orders") +
theme(
? ? panel.background = element_blank(),
? ? plot.title = element_text(hjust = 0.5, face = "bold"),
? ? panel.grid = element_blank(),
? ? #plot.background = element_rect(fill = "skyblue"),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold"),
? ? legend.position = "none")
1st data viz
Now plotting days of the week against total sales using ggplot.
ggplot(data = C, aes(x= Day, y=Total_Sales, fill = Total_Sales)) +
geom_col() +
geom_text(aes(label = Total_Sales), position = position_stack(vjust = 0.5), color = "white") +
labs(title = "Total Sales by Day of the Week", x = "Days", y = "Total Sales") +
theme(
? ? panel.background = element_blank(),
? ? plot.title = element_text(hjust = 0.5, face = "bold"),
? ? panel.grid = element_blank(),
? ? #plot.background = element_rect(fill = "skyblue"),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold"),
? ? legend.position = "none")?
2nd data viz
4 - Calculating How Many Times the Sales Target was HIT and How Many Times the Sales Target was not HIT
Step-1
Dataframe1 <- Combined %>% mutate(Month.of.Order.Date = format(Order.Date, "%b-%y")) %>%?
? group_by(Month.of.Order.Date, Category) %>% summarize(Total = sum(Amount))
Step-2
Dataframe2 <- merge(Dataframe1, Sales_Target, by = c("Month.of.Order.Date", "Category"), all.y = TRUE)
Step-3
Dataframe2 <- Dataframe2 %>% mutate(Hit_or_Fail = ifelse(Dataframe2$Total < Dataframe2$Target, "Fail", "Hit"))
Step-4
Dataframe2$Category <- as.factor(Dataframe2$Category)
Step-5
HIT_OR_FAIL <- Dataframe2 %>% group_by(Category) %>%?
? mutate(Hit = sum(case_when(Hit_or_Fail == "Hit" ~ 1, TRUE ~ 0 )),
? ? ? ? ? ?Fail = sum(case_when(Hit_or_Fail == "Fail" ~ 1, TRUE ~ 0 ))) %>%
? ? ? ? ? ? ? ? ? ? select(Hit, Fail) %>% distinct()
Output
Plotting
For this plot, i created a stacked bar chart using ggplot. But to create stacked bar chart long-form data is more suitable so first converted output to long format from wide format.
Long_Form <- gather(HIT_OR_FAIL, Variable, Value, -Category)
Now Plotting
ggplot(data = Long_Form, aes(x = Category, y = Value, fill = Variable)) +
geom_col() +
geom_text(aes(label = Value), position = position_stack(vjust = 0.5), col="white") +
labs(title = "No. of Times Sales Target was Hit and Failed", y = "No. of Times") +
scale_fill_manual(values = c("#0000FF", "#006400")) +
? theme(
? ? panel.grid = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? panel.background = element_rect(fill = "white")
? ? ) +
? guides(fill = guide_legend(title = "Target"))
Data viz
5 - Calculating profit and quantity sold by each month
D<-Combined %>%
mutate(Month_Name = format(Order.Date, "%B")) %>% group_by(Month_Name) %>%
? summarise(Profit = sum(Profit),
? ? ? ? ? Quantity_Sold = sum(Quantity)) %>% arrange(desc(Profit))?
Output
Plotting
Now to plot data first i ordered months, i gave the same order as months appear in the calendar. This data starts from April-2018 and ends at March-2019.
D$Month_Name <- factor(D$Month_Name, levels = c("April", "May", "June", "July", "August", "September",
"October", "November", "December", "January", "February",
"March"))
Now plotting months against total quantity sold
ggplot(data = D, aes(x = Month_Name, y = Quantity_Sold)) +
geom_area(aes(group = 1), size = 1.5, fill = "steelblue") +?
geom_text(aes(label=Quantity_Sold), color = "black", face = "bold") +?
labs(title = "Total Quantity Sold Each Month", x = "Month", y = "Total Quantity Sold") +
theme(
? ? plot.title = element_text(hjust = 0.5, face = "bold"),
? ? panel.background = element_rect(fill = "white"),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )
1st data viz
Now plotting total profit against months
ggplot(data = D, aes(x = Month_Name, y = Profit)) +
? geom_col(fill = "purple") +?
? geom_text(aes(label=Profit), color = "black", position = position_stack(vjust = 0.5)) +?
? labs(title = "Total Profit Each Month", x = "Month", y = "Total Profit") +
? theme(
? ? plot.title = element_text(hjust = 0.5, face = "bold"),
? ? panel.background = element_rect(fill = "white"),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )
2nd data viz
6 - Calculating the percentage contribution of each month in total profit.
D2<- Combined %>% mutate(Month_Name = format(Order.Date, "%B")) %>% group_by(Month_Name) %>%
? ? ? ? summarise(Profit = sum(Profit),
? ? ? ? ? ? Quantity_Sold = sum(Quantity)) %>% filter(Profit >= 0) %>%?
? ? ? ? ? ? ? ? ?mutate (Percentage_of_Sales = round((Profit/sum(Profit))*100,2)) %>%?
? ? ? ? ? ? ? ? ? ? ? arrange(desc(Profit))??
Output
Now Plotting
In this viz, i am making a pie chart.
pie(D2$Percentage_of_Sales, labels = paste(D2$Month_Name,","
,round(D2$Percentage_of_Sales,2), "%"),
? ? col = rainbow(length(D2$Month_Name)), main = "Percentage of Profit by Month")
Data viz
7 - Calculating the percentage contribution of each month in total loss.
D4<- Combined %>% mutate(Month_Name = format(Order.Date, "%B")) %>%
group_by(Month_Name) %>
? ? ? ? summarise(Profit = sum(Profit),
? ? ? ? ? ? Quantity_Sold = sum(Quantity)) %>% filter(Profit <= 0) %>%?
? ? ? ? ? ? ? ? ?mutate (Percentage_of_Loss = round((Profit/sum(Profit))*100,2)) %>%?
? ? ? ? ? ? ? ? ? ? ? arrange(desc(Profit))%
Output
Now Plotting
pie(D4$Percentage_of_Loss, labels = paste(D4$Month_Name,"," ,round(D4$Percentage_of_Loss,2), "%")
? ? col = rainbow(length(D4$Month_Name)), main = "Percentage of Loss shared by Each Month"),
Data viz
8 - Heat Map of the Number of Orders with respect to Month and Day of the Week
Step-1
Adding "Month" column in the Combined data frame.
Combined <- Combined %>% mutate(Month = format(Order.Date, format = "%B"))?
Step-2
Adding "Day" column in the Combined data frame.
Combined <- Combined %>% mutate(Day = weekdays(Order.Date))
Step-3
Now creating the data frame HeatMap which will be used to make data viz.
HeatMap <- Combined %>% group_by(Month, Day) %>%
summarise(Orders = length(unique(Order.ID)))
Step-4, step-5 and step-6 are performed to make the data viz appearance understandable.
Step-4
Months are ordered in this way because data starts from April-2018 and ends on March-2019
HeatMap$Month <- factor(HeatMap$Month, levels = c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "Februaru", "March"))
Step-5
Rearranging days of the week so they appear in proper order in data viz.
HeatMap$Day <- factor(HeatMap$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
Step-6
I am checking the maximum and minimum values in the orders column because maximum and minimum values will be used in scale_fill_gradient for the intensity of the colour in heatmap.
max(HeatMap$Orders)
min(HeatMap$Orders)
Now Plotting
ggplot(data = HeatMap, aes(x=Month, y=Day, fill=Orders)) +
geom_tile() +
? scale_fill_gradient(low = "#90EE90", high = "#006400", limits = c(1,16)) +
? labs(title = "Heat Map", x = "Months", y = "Days") +
? theme(
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )
Data Viz
领英推荐
9 - Top 5 Customers who brought in the most revenue and the total number of orders they placed
Top5CustomersWithMostRevenue <- Combined %>%
group_by(CustomerName) %>%
summarise(Revenue = sum(Amount),
Orders = length(unique(Order.ID))) %>%
arrange(desc(Revenue)) %>%
head(n=5)
Output
Plotting
First, i converted data from wide format to long format because to create a dodge bar or column chart long format data is more suitable.
Long_Form_Top5CustomersWithMostRevenue <- pivot_longer(Top5CustomersWithMostRevenue,
col = -CustomerName,
names_to = "Revenue_AND_Orders",
values_to = "Values")
Now code for creating the plot.
ggplot(data = Long_Form_Top5CustomersWithMostRevenue, aes(x = reorder(CustomerName, ifelse(Revenue_AND_Orders == "Revenue", -Values, Values), FUN = min), y = Values, fill = Revenue_AND_Orders))
? geom_col(position = "dodge") +
? geom_text(aes(label = Values), position = position_dodge(width = 0.9), vjust = -0.2) +
? scale_fill_manual(values = c("#000080", "#FF0000")) + ylim(0, 9178) +
? labs(title = "Top 5 Customers who brought in most revenue and No of orders placed by them", x = "Customer Names",
? ? ? ?y = "Values(Revenue AND Orders Placed)") +
? theme(
? ? panel.background = element_blank(),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold"),
? ? panel.grid = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5)
? ) +
? guides(fill = guide_legend(title = NULL))
+
Data Viz
You must have noticed that the columns for "Orders" are very small and are hardly visible. Reason behind it is that values of orders are very low than values of revenue. Colour of columns for orders is blue.
10 - Top 5 customers who placed the most number of orders along with revenue generated by them
Top5CustomersWithMostNumberofOrders <- Combined %>%
group_by(CustomerName) %>$
summarise(Orders = length(unique((Order.ID))),
Revenue = sum(Amount)) %>%?
arrange(desc(Orders)) %>%
head(n=5)
Output
Plotting
First, i converted data from wide format to long format because to create a dodge bar or column chart long format data is more suitable.
Long_Form_Top5CustomersWithMostNumberofOrders <- pivot_longer(Top5CustomersWithMostNumberofOrders,
cols = -CustomerName,
names_to = "Orders_AND_Revenue",
values_to = "Values")
Now code for creating the plot.
ggplot(data = Long_Form_Top5CustomersWithMostNumberofOrders, aes(x = reorder(CustomerName, ifelse(Orders_AND_Revenue == "Orders", -Values, Values), FUN = min), y = Values, fill = Orders_AND_Revenue))
? geom_col(position = "dodge") +
? geom_text(aes(label = Values), position = position_dodge(width = 0.9), vjust = -0.2) +
? scale_fill_manual(values = c("#A020F0", "#8B4000")) + ylim(0, 9178) +
? labs(title = "Top 5 Customers with most orders and revenue generated by them", x = "Customer Names",
? ? ? ?y = "Values(Orders AND Revenue)") +
? theme(
? ? panel.background = element_blank(),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold"),
? ? panel.grid = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5)
? ) +
? guides(fill = guide_legend(title = NULL))+
Data Viz
You must have noticed that the columns for "Orders" are very small and are hardly visible. Reason behind it is that values of orders are very low than the values of revenue. Colour of the columns for orders is purple.
11 - Top 5 Customers who ordered the most number of items and revenue generated by them
Top5CustomersWithMostQuantity <- Combined %>%
? group_by(CustomerName) %>%?
? summarise(Total_Quantity = sum(Quantity), Revenue = sum(Amount)) %>%
? arrange(desc(Total_Quantity)) %>%
? head(n=5)?
Output
Plotting
First, i converted data from wide format to long format because to create a dodge bar or column chart long format data is more suitable.
Long_Form_Top5CustomersWithMostQuantity <- pivot_longer(Top5CustomersWithMostQuantity
? ? ? ? ? ? cols = -CustomerName, names_to = "TotalQuantity_AND_Revenue",
? ? ? ? values_to = "Values"),
Now code for creating the plot.
ggplot(data = Long_Form_Top5CustomersWithMostQuantity, aes(x = reorder(CustomerName, ifelse(TotalQuantity_AND_Revenue == "Total_Quantity", -Values, Values), FUN = min), y = Values, fill = TotalQuantity_AND_Revenue))
? geom_col(position = "dodge") +
? geom_text(aes(label = Values), position = position_dodge(width = 0.9), vjust = -0.2) +
? scale_fill_manual(values = c("#0000FF", "#006400")) +
? labs(title = "Top 5 Customers who ordered the most number of items and revenue generated by them", x = "Customer Names",
? ? ? ?y = "Values(Total Quantity AND Revenue)") +
? theme(
? ? panel.background = element_blank(),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold"),
? ? panel.grid = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5)
? ) +
? guides(fill = guide_legend(title = NULL))+
Data Viz
You must have noticed that the columns for "Total_Quantity" are very small and are hardly visible. The reason behind it is that values of "Total_Quantity are very low than the values of revenue. Colour of the columns for Total_Quantity is green.
12 - Revenue and Percentage of Revenue Generated by Each Category
RevenueByCategory <- Combined %>% group_by(Category) %>%
? summarise(Revenue = sum(Amount)) %>%
? mutate(Revenue_Percentage = round(Revenue/sum(Revenue)*100,2)) %>% arrange(desc(Revenue_Percentage))?
Output
Plotting
pie(RevenueByCategory$Revenue_Percentage,
paste(RevenueByCategory$Category,",",RevenueByCategory$Revenue_Percentage,"%"),
col = rainbow(length(RevenueByCategory$Category)),
main = "Percentage of Revenue Generated by Each Category")
Data Viz
13 - Revenue, Profit, Loss and Profit Percentage contribution by each sub-category of Clothing
Clothing_Category_Revenue_Profit_ProfitPercentage <- Combined %>%
filter(Category == "Clothing") %>%
group_by(Sub.Category) %>%
summarise(Revenue = sum(Amount),
Profit = sum(Profit)) %>%
mutate(Profit_Percentage = round(Profit/sum(Profit)*100,2)) %>%
arrange(desc(Revenue))
Output
Plotting
ggplot(data = Clothing_Category_Revenue_Profit_ProfitPercentage,
aes(x = reorder(Sub.Category, Profit_Percentage), y = Profit_Percentage, fill = Sub.Category)) +
geom_col() + coord_flip() +
labs(title = "Percentage of Profit generated by each Sub-Category of Clothing from 32.23% original
contribution of Clothing in overall profit",?
y = "Profit Percentage", x = "Sub-Categories of Clothing") +
geom_text(aes(label = paste(Profit_Percentage,"%"), hjust = -0.25, vjust = 0.5)) +
theme(
? ? legend.position = "none",
? ? panel.grid = element_blank(),
? ? panel.background = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )?
Data Viz
14 - Revenue, Profit, Loss and Profit Percentage contribution by each sub-category of Electronics
Electronics_Category_Revenue_Profit_ProfitPercentage <-?Combined %>%
filter(Category == "Electronics") %>%
group_by(Sub.Category) %>%
summarise(Revenue = sum(Amount),
Profit = sum(Profit)) %>%
mutate(Profit_Percentage = round(Profit/sum(Profit)*100,2)) %>%
arrange(desc(Revenue))
Output
Here we have a negative value of profit which means this category is in loss. So to calculate the accurate profit percentage for each category i eliminated negative values of profit and then calculated the profit percentage for each subcategory again (values in other columns are correct).
Electronics_Profit_Percentge <- Electronics_Category_Revenue_Profit_ProfitPercentage %>%
filter(Profit >= 0) %>%
mutate(Profit_Percentage = round(Profit/sum(Profit)*100,2)) %>%
arrange(desc(Profit_Percentage))
Output
Plotting
ggplot(data = Electronics_Profit_Percentge,
? ? ? ?aes(x = reorder(Sub.Category, Profit_Percentage), y = Profit_Percentage, fill = Sub.Category)) +
? geom_col() + coord_flip() +
? labs(title = "Percentage of Profit generated by each Sub-Category
of Electronics from 38.3% original contribution of Electronics in overall profit",?
? ? ? ?y = "Profit Percentage", x = "Sub-Categories of Electronics") +
? geom_text(aes(label = paste(Profit_Percentage,"%"), hjust = -0.10, vjust = 0.5)) +
? theme(
? ? legend.position = "none",
? ? panel.grid = element_blank(),
? ? panel.background = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )?
Data Viz
15 - Revenue, Profit, Loss and Profit Percentage contribution by each sub-category of Furniture
Furniture_Category_Revenue_Profit_ProfitPercentage <- Combined %>%
filter(Category == "Furniture") %>%
group_by(Sub.Category) %>
summarise(Revenue = sum(Amount),
Profit = sum(Profit)) %>%
mutate(Profit_Percentage = round(Profit/sum(Profit)*100,2)) %>%
arrange(desc(Revenue))%
Output
Here we have a negative value of profit which means this category is in loss. So to calculate the accurate profit percentage for each category i eliminated negative values of profit and then calculated the profit percentage for each subcategory (values in other columns are correct).
Furniture_Profit_Percentge <- Furniture_Category_Revenue_Profit_ProfitPercentage %>%
filter(Profit >= 0) %>%
mutate(Profit_Percentage = round(Profit/sum(Profit)*100,2)) %>%
arrange(desc(Profit_Percentage))
Output
Plotting
ggplot(data = Furniture_Profit_Percentge,
? ? ? ?aes(x = reorder(Sub.Category, Profit_Percentage), y = Profit_Percentage, fill = Sub.Category)) +
? geom_col() + coord_flip() +
? labs(title = "Percentage of Profit generated by each Sub-Category of Furniture from 29.4% original contribution of Furniture in overall profit",?
? ? ? ?y = "Profit Percentage", x = "Sub-Categories of Furniture" ) +
? geom_text(aes(label = paste(Profit_Percentage,"%"), hjust = -0.15, vjust = 0.5)) +
? theme(
? ? legend.position = "none",
? ? panel.grid = element_blank(),
? ? panel.background = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? )?
Data Viz
16 - Top 5 States With Most Revenue along with their Profit, Total Orders and No of Customers.
Top_5_States <- Combined %>% group_by(State) %>%
summarise(Revenue = sum(Amount),
Profit = sum(Profit),
Total_Orders = length(unique(Order.ID)),
No_of_Customers = length(unique(CustomerName))) %>%
arrange(desc(Revenue)) %>% head(n=5)
Output
This data is in wide form and i want to create a dodge column chart and to create that long format data is more suitable so i am converting this wide form data to long form.
Long_Form_Top_5_States <- pivot_longer(Top_5_States, cols = -State,
names_to = "Variables",
values_to = "Values")
Plotting
As data has been converted to long-form so now i will create the plot.
ggplot(data = Long_Form_Top_5_States, aes(x = reorder(State, -Values), y = Values, fill = Variables))
geom_col(position = "dodge") + labs(title = "Top 5 States With Most Revenue along with their Profit, Total Orders and No of Custmers", x = "States", y = "Values") +
geom_text(aes(label = Values), position = position_dodge(width = 0.9), vjust = -0.75, hjust = 0.50) +
guides(fill = guide_legend(title = NULL)) +?
? scale_fill_manual(labels = c("No of Customers", "Profit", "Revenue", "Total Orders"),?
? ? ? ? ? ? ? ? ? ? values = c("red", "navyblue", "darkgreen", "purple")) +
? theme(
? ? panel.grid = element_blank(),
? ? panel.background = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? ? )+
Data Viz
As you can see in the data viz columns for No of Customers, and Total Orders are very small and are hardly visible, reason behind it is that values for these are very small as compared to values of other columns (Revenue and Profit).
17 - Comparing Sales of Current Month with sales of Last Month
Sales_Comparison <- Combined %>%
? mutate(Month = factor(Month, levels = c("April", "May", "June", "July", "August", "September", "October", "November", "December", "January", "February", "March"))) %>%
? group_by(Month) %>% summarise(Total_Sales = sum(Amount)) %>%
? mutate(Previous_Month_Sales = lag(Total_Sales))?
Notice that months are starting from April and are ending at March it's reason is that data is starting from April-2018 and is ending at March-2019, also notice that the month of April does not have any value in "Previous Month Sales" column because data is starting from April-2018.
This data is in wide format and to make dodge bar chart i am converting this data to long format.
Long_Form_Sales_Comparison <- pivot_longer(Sales_Comparison,
cols = -Month, names_to = "CurrentMonth_VS_PreviousMonth",
values_to = "Sales_Values")
Output
Plotting
ggplot(data = Long_Form_Sales_Comparison, aes(x = Month, y = Sales_Values, fill = CurrentMonth_VS_PreviousMonth)) +
? geom_col(position = "dodge") +
? scale_fill_manual(labels = c("Previous Month Sales", "Current Month Sales"), values = c("navyblue", "maroon")) +?
? labs(title = "Current Month and Previous Month Sales Comparison", x = "Months", y = "Total Sales") +
? theme(
? ? panel.background = element_blank(),
? ? panel.grid = element_blank(),
? ? plot.title = element_text(face = "bold", hjust = 0.5),
? ? axis.title.x = element_text(face = "bold"),
? ? axis.title.y = element_text(face = "bold")
? ) +
? guides(fill = guide_legend(title = NULL))
Data Viz
18 - Displaying the details (in terms of “order_date”, “order_id”, “State”, and “CustomerName”) for the first order in each state and ordering the result by “order_id”
B <- Combined %>%
group_by(State) %>%
select(Order.ID, Order.Date) %>%
mutate(Row_Number = row_number()) %>%
filter(Row_Number == 1) %>%
arrange(Order.ID)?%>%
select(-c(Row_Number))
Output
19 - Top 5 Dates on Which Most Revenue was Generated
Top5DatesOnWhichHighestAmountWasSpent <- Combined %>%
? group_by(Order.Date) %>%?
? summarise(Total_Amount = sum(Amount)) %>%
? arrange(desc(Total_Amount)) %>%
? head(n=5)?
Output
20 - Correlation and Scatter plot (showing correlation between variables) between Revenue(Amount) and Profit
cor(Combined$Amount, Combined$Profit)
Output
0.2420446
Correlation between Revenue(Amount) and Profit came out to be "0.2420446" which indicates that relation between Revenue(Amount) and Profit is not strong. Relation is positive but it is not a strong relation, it is a weak relation which indicates that increasing the revenue will not necessarily increase the profit.
Scatter Plot
plot(Combined$Amount, Combined$Profit, main = "Revenue vs Profit", xlab = "Revenue", ylab = "Profit")
? abline(lm(Combined$Profit ~ Combined$Amount), col = "red")+
21 - Correlation and Scatter plot (showing correlation between variables) between Revenue(Amount) and Quantity Ordered
cor(Combined$Amount, Combined$Quantity)
Output
0.3520518
Correlation between Revenue(Amount) and Quantity Ordered came out to be "0.3520518" which indicates that the relation between Revenue(Amount) and Profit is not strong. Relation is positive but it is not a strong relation, it is a weak relation which indicates that increasing the quantity ordered will not necessarily increase the profit.
Scatter Plot
plot(Combined$Quantity, Combined$Amount, main = "Quantity Ordered vs Revenue", xlab = "Quantity Ordered", ylab = "Revenue")
? abline(lm(Combined$Amount ~ Combined$Quantity), col = "red")+
22 - Correlation and Scatter plot (showing correlation between variables) between Profit and Quantity Ordered
cor(Combined$Profit, Combined$Quantity)
Output
0.00244136
Correlation between Profit and Quantity Ordered came out to be "0.00244136" which indicates that the relation between Quantity Ordered and Profit is not strong. Relation is positive but it is not a strong relation, rather it is a very weak relation which indicates that increasing the quantity ordered will not necessarily increase the profit.
Scatter Plot
plot(Combined$Quantity, Combined$Profit, main = "Quantity Ordered vs Profit", xlab = "Quantity Ordered", ylab = "Profit")
? abline(lm(Combined$Quantity ~ Combined$Profit), col = "red")+