Web Traffic Data Analysis for an Online Store
Nasir Yusuf Ahmad
Transforming Data into Actionable Insights for Informed Decision-Making | SDG 3 & 4 | Co-Founder HealStoc | Human Capital Development Advocate | AI, ML, LLM Engineer | Economist | Author | Certified Data Scientist
“In the end you should only measure and look at the numbers that drive action, meaning that the data tells you what you should do next.”— Alexander Peiniger.
The weekend is here again. Today, this blog will get us talking about R, R, R, and R. Yes, R.
Originally, this article was a class project at the Clickon Kaduna Data Science Fellowship Programme for the R module. My four team members and I presented on website traffic analysis for an online store. We did exploratory data analysis and predictive analysis, and built machine learning models to evaluate how accurately we could predict outcomes.
Story Time?
I remember, as a child, I had difficulty pronouncing the letter "R". My older brother always had to insist that I pronounce it correctly, which I eventually did over time.
When we started the R class, at first, I must confess, it was boring. But then I remembered the difficulty I had pronouncing R as a child; I did not want that to become a pattern, so I let the reluctance go and embraced the tool passionately.
Let's get started.
What is R?
Simply put, R is a language and environment for statistical computing and graphics. Some argue that R is the language of Data Science.
R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The initial version of R was created in 1993, and it was later substantially enhanced by a large group of developers globally. The R language is open-source, meaning that its source code is freely available, and users are encouraged to contribute to its development.
Why Learn R?
According to a survey of Kagglers analyzed by David Smith, Python and R are the preferred tools, as demonstrated below:
About 7,955 respondents participated in the survey: 76.3% of them voted in favor of Python, while 59.2% voted for R. So this is an open secret: most data scientists love to use Python, while far fewer use R, SQL, and other related languages.
However, R has some pros and cons in comparison to Python. For instance, R has one of the largest data science communities worldwide. Consider the map below:
Although R boasts a global user base exceeding two million, the focus of R activity varies across regions. The team at Rapporter, who published the map, aimed to capture this by amalgamating data on the geographical distribution of R Foundation members, package authors, package downloaders, and user group members.
This comprehensive dataset is used to generate a country-wide 'R usage score,' which is further adjusted for population. The resulting scores are visually represented on the map, with red indicating lower R activity and blue signifying higher engagement.
When considering R activity per capita, the leading country might defy expectations: Switzerland tops the list, while New Zealand, the birthplace of R, holds the second position. The following are the leading countries in terms of per-capita R activity:
1- Switzerland
2- New Zealand (birthplace of R)
3- Austria (home of the R Foundation)
4- Ireland
5- United States Minor Outlying Islands*
6- United States
7- Australia
8- Singapore
9- Denmark
10- United Kingdom
11- Canada.
The essence of this analogy is to highlight the relative importance of R globally. If you plan to migrate to one of the countries listed above, then you should get excellent at using R for data science or analysis, because it could be the most important hoe you take to the farm.
So now that we have established our case, let's write some code.
We will first load the libraries we will need. For that, we run the following code:
library(tidyverse) # loads ggplot2, dplyr, and related packages
library(ggplot2) # already included in the tidyverse, but explicit loading does no harm
# We create a variable called "df" holding the dataset
df <- read.csv('Online_sales.csv')
# We inspect the first six rows of the data frame
head(df)
Since our dataset is for website traffic analysis, we were very interested in certain fundamentals, such as: Which customer visits resulted in revenue for the firm? What do we know about the bounce rates and exit rates? How do these variables affect the firm's chances of profitability? These and many more questions were asked and answered in the analysis below.
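Before defining these variables, it can help to take a quick look at the data itself. Below is a minimal inspection sketch using base R, assuming the columns follow the names used throughout this analysis:
# Inspect column names, types, and sample values
str(df)
# Summary statistics for every column
summary(df)
# Count missing values per column
colSums(is.na(df))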
Now, we will proceed to define the variables with respect to the questions we asked above.
# 1. User Engagement
avg_pages_visited <- mean(df$ProductRelated) # mean product-related pages viewed per session
avg_duration <- mean(df$ProductRelated_Duration) # mean time spent on those pages, in seconds
print(avg_duration)
print(avg_pages_visited)
The above commands create two variables: the average number of product-related pages visited per session and the mean duration spent on them, in seconds.
The average duration on the site is about 1,194 seconds, that is, roughly 20 minutes, while the average number of pages visited was about 31.73.
# 2. Bounce and Exit Rates
overall_bounce_rate <- mean(df$BounceRates)
# Mean exit rate for each distinct page value
exit_rates_by_page <- tapply(df$ExitRates, df$PageValues, mean)
We also checked the mean of the bounce rates, which is about 0.02. The lower the bounce rate, the higher the chance of a visit resulting in revenue.
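To make this concrete, we can compare average bounce and exit rates between revenue-generating and non-revenue sessions. A brief sketch using dplyr (attached with the tidyverse earlier); this grouping step is an addition for illustration, not part of the original project code:
# Compare average bounce and exit rates by revenue outcome
df %>%
  group_by(Revenue) %>%
  summarise(
    mean_bounce = mean(BounceRates), # average bounce rate per outcome
    mean_exit = mean(ExitRates), # average exit rate per outcome
    sessions = n() # number of sessions per outcome
  )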
# 3. Conversion and Revenue
conversion_rate <- sum(df$Revenue) / nrow(df) # share of sessions that ended in revenue
page_value_correlation <- cor(df$PageValues, df$Revenue) # linear association with revenue
The first line above calculates the conversion rate as the number of revenue-generating sessions divided by the total number of rows, while the second line computes the correlation between PageValues and Revenue.
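For readability, these figures can be printed as a percentage and a rounded coefficient. An illustrative snippet, assuming Revenue was read in as logical TRUE/FALSE values so that sum() counts the revenue-generating sessions:
# Print the conversion rate as a percentage and the correlation to three decimals
cat(sprintf("Conversion rate: %.1f%%\n", 100 * conversion_rate))
cat(sprintf("PageValues-Revenue correlation: %.3f\n", page_value_correlation))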
Below, we will plot some graphs from the data using the following syntax:
# Plot 1: Monthly Visits
ggplot(df, aes(x = Month, fill = Revenue)) +
geom_bar() +
theme_minimal() +
labs(x = "Month", y = "Visits", title = "Monthly Visits")
Output:
We also checked for monthly revenue:
# Plot 2: Monthly Revenue
ggplot(df, aes(x = Month, y = Revenue, fill = Revenue)) +
geom_bar(stat = "summary", fun = sum) +
theme_minimal() +
labs(x = "Month", y = "Revenue", title = "Monthly Revenue")
Output:
There is clearly a strong relationship between monthly visits and monthly revenue in the two charts above. The months with the highest visits were May and November; in fact, November has the highest revenue. This is easy to understand: Black Friday falls in November across much of the world, hence the traffic and the substantial increase in revenue.
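To quantify the monthly pattern rather than just eyeballing the bars, a grouped summary of the conversion rate per month can be computed. A sketch along these lines, again assuming Revenue is logical:
# Conversion rate by month: the share of sessions that ended in revenue
df %>%
  group_by(Month) %>%
  summarise(
    visits = n(),
    conversions = sum(Revenue),
    conversion_rate = mean(Revenue)
  ) %>%
  arrange(desc(conversion_rate))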
Moving forward, we take an interest in the types of visitors who frequent the website:
# Plot 3: Visitor Types
ggplot(df, aes(x = VisitorType, fill = Revenue)) +
geom_bar() +
theme_minimal() +
labs(x = "Visitor Type", y = "Visits", title = "Visitor Types")
Output:
New visitors are users who just signed up on the platform, while returning visitors are existing customers who come back to the site, probably to make purchases. As you can see above, most revenue comes from returning visitors.
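The same grouped-summary idea makes the visitor-type difference concrete. A brief sketch comparing conversion rates across visitor types:
# Conversion rate by visitor type
df %>%
  group_by(VisitorType) %>%
  summarise(
    visits = n(),
    conversion_rate = mean(Revenue)
  )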
Type of Revenue:
ggplot(df, aes(x = Revenue, fill = Revenue)) +
geom_bar() +
ggtitle("Revenue Chart")
Output:
This is the genesis of the problem: of the over 10,000 visits to the website, only about 25% resulted in TRUE, meaning revenue was generated from those visits. The goal is to see what we can do to convert more of the other 75% that were FALSE.
As for the questions of regional visits, OS type, and browser preference, we dropped those charts because of a lack of metadata; in other words, we do not have clear information about what the coded values of those three variables mean.
Therefore, we plotted a grid plot for some variables as seen below:
library(gridExtra)
# Create a grid of multiple plots for different aspects
grid.arrange(
ggplot(df, aes(x = Revenue, y = Informational, fill = Revenue)) +
geom_boxplot() +
ggtitle("Informational Pages Visited by Revenue"),
ggplot(df, aes(x = Revenue, y = ProductRelated, fill = Revenue)) +
geom_boxplot() +
ggtitle("Product Related Pages Visited by Reveue"),
ncol = 2
)
To build the grid, we use the gridExtra package, whose grid.arrange() function arranges multiple plots on a single page.
This code creates a grid of two plots: grid.arrange() takes as arguments the plots you want to arrange; here, two ggplot() calls that each produce a box plot.
Within each plot, aes() maps variables in the data to visual properties: x = Revenue puts the Revenue variable on the x-axis, y = Informational and y = ProductRelated put those variables on the y-axis, and fill = Revenue colors the boxes by revenue outcome. geom_boxplot() draws the box plot, and ggtitle() sets the plot title.
The ncol = 2 argument specifies that the plots should be arranged in two columns.
Output:
The grid box plot above shows how revenue is associated with visitors who spent more time on the informational pages. On the right-hand side, it likewise shows the relationship between product-related pages and revenue: the more time visitors spent on product-related pages, the higher the chance of earning revenue from them.
We also checked for the Bounce Rate and Page Value with respect to revenue:
grid.arrange(
ggplot(df, aes(x = Revenue, y = BounceRates, fill = Revenue)) +
geom_boxplot() +
ggtitle("Impact of Bounce Rates on Revenue"),
ggplot(df, aes(x = Revenue, y = PageValues, fill = Revenue)) +
geom_boxplot() +
ggtitle("Imppact of PageValues on Revenue"),
ncol=2
)
Output:
For the bounce rate, where the value is close to 0, we see mostly TRUE, which means there is revenue; where the bounce rate is high, we see mostly FALSE. This implies we can expect revenue from customers who stayed longer on the website. Likewise, the higher the page values, the more the revenue, and vice versa.
Finally, we created two machine learning models: a linear regression model and a logistic regression model.
# Install the package if it's not already installed
if (!require(caret)) {
install.packages("caret")
}
# Load the package
library(caret)
# Split the data
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(df$Revenue, p = .8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
The above code installs the caret package if it is missing and loads it. set.seed() fixes the random number generator so the split is reproducible, and createDataPartition() then splits the dataset into two parts: 80% for training and 20% for testing.
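Since createDataPartition() performs stratified sampling on the outcome you pass it, it is worth confirming that the TRUE/FALSE balance of Revenue is similar in both partitions. A quick sanity check:
# Confirm the TRUE/FALSE proportions are similar in both partitions
prop.table(table(train$Revenue))
prop.table(table(test$Revenue))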
Then, we created a variable for the model as shown below:
# Fit a linear regression model
model <- lm(Revenue ~ ProductRelated + ProductRelated_Duration, data = train)
The linear regression has Revenue as the dependent variable and two independent variables: ProductRelated and ProductRelated_Duration.
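Because Revenue is a binary TRUE/FALSE outcome, this linear fit effectively acts as a linear probability model, so it is worth inspecting the coefficients and checking whether fitted values stray outside [0, 1]. A minimal inspection sketch:
# Inspect coefficients, standard errors, and R-squared
summary(model)
# Fitted values may fall outside [0, 1], a known limitation of
# linear probability models, which motivates the logistic model below
range(fitted(model))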
Plotting the Linear Regression Model:
plot(model)
Output:
We then finally created a Logistic Regression Model to check for the accuracy score:
# Fit the logistic regression model
model2 <- glm(Revenue ~ ProductRelated + ProductRelated_Duration, data = train, family = "binomial")
# Generate predictions for the test set
predictions <- predict(model2, newdata = test, type = "response")
# Classify the predictions
predicted_classes <- ifelse(predictions >= 0.5, 1, 0)
# Calculate the accuracy score
accuracy <- sum(predicted_classes == test$Revenue) / nrow(test)
The above logistic regression model lets us compute the accuracy of our Revenue predictions, which we then print:
print(paste("Accuracy: ", accuracy))
Output:
The accuracy of our logistic regression model is 0.84, meaning the model correctly predicts whether a visit results in TRUE or FALSE revenue about 84% of the time.
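Accuracy alone can flatter a model when one class dominates, as FALSE does here, so a confusion matrix is a useful complement. A hedged sketch using caret's confusionMatrix() with the objects created above:
# Confusion matrix: predicted classes versus actual outcomes
cm <- confusionMatrix(
  factor(predicted_classes, levels = c(0, 1)),
  factor(as.integer(test$Revenue), levels = c(0, 1)),
  positive = "1" # treat revenue-generating sessions as the positive class
)
print(cm)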
What next? Let us look at the decision tree classifiers since this is a classification problem:
# The decision tree
library(rpart)
library(rpart.plot)
decision_tree_model <- rpart(Revenue ~ ., data = df, method = 'class')
par(mar = c(1, 1, 1, 1)) # adjust margins
rpart.plot(decision_tree_model, main = 'Decision Tree Model Based on Revenue', box.palette = 'BuGn')
After importing the required dependencies, we fit the decision tree classifier on the data and finally plotted it as shown below.
Output:
From the above tree, whenever the page value is < 0.94, the visit is very likely to yield no revenue, and when it is greater than 0.94, the outcome is very likely TRUE; thus, there is revenue from the visitor.
With respect to the bounce rate, the expression "81e-0" is scientific notation: 81 multiplied by 10 to the power of 0, and since any number raised to the power of 0 is 1, "81e-0" simplifies to just 81. This implies that, in the units displayed on the tree, the bounce-rate value must be at least 81 for the branch leading to profit; anything less results in FALSE. And if the visit falls within the high-traffic months identified above, profit is TRUE; otherwise it is FALSE.
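One caveat: the tree above was fitted on the full dataset, so its apparent rules may be optimistic. A fairer check, sketched below under the same assumptions as the logistic model, is to refit on the training split and score the held-out test split:
# Refit the tree on the training split only, mirroring the logistic setup
tree_train <- rpart(Revenue ~ ., data = train, method = 'class')
# Predict classes for the held-out test set
tree_preds <- predict(tree_train, newdata = test, type = 'class')
# Accuracy on unseen data
tree_accuracy <- mean(as.character(tree_preds) == as.character(test$Revenue))
print(paste("Decision tree accuracy:", round(tree_accuracy, 3)))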
Recommendations:
1. Enrich Content: Provide high-quality, engaging content tailored to your target audience. Utilize diverse formats like text, images, videos, and infographics for visual appeal and easy consumption.
2. Enhance User Experience (UX): Design a user-friendly website with fast loading times, responsive mobile design, clear navigation menus, and a visually consistent layout.
3. Call to Action (CTA): Include clear and compelling CTAs strategically placed throughout the site, guiding visitors to desired actions. Use persuasive language for increased engagement.
4. Mobile-First Approach: Prioritize mobile-friendliness with a responsive design, adapting to different screen sizes and touch interactions due to the significant portion of mobile web traffic.
Conclusion:
The blog delves into the world of the R programming language, emphasizing its significance in data science and analysis. I share personal experiences, from childhood struggles with pronouncing the letter "R" to embracing the R language passionately during a data science project. Key points covered include the definition of R, its development history, and the global preferences for R and Python among data scientists.
The blog discusses the importance of learning R, citing its large global community, and highlights a unique perspective by presenting a map showcasing R activity worldwide. Notably, New Zealand, the birthplace of R, ranks second in per-capita R activity. The post also explores website traffic analysis, covering aspects such as user engagement, bounce rates, conversion, and revenue.
Detailed R code snippets are provided to demonstrate data analysis, visualization, and the creation of machine learning models. We showcase various plots and charts to interpret and present findings, addressing questions related to visitor behavior, revenue sources, and the impact of different factors on user engagement.
We conclude by offering recommendations for website improvement, emphasizing the importance of enriched content, enhanced user experience, compelling calls to action, and a mobile-first approach. Overall, the blog serves as an informative and practical guide, combining personal anecdotes with technical insights into the world of data science using the R language.
Embarking on your data-driven journey with R programming is truly inspiring! Remember, as Bruce Lee once said, "Be like water, my friend." Adapt, learn, and grow in the world of #DataScience and #RStats. Your grit and dedication light the path for others; keep pushing the boundaries of what's possible! #Inspiration #GrowthMindset #TechJourney