Web Traffic Data Analysis for an Online Store
Nasir Yusuf Ahmad
Transforming Data into Actionable Insights for Informed Decision-Making | SDG 3 & 4 | Co-Founder HealStoc | Human Capital Development Advocate | AI, ML, LLM Engineer | Economist | Author | Certified Data Scientist
“In the end you should only measure and look at the numbers that drive action, meaning that the data tells you what you should do next.”— Alexander Peiniger.
The weekend is here again. Today, this blog will get us talking about R, R, R, and R. Yes, R.
Originally, this article was a class project at the Clickon Kaduna Data Science Fellowship Programme for the R module. My four team members and I presented on website traffic analysis for an online store. We did exploratory data analysis and predictive analysis, and built machine learning models to evaluate how accurately we could predict outcomes.
Story Time?
I remember, as a child, I had difficulty pronouncing the letter "R". My older brother always had to insist that I pronounce it correctly, which I eventually did over time.
When we started the R class, at first, I must confess, it was boring. But then I remembered the difficulty I had pronouncing R as a child; I did not want that to become a pattern, so I let the reluctance go and embraced the tool passionately.
Let's get started.
What is R?
Simply put, R is a language and environment for statistical computing and graphics. Some argue that R is the language of Data Science.
R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The initial version of R was created in 1993, and it was later substantially enhanced by a large group of developers globally. The R language is open-source, meaning that its source code is freely available, and users are encouraged to contribute to its development.
Why Learn R?
According to a survey of Kagglers analyzed by David Smith, Python and R are the preferred tools, as demonstrated below:
About 7,955 respondents participated in the survey: 76.3% of them voted in favor of Python, while 59.2% voted for R. So this is an open secret: most data scientists love to use Python, while far fewer use R, SQL, and other related languages.
However, R has some pros and cons in comparison to Python. For instance, R has one of the largest data science communities worldwide. Consider the map below:
Although R boasts a global user base exceeding two million, the focus of R activity varies across regions. The team at Rapporter, who published the map, aimed to capture this by amalgamating data on the geographical distribution of R Foundation members, package authors, package downloaders, and user group members.
This comprehensive dataset is used to generate a country-wide 'R usage score,' which is further adjusted for population. The resulting scores are visually represented on the map, with red indicating lower R activity and blue signifying higher engagement.
When considering R activity per capita, the leading country might defy expectations: Switzerland tops the list, while New Zealand, the birthplace of R, holds the second position. The following are the leading countries in terms of per-capita R activity:
1- Switzerland
2- New Zealand (birthplace of R)
3- Austria (home of the R Foundation)
4- Ireland
5- United States Minor Outlying Islands*
6- United States
7- Australia
8- Singapore
9- Denmark
10- United Kingdom
11- Canada.
The essence of this analogy is to highlight the relative importance of R globally. If you plan to migrate to one of the countries listed above, then you should get excellent at using R for data science or analysis, because it could be the most important hoe you take to the farm.
So now that we have established our case, let's write some code.
We will first load the libraries we will need. For that, we run the following code:
library(tidyverse) # loads ggplot2, dplyr, and related packages
library(ggplot2) # already included in the tidyverse, but explicit loading does no harm
# We create a variable called "df" holding the dataset
df <- read.csv('Online_sales.csv')
# We inspect the first six rows of the data frame
head(df)
Since our dataset is for website traffic analysis, we were very interested in certain fundamentals, such as: Which customer visits resulted in revenue for the firm? What do we know about the bounce rates and exit rates? How do these variables affect the firm's chances of profitability? These and many more questions were asked and answered in the analysis below.
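Before defining these variables, it can help to take a quick look at the data itself. Below is a minimal inspection sketch using base R, assuming the columns follow the names used throughout this analysis:
# Inspect column names, types, and sample values
str(df)
# Summary statistics for every column
summary(df)
# Count missing values per column
colSums(is.na(df))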
Now, we will proceed to define the variables with respect to the questions we asked above.
# 1. User Engagement
avg_pages_visited <- mean(df$ProductRelated) # mean product-related pages viewed per session
avg_duration <- mean(df$ProductRelated_Duration) # mean time spent on those pages, in seconds
print(avg_duration)
print(avg_pages_visited)
The above commands create two variables: the average number of product-related pages visited per session and the mean duration spent on them, in seconds.
The average duration on the site is about 1,194 seconds, that is, roughly 20 minutes, while the average number of pages visited was about 31.73.
# 2. Bounce and Exit Rates
overall_bounce_rate <- mean(df$BounceRates)
# Mean exit rate for each distinct page value
exit_rates_by_page <- tapply(df$ExitRates, df$PageValues, mean)
We also checked the mean of the bounce rates, which is about 0.02. The lower the bounce rate, the higher the chance of a visit resulting in revenue.
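To make this concrete, we can compare average bounce and exit rates between revenue-generating and non-revenue sessions. A brief sketch using dplyr (attached with the tidyverse earlier); this grouping step is an addition for illustration, not part of the original project code:
# Compare average bounce and exit rates by revenue outcome
df %>%
  group_by(Revenue) %>%
  summarise(
    mean_bounce = mean(BounceRates), # average bounce rate per outcome
    mean_exit = mean(ExitRates), # average exit rate per outcome
    sessions = n() # number of sessions per outcome
  )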
# 3. Conversion and Revenue
conversion_rate <- sum(df$Revenue) / nrow(df) # share of sessions that ended in revenue
page_value_correlation <- cor(df$PageValues, df$Revenue) # linear association with revenue
The first line above calculates the conversion rate as the number of revenue-generating sessions divided by the total number of rows, while the second line computes the correlation between PageValues and Revenue.
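For readability, these figures can be printed as a percentage and a rounded coefficient. An illustrative snippet, assuming Revenue was read in as logical TRUE/FALSE values so that sum() counts the revenue-generating sessions:
# Print the conversion rate as a percentage and the correlation to three decimals
cat(sprintf("Conversion rate: %.1f%%\n", 100 * conversion_rate))
cat(sprintf("PageValues-Revenue correlation: %.3f\n", page_value_correlation))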
Below, we will plot some graphs from the data using the following syntax:
# Plot 1: Monthly Visits
ggplot(df, aes(x = Month, fill = Revenue)) +
geom_bar() +
theme_minimal() +
labs(x = "Month", y = "Visits", title = "Monthly Visits")
Output:
We also checked for monthly revenue:
# Plot 2: Monthly Revenue
ggplot(df, aes(x = Month, y = Revenue, fill = Revenue)) +
geom_bar(stat = "summary", fun = sum) +
theme_minimal() +
labs(x = "Month", y = "Revenue", title = "Monthly Revenue")
Output:
There is clearly a strong relationship between monthly visits and monthly revenue in the two charts above. The months with the highest visits were May and November; in fact, November has the highest revenue. This is easy to understand: Black Friday falls in November across much of the world, hence the traffic and the substantial increase in revenue.
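To quantify the monthly pattern rather than just eyeballing the bars, a grouped summary of the conversion rate per month can be computed. A sketch along these lines, again assuming Revenue is logical:
# Conversion rate by month: the share of sessions that ended in revenue
df %>%
  group_by(Month) %>%
  summarise(
    visits = n(),
    conversions = sum(Revenue),
    conversion_rate = mean(Revenue)
  ) %>%
  arrange(desc(conversion_rate))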
Moving forward, we take an interest in the types of visitors who frequent the website:
# Plot 3: Visitor Types
ggplot(df, aes(x = VisitorType, fill = Revenue)) +
geom_bar() +
theme_minimal() +
labs(x = "Visitor Type", y = "Visits", title = "Visitor Types")
Output:
New visitors are users who just signed up on the platform, while returning visitors are existing customers who come back to the site, probably to make purchases. As you can see above, most revenue comes from returning visitors.
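The same grouped-summary idea makes the visitor-type difference concrete. A brief sketch comparing conversion rates across visitor types:
# Conversion rate by visitor type
df %>%
  group_by(VisitorType) %>%
  summarise(
    visits = n(),
    conversion_rate = mean(Revenue)
  )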
Type of Revenue:
ggplot(df, aes(x = Revenue, fill = Revenue)) +
geom_bar() +
ggtitle("Revenue Chart")
Output:
This is the genesis of the problem: of the over 10,000 visits to the website, only about 25% resulted in TRUE, meaning revenue was generated from those visits. The goal is to see what we can do to convert more of the other 75% that were FALSE.
As for the questions of regional visits, OS type, and browser preference, we dropped those charts because of a lack of metadata; in other words, we do not have clear information about what the coded values of those three variables mean.
Therefore, we plotted a grid plot for some variables as seen below:
library(gridExtra)
# Create a grid of multiple plots for different aspects
grid.arrange(
ggplot(df, aes(x = Revenue, y = Informational, fill = Revenue)) +
geom_boxplot() +
ggtitle("Informational Pages Visited by Revenue"),
ggplot(df, aes(x = Revenue, y = ProductRelated, fill = Revenue)) +
geom_boxplot() +
ggtitle("Product Related Pages Visited by Reveue"),
ncol = 2
)
To build the grid, we use the gridExtra package, whose grid.arrange() function arranges multiple plots on a single page.
This code creates a grid of two plots: grid.arrange() takes as arguments the plots you want to arrange; here, two ggplot() calls that each produce a box plot.
Within each plot, aes() maps variables in the data to visual properties: x = Revenue puts the Revenue variable on the x-axis, y = Informational and y = ProductRelated put those variables on the y-axis, and fill = Revenue colors the boxes by revenue outcome. geom_boxplot() draws the box plot, and ggtitle() sets the plot title.
The ncol = 2 argument specifies that the plots should be arranged in two columns.
Output:
The grid box plot above shows how revenue is associated with visitors who spent more time on the informational pages. On the right-hand side, it likewise shows the relationship between product-related pages and revenue: the more time visitors spent on product-related pages, the higher the chance of earning revenue from them.
We also checked for the Bounce Rate and Page Value with respect to revenue:
grid.arrange(
ggplot(df, aes(x = Revenue, y = BounceRates, fill = Revenue)) +
geom_boxplot() +
ggtitle("Impact of Bounce Rates on Revenue"),
ggplot(df, aes(x = Revenue, y = PageValues, fill = Revenue)) +
geom_boxplot() +
ggtitle("Imppact of PageValues on Revenue"),
ncol=2
)
Output:
For the bounce rate, where the value is close to 0, we see mostly TRUE, which means there is revenue; where the bounce rate is high, we see mostly FALSE. This implies we can expect revenue from customers who stayed longer on the website. Likewise, the higher the page values, the more the revenue, and vice versa.
Finally, we created two machine learning models: a linear regression model and a logistic regression model.
# Install the package if it's not already installed
if (!require(caret)) {
install.packages("caret")
}
# Load the package
library(caret)
# Split the data
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(df$Revenue, p = .8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
The above code installs the caret package if it is missing and loads it. set.seed() fixes the random number generator so the split is reproducible, and createDataPartition() then splits the dataset into two parts: 80% for training and 20% for testing.
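Since createDataPartition() performs stratified sampling on the outcome you pass it, it is worth confirming that the TRUE/FALSE balance of Revenue is similar in both partitions. A quick sanity check:
# Confirm the TRUE/FALSE proportions are similar in both partitions
prop.table(table(train$Revenue))
prop.table(table(test$Revenue))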
Then, we created a variable for the model as shown below:
# Fit a linear regression model
model <- lm(Revenue ~ ProductRelated + ProductRelated_Duration, data = train)
The linear regression has Revenue as the dependent variable and two independent variables: ProductRelated and ProductRelated_Duration.
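Because Revenue is a binary TRUE/FALSE outcome, this linear fit effectively acts as a linear probability model, so it is worth inspecting the coefficients and checking whether fitted values stray outside [0, 1]. A minimal inspection sketch:
# Inspect coefficients, standard errors, and R-squared
summary(model)
# Fitted values may fall outside [0, 1], a known limitation of
# linear probability models, which motivates the logistic model below
range(fitted(model))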
Plotting the Linear Regression Model:
plot(model)
Output:
We then finally created a Logistic Regression Model to check for the accuracy score:
# Fit the logistic regression model
model2 <- glm(Revenue ~ ProductRelated + ProductRelated_Duration, data = train, family = "binomial")
# Generate predictions for the test set
predictions <- predict(model2, newdata = test, type = "response")
# Classify the predictions
predicted_classes <- ifelse(predictions >= 0.5, 1, 0)
# Calculate the accuracy score
accuracy <- sum(predicted_classes == test$Revenue) / nrow(test)
The above logistic regression model lets us compute the accuracy of our Revenue predictions, which we then print:
print(paste("Accuracy: ", accuracy))
Output:
The accuracy of our logistic regression model is 0.84, meaning the model correctly predicts whether a visit results in TRUE or FALSE revenue about 84% of the time.
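Accuracy alone can flatter a model when one class dominates, as FALSE does here, so a confusion matrix is a useful complement. A hedged sketch using caret's confusionMatrix() with the objects created above:
# Confusion matrix: predicted classes versus actual outcomes
cm <- confusionMatrix(
  factor(predicted_classes, levels = c(0, 1)),
  factor(as.integer(test$Revenue), levels = c(0, 1)),
  positive = "1" # treat revenue-generating sessions as the positive class
)
print(cm)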
What next? Let us look at the decision tree classifiers since this is a classification problem:
# The decision tree
library(rpart)
library(rpart.plot)
decision_tree_model <- rpart(Revenue ~ ., data = df, method = 'class')
par(mar = c(1, 1, 1, 1)) # adjust margins
rpart.plot(decision_tree_model, main = 'Decision Tree Model Based on Revenue', box.palette = 'BuGn')
After importing the required dependencies, we fit the decision tree classifier on the data and finally plotted it as shown below.
Output:
From the above tree, whenever the page value is < 0.94, the visit is very likely to yield no revenue, and when it is greater than 0.94, the outcome is very likely TRUE; thus, there is revenue from the visitor.
With respect to the bounce rate, the expression "81e-0" is scientific notation: 81 multiplied by 10 to the power of 0, and since any number raised to the power of 0 is 1, "81e-0" simplifies to just 81. This implies that, in the units displayed on the tree, the bounce-rate value must be at least 81 for the branch leading to profit; anything less results in FALSE. And if the visit falls within the high-traffic months identified above, profit is TRUE; otherwise it is FALSE.
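One caveat: the tree above was fitted on the full dataset, so its apparent rules may be optimistic. A fairer check, sketched below under the same assumptions as the logistic model, is to refit on the training split and score the held-out test split:
# Refit the tree on the training split only, mirroring the logistic setup
tree_train <- rpart(Revenue ~ ., data = train, method = 'class')
# Predict classes for the held-out test set
tree_preds <- predict(tree_train, newdata = test, type = 'class')
# Accuracy on unseen data
tree_accuracy <- mean(as.character(tree_preds) == as.character(test$Revenue))
print(paste("Decision tree accuracy:", round(tree_accuracy, 3)))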
Recommendations:
1. Enrich Content: Provide high-quality, engaging content tailored to your target audience. Utilize diverse formats like text, images, videos, and infographics for visual appeal and easy consumption.
2. Enhance User Experience (UX): Design a user-friendly website with fast loading times, responsive mobile design, clear navigation menus, and a visually consistent layout.
3. Call to Action (CTA): Include clear and compelling CTAs strategically placed throughout the site, guiding visitors to desired actions. Use persuasive language for increased engagement.
4. Mobile-First Approach: Prioritize mobile-friendliness with a responsive design, adapting to different screen sizes and touch interactions due to the significant portion of mobile web traffic.
Conclusion:
The blog delves into the world of the R programming language, emphasizing its significance in data science and analysis. I share personal experiences, from childhood struggles with pronouncing the letter "R" to embracing the R language passionately during a data science project. Key points covered include the definition of R, its development history, and the global preferences for R and Python among data scientists.
The blog discusses the importance of learning R, citing its large global community, and highlights a unique perspective by presenting a map showcasing R activity worldwide. Notably, New Zealand, the birthplace of R, ranks second in per-capita R activity. The post also explores website traffic analysis, covering aspects such as user engagement, bounce rates, conversion, and revenue.
Detailed R code snippets are provided to demonstrate data analysis, visualization, and the creation of machine learning models. We showcase various plots and charts to interpret and present findings, addressing questions related to visitor behavior, revenue sources, and the impact of different factors on user engagement.
We conclude by offering recommendations for website improvement, emphasizing the importance of enriched content, enhanced user experience, compelling calls to action, and a mobile-first approach. Overall, the blog serves as an informative and practical guide, combining personal anecdotes with technical insights into the world of data science using the R language.
Embarking on your data-driven journey with R programming is truly inspiring! Remember, as Bruce Lee once said, "Be like water, my friend." Adapt, learn, and grow in the world of #DataScience and #RStats. Your grit and dedication light the path for others; keep pushing the boundaries of what's possible! #Inspiration #GrowthMindset #TechJourney