Markov Chain-Based Web User Journey Prediction
Dr. Tuhin Banik
Founder of ThatWare, Forbes Select 200 | TEDx & BrightonSEO Speaker | Enterprise, Local & International SEO Expert | 100 Influential Tech Leaders | Innovated NLP & AI-driven SEO | Awarded Clutch Global Frontrunner in SEO
What is a Markov Chain?
In the context of websites, a Markov Chain helps us understand how users move between different pages on the site. Imagine a visitor is on the homepage of a website. Based on past user data, we can estimate how likely this visitor is to go to the “Services” page, the “Contact” page, or any other page next. The Markov Chain doesn’t care how the visitor arrived at the homepage; the prediction for the next page depends only on the page the visitor is on now. This sequence of movements between pages is what a Markov Chain model captures.
How Does It Work?
Markov Chains are based on probabilities. For example, if users are on a website’s homepage, there might be a 40% chance they’ll go to the “About Us” page, a 30% chance they’ll go to the “Products” page, and a 30% chance they’ll go to “Contact Us.” These probabilities are calculated from historical data, such as how people have navigated the site.
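As a minimal sketch of how such probabilities could be calculated from historical data, the snippet below counts page-to-page moves in a handful of made-up sessions and divides each count by the total for its starting page. The session list and the function name transition_probabilities are illustrative assumptions, not data or code from a real site.

```python
from collections import Counter, defaultdict

# Hypothetical sessions: each list is the ordered pages one visitor viewed.
sessions = [
    ["Home", "About Us", "Contact Us"],
    ["Home", "Products", "Contact Us"],
    ["Home", "About Us"],
    ["Home", "Contact Us"],
    ["Home", "Products"],
]

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Pair every page with the page that followed it in the same session.
        for current_page, next_page in zip(pages, pages[1:]):
            counts[current_page][next_page] += 1
    probabilities = {}
    for current_page, next_counts in counts.items():
        total = sum(next_counts.values())
        probabilities[current_page] = {
            page: count / total for page, count in next_counts.items()
        }
    return probabilities

probs = transition_probabilities(sessions)
print(probs["Home"])
```

With these five sessions, two of the five moves out of "Home" go to "About Us", so that transition gets a probability of 0.4.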
Use Cases of Markov Chains
· Predicting User Behavior: By analyzing patterns of how people move from one webpage to another, businesses can predict what users are likely to do next. For instance, if many users who visit the “Products” page later buy something, the website can make adjustments to make this process easier.
· Web Navigation: To improve a website, you can use a Markov Chain to see how users typically navigate. It helps optimize page layouts or menus, guiding users to the most important pages.
· Other Applications:
    · Stock Market Predictions: Used to predict price movements based on past trends.
    · Speech Recognition: Used to predict the next word in speech based on the current word.
    · Weather Forecasting: Used to predict weather patterns based on current conditions.
Real-Life Implementations
· Google’s PageRank: Google’s PageRank algorithm uses a form of Markov Chain to rank webpages. It calculates the probability of a user landing on a page by following links from other pages. The more inbound links a page has, and the more important the pages linking to it, the higher the probability a user will land on it.
· Recommender Systems: Streaming services like Netflix or YouTube can use Markov Chain-style models to recommend what video or movie you might watch next based on what you are watching now.
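To make the PageRank idea above more concrete, here is a simplified sketch of the "random surfer" calculation on a tiny, made-up link graph. The page names, the damping value of 0.85, and the function name pagerank are assumptions for illustration; this is not Google's production algorithm.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "Home": ["Products", "About"],
    "Products": ["Home"],
    "About": ["Home", "Products"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its outgoing links."""
    pages = list(links)
    rank = {page: 1 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base score for random jumps.
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

In this small graph, "Home" ends up with the highest score because both other pages link to it. The sketch assumes every page has at least one outgoing link.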
How Does It Predict User Behavior?
Let’s say a user is on a website’s homepage. The Markov Chain predicts the most likely page they’ll visit next by looking at the probabilities of where users typically go next (from historical data). For example, if 70% of users go to the “Products” page from the homepage, the Markov Chain will predict that for future users.
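A minimal sketch of this prediction step: given the probabilities for where users go from the homepage, pick the page with the highest one. The numbers and the helper name most_likely_next are hypothetical and only mirror the 70% example above.

```python
# Hypothetical transition probabilities out of the homepage.
home_transitions = {"Products": 0.70, "About Us": 0.20, "Contact Us": 0.10}

def most_likely_next(transitions):
    """Return the next page with the highest transition probability."""
    return max(transitions, key=transitions.get)

predicted = most_likely_next(home_transitions)
print(predicted)  # prints: Products
```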
How do you collect the data for the Markov Chain model?
Step-by-step guide to data collection:
1. Tracking User Behavior on a Website:
· To build a Markov Chain, you need data on how users navigate your website. This includes information like which pages they visit and in what order. You do not need to know the URLs explicitly, but you need the sequence of page visits.
· Tools like Google Analytics, Hotjar, or Mixpanel are commonly used to track website user behavior. These tools can give you reports on page visits, session times, and user flow between pages.
2. Set Up Analytics:
· Google Analytics: Sign up for Google Analytics, install a tracking code on the website, and configure it to collect data. It will record the pages a user visits and the order in which they visit them.
· User Flow Reports: In Universal Analytics, a feature called “User Flow” shows how users move between pages; in GA4, Path Exploration plays the same role. This is exactly the kind of data a Markov Chain model needs.
3. What kind of data do you need?
· Pages Visited: For example, Homepage → About Us → Products → Contact Us.
· Click Tracking: Which buttons or links users click on to navigate.
· Session Duration: How long users spend on each page before moving to another.
· Conversion Path: For e-commerce sites, this could be the path users follow before making a purchase.
Data Required for a Markov Chain Model
For a Markov Chain model, you need sequential data that shows the flow of user navigation across pages on the website. This data will allow you to calculate transition probabilities between pages, essential for predicting user behavior. Here’s a breakdown of the data you need:
· Page Paths (Sequential Data): You need the order in which users move from one page to another. This will tell you how users typically navigate your website. For example:
    · Homepage → Services → Contact Us
    · Homepage → Blog → Article → Services
· Sessions: Information about user sessions, such as how many pages a user visits in one session and in what order.
· Unique Page Views: How many users viewed each page, and how often.
· User Flow Data: This shows users’ paths from one page to another during their sessions.
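The sequential data described above can be sketched as follows: raw page-view records are grouped by session and sorted by time to produce one ordered page path per session. The record layout (session id, timestamp in seconds, page) and the function name build_page_paths are assumptions for illustration; a real export will have its own column names.

```python
from collections import defaultdict

# Hypothetical page-view log: (session_id, timestamp_in_seconds, page).
hits = [
    ("s1", 30, "Services"),
    ("s1", 0, "Homepage"),
    ("s2", 5, "Homepage"),
    ("s1", 95, "Contact Us"),
    ("s2", 40, "Blog"),
]

def build_page_paths(hits):
    """Group hits by session and order each session's pages by time."""
    by_session = defaultdict(list)
    for session_id, timestamp, page in hits:
        by_session[session_id].append((timestamp, page))
    return {
        session_id: [page for _, page in sorted(views)]
        for session_id, views in by_session.items()
    }

paths = build_page_paths(hits)
for session_id, pages in paths.items():
    print(session_id, " -> ".join(pages))
```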
How to Find User Flow and Page Path Data in Google Analytics 4 (GA4)
Here’s a step-by-step guide to finding the equivalent data you need in GA4:
Step 1: Navigate to the “Pages and Screens” Report
In GA4, the Behavior Flow has been replaced with detailed page path reports that you can access via the “Pages and Screens” section.
1. On the left-hand menu, open Reports and, under Engagement, select “Pages and screens”.
This will show you the pages users visit on your website, along with engagement metrics like views, average engagement time, and event count.
Step 2: Customize the Report to Show Page Transitions
The “Pages and Screens” report shows details about how users interact with specific pages. However, to better understand page transitions, follow these steps:
1. Click the “+ Add Comparison” button at the report’s top.
Step 3: Export the Data
Once you’ve customized the report to show page visits and interactions, you can export it:
1. Click the Export button at the top-right corner of the report.
2. Select CSV to download the data.
Step 4: Analyzing Page Transitions
While GA4 does not provide a visual Behavior Flow like Universal Analytics, the Pages and Screens report and the Event reports can still give insights into how users move through the site, which is important for building a Markov Chain model.
Next Steps for Collecting Sequential Data:
To complete the data collection needed for the Markov Chain model:
1. Go to GA4’s Path Exploration:
· Navigate to Explore on the left-hand side of the GA4 dashboard.
· Choose Path Exploration.
· This report will show users’ paths from one page to another.
2. Export the Path Data:
Step 4 Explanation:
In a Markov Chain model, we deal with states (also known as events) representing different actions or stages users go through. For example, on a website, typical states might be actions like:
· session_start (when a user starts browsing),
· page_view (when a user views a specific page),
· scroll (when a user scrolls down the page),
· click (when a user clicks on a button or link).
In the Path Exploration dataset, these states are listed under a column named From / To, and we need to extract this column to create our list of states. This list will be used to understand and simulate how users move between these states on the website.
Code Breakdown:
Example with Sample Data:
Let’s assume our Path Exploration Dataset looks like this:
When we run this code:
The result (states) will be:
This list now contains all the possible states (or events) users might transition through on the website. Based on the transition probabilities provided in the rest of the dataset, we will use this list to simulate how users navigate from one state to another.
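A minimal sketch of this extraction step, assuming the Path Exploration export is a CSV whose first column is named From / To. The sample text stands in for a real exported file. The first two data rows reuse the raw probabilities quoted later in this article; the remaining rows are made-up placeholders.

```python
import csv
import io

# Hypothetical export; in practice this would be read from the downloaded CSV file.
sample_csv = """From / To,session_start,page_view,first_visit,scroll,click
session_start,0.00,0.95,0.00,0.03,0.01
page_view,0.18,0.00,0.80,0.06,0.00
first_visit,0.00,0.00,0.00,0.00,0.00
scroll,0.06,0.05,0.00,0.00,0.02
click,0.04,0.00,0.00,0.00,0.00
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# The "From / To" column holds one state name per row.
states = [row["From / To"] for row in rows]
print(states)
```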
What is a Transition Matrix?
Code Breakdown:
After running this code, we’ll have a transition matrix, a grid of numbers representing the probabilities of moving from one state to another.
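One way this step could look, as a sketch rather than the article's exact code: read the export, drop the From / To label column, and turn the remaining numbers into a NumPy array. The sample CSV is hypothetical. Its first two rows use the raw probabilities discussed in this article, and the other rows are assumed values.

```python
import csv
import io

import numpy as np

# Hypothetical export; a real one would be opened from disk instead.
sample_csv = """From / To,session_start,page_view,first_visit,scroll,click
session_start,0.00,0.95,0.00,0.03,0.01
page_view,0.18,0.00,0.80,0.06,0.00
first_visit,0.00,0.00,0.00,0.00,0.00
scroll,0.06,0.05,0.00,0.00,0.02
click,0.04,0.00,0.00,0.00,0.00
"""

reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)  # column names, starting with "From / To"

# Skip the first cell of every row (the state label) and keep only the numbers.
transition_matrix = np.array(
    [[float(cell) for cell in row[1:]] for row in reader]
)

print(transition_matrix.shape)  # prints: (5, 5)
```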
Example with Sample Data:
Let’s assume we have the following Path Exploration Dataset:
Step-by-Step Process:
After dropping the column, the data looks like this:
What Does This Transition Matrix Represent?
Each row in the transition matrix represents one state (or action) the user is in, and each column represents the next state the user might move to. The number in each cell is the probability that the user will move from the current state to the next.
For example:
In the first row (for session_start), the probabilities are [0.00, 0.95, 0.00, 0.03, 0.01]. This means:
· There’s a 95% chance that after session_start, the user will go to page_view.
· There’s a 3% chance that after session_start, the user will go to scroll.
· There’s a 1% chance that after session_start, the user will go to click.
In the second row (for page_view), the probabilities are [0.18, 0.00, 0.80, 0.06, 0.00]. This means:
· There’s an 18% chance that after page_view, the user will go back to session_start.
· There’s an 80% chance that after page_view, the user will move to first_visit.
· There’s a 6% chance that after page_view, the user will go to scroll.
Note that these raw rows sum to 0.99 and 1.04 rather than exactly 1, which is why the rows are normalized in a later step.
Why is this Important?
The transition matrix is the core of the Markov Chain model. It tells us the likelihood of a user moving from one state to another, allowing us to predict user behavior.
Why is this step important?
In a Markov Chain model, the transition matrix and the list of states must match in size because:
· Each row in the transition matrix represents a state (like session_start, page_view).
· Each column represents the next state a user could transition to.
· The transition matrix should have the same number of rows and columns as there are states. Otherwise, the model won’t be able to properly simulate the transitions.
This step ensures that the matrix is square (same number of rows and columns) and that it corresponds to the number of states in the list.
Code Breakdown:
1. transition_matrix.shape[0]:
2. len(states):
3. if transition_matrix.shape[0] != len(states)::
4. transition_matrix[:len(states), :len(states)]:
Step-by-Step Process:
1. Before the Check:
2. Checking the Matrix Size:
3. Adjusting the Matrix:
Now, the transition matrix has 5 rows and 5 columns, matching the 5 states (session_start, page_view, first_visit, scroll, and click).
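The size check described above can be sketched like this, using the expressions named in the code breakdown. The 6x6 starting matrix is a hypothetical stand-in for an export that carries one extra row and column. Note that trimming only helps when the matrix is larger than the state list.

```python
import numpy as np

states = ["session_start", "page_view", "first_visit", "scroll", "click"]

# Hypothetical oversized matrix: one row and one column more than there are states.
transition_matrix = np.zeros((6, 6))

if transition_matrix.shape[0] != len(states):
    # Keep only the first len(states) rows and columns so the matrix is square
    # and lines up with the list of states.
    transition_matrix = transition_matrix[:len(states), :len(states)]

print(transition_matrix.shape)  # prints: (5, 5)
```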
Why is this step important?
In a Markov Chain model, each row of the transition matrix represents the probabilities of moving from one state to another. The sum of each row should equal 1, meaning the user must move to some next state. If the sum of a row is 0, it means the model doesn’t know where the user might go next, which will cause issues when simulating the user’s journey.
What does this step do?
· If a row sums to zero, the model cannot make a decision about where the user should go next. This step assigns equal probabilities to all possible states when this happens.
· If a row sums to something other than 1, we normalize it, meaning we adjust the values so that the row sums to exactly 1, representing valid probabilities.
Code Breakdown:
1. transition_matrix.sum(axis=1):
· axis=1 means we sum each row of the transition matrix. Each row represents the probabilities of moving from one state to all other states.
· For example, in one row, we might have the probabilities for moving from session_start to page_view, scroll, click, etc.
2. enumerate(transition_matrix.sum(axis=1)):
· We use enumerate to loop through the rows and get both the index (i) and the sum of the row (row_sum).
· This allows us to check if any row sums to zero.
3. if row_sum == 0::
4. transition_matrix[i] = np.ones(len(states)) / len(states):
· If a row sums to zero, we assign equal probabilities to all states in that row. We create an array of ones (np.ones(len(states))), where each value represents an equal probability, and divide by the number of states (len(states)) so that the row sums to 1.
· This ensures the user has an equal chance of moving to any state.
5. else: transition_matrix[i] = transition_matrix[i] / row_sum:
Example with Sample Data:
Let’s assume we have the following transition matrix before we perform Step 7:
Step-by-Step Process:
1. Calculate Row Sums:
We calculate the sum of each row:
2. Fix the Rows:
After this, Row 3 becomes [0.2, 0.2, 0.2, 0.2, 0.2].
Now, each state has a 20% chance of being visited, and the sum of the row is 1.
After normalizing, Row 5 becomes [1.0, 0.0, 0.0, 0.0, 0.0].
This means that after reaching this state, the only possible next state is the first one (100% chance).
3. Resulting Transition Matrix: After fixing and normalizing the rows, the transition matrix looks like this:
Now, all rows sum to 1, meaning the probabilities are valid for each state transition.
Why is This Important?
· If a row sums to zero, it means there’s no information about where the user will go next, which will cause the simulation to fail. We fix this by assigning equal probabilities.
· If a row sums to something other than 1, the probabilities are not valid (they should always sum to 1 in a Markov Chain). We fix this by normalizing the row, which adjusts the values so that they sum to 1.
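The fix-and-normalize step can be sketched as below, following the expressions listed in the code breakdown. The first two rows are the raw values quoted earlier in this article. The first_visit row is assumed to be empty, and the scroll and click rows are assumed raw values chosen for illustration.

```python
import numpy as np

states = ["session_start", "page_view", "first_visit", "scroll", "click"]

transition_matrix = np.array([
    [0.00, 0.95, 0.00, 0.03, 0.01],  # session_start (raw values from the article)
    [0.18, 0.00, 0.80, 0.06, 0.00],  # page_view (raw values from the article)
    [0.00, 0.00, 0.00, 0.00, 0.00],  # first_visit: assumed to have no data
    [0.06, 0.05, 0.00, 0.00, 0.02],  # scroll (assumed raw values)
    [0.04, 0.00, 0.00, 0.00, 0.00],  # click (assumed raw values)
])

# The row sums are computed once, before any row is changed.
for i, row_sum in enumerate(transition_matrix.sum(axis=1)):
    if row_sum == 0:
        # No information for this state: spread the probability evenly.
        transition_matrix[i] = np.ones(len(states)) / len(states)
    else:
        # Rescale the row so its values sum to exactly 1.
        transition_matrix[i] = transition_matrix[i] / row_sum

print(np.round(transition_matrix, 6))
```

After this runs, the session_start row's 0.95 becomes 0.95 / 0.99, roughly 0.959596, and every row sums to 1.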
Overview:
· The function predicts a user’s journey based on a Markov Chain model.
· It takes a starting state (like session_start), and based on the transition probabilities from the matrix, it simulates the next steps the user might take (such as page_view, scroll, etc.).
· The function repeats this process for a specified number of steps.
Step-by-Step Explanation:
1. Find the Index of the Starting State:
What It Does: The function needs to know the position (or index) of the start_state (e.g., session_start) in the states list to work with the transition matrix.
Example:
If the start_state is session_start and states = ['session_start', 'page_view', 'scroll', 'click'], this will return current_state = 0 because session_start is at index 0 in the list.
2. Initialize the Journey (Tracking User’s History):
What It Does: It creates a list called state_history to keep track of the states (pages or actions) that the user visits during their journey. Initially, it contains only the start_state.
Example:
3. Loop Through the Steps:
4. Randomly Choose the Next State:
Example:
Then:
5. Add the Next State to the History:
Example:
6. Update the Current State:
Example:
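Putting the six steps above together, a simulation function could look like the sketch below. The function name simulate_user_journey, the seed parameter, and the 4x4 matrix values are assumptions for illustration; only the names start_state, states, current_state, and state_history come from the explanation above.

```python
import numpy as np

def simulate_user_journey(start_state, states, transition_matrix, steps, seed=None):
    """Simulate a sequence of states by sampling from the transition matrix."""
    rng = np.random.default_rng(seed)

    # Step 1: find the index of the starting state.
    current_state = states.index(start_state)
    # Step 2: start the history with the starting state.
    state_history = [start_state]

    # Step 3: repeat for the requested number of steps.
    for _ in range(steps):
        # Step 4: pick the next state using the current state's row as weights.
        next_state = rng.choice(len(states), p=transition_matrix[current_state])
        # Step 5: record the chosen state.
        state_history.append(states[next_state])
        # Step 6: move to the chosen state.
        current_state = next_state

    return state_history

states = ["session_start", "page_view", "scroll", "click"]

# Hypothetical probabilities; every row sums to 1.
transition_matrix = np.array([
    [0.00, 0.90, 0.05, 0.05],
    [0.10, 0.20, 0.50, 0.20],
    [0.20, 0.40, 0.10, 0.30],
    [0.30, 0.50, 0.20, 0.00],
])

journey = simulate_user_journey("session_start", states, transition_matrix, steps=5, seed=42)
print(" -> ".join(journey))
```

Because each step is chosen at random, different seeds produce different journeys. Passing a fixed seed simply makes a run repeatable.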
Example Walkthrough:
Let’s simulate a 5-step user journey with the following setup:
· States: ['session_start', 'page_view', 'scroll', 'click']
· Transition Matrix:
1. Step 1:
2. Step 2:
3. Step 3:
4. Step 4:
5. Step 5:
When you open the CSV file, it will look like this:
· Rows: Represent the current state or event where the user is (e.g., session_start, page_view).
· Columns: Represent the next possible state or event that the user might transition to.
· Values: Represent the probabilities of transitioning from the current state (row) to the next state (column).
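A CSV with this layout (current states as rows, next states as columns) could be written as in the sketch below. The three-state matrix is made up, and an in-memory buffer stands in for a real file; the file name in the comment is hypothetical.

```python
import csv
import io

states = ["session_start", "page_view", "scroll"]

# Hypothetical, already-normalized transition probabilities.
transition_matrix = [
    [0.0, 0.9, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.4, 0.0],
]

# The buffer stands in for open("transition_matrix.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)

# Header row: a label cell followed by the next-state names.
writer.writerow(["From / To"] + states)
# One row per current state, with six decimal places per probability.
for state, row in zip(states, transition_matrix):
    writer.writerow([state] + [f"{probability:.6f}" for probability in row])

csv_text = buffer.getvalue()
print(csv_text)
```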
Interpretation of the Table:
1. Row 1: From session_start:
· 0.959596: There is a 95.96% chance that, after the session_start (when the user starts a session on the website), the user will go to the page_view state, meaning they will view a page.
· 0.030303: There is a 3.03% chance that the user will go to the scroll state after starting the session.
· 0.010101: There is a 1.01% chance that the user will immediately click on something (click) after starting the session.
· 0.000000: There is no chance (0%) that the user will move directly to the first_visit state after session_start.
2. Row 2: From page_view:
· 0.769231: There is a 76.92% chance that the user will go to first_visit after viewing a page.
· 0.173077: There is a 17.31% chance that the user will go back to session_start after viewing a page.
· 0.057692: There is a 5.77% chance that the user will scroll on the page after viewing it.
· 0.000000: There is no chance that the user will click directly from the page_view state.
3. Row 3: From first_visit:
4. Row 4: From scroll:
· 0.461538: There is a 46.15% chance that, after scrolling, the user will go to the session_start state.
· 0.384615: There is a 38.46% chance that the user will go to the page_view state after scrolling.
· 0.153846: There is a 15.38% chance that the user will click on something after scrolling.
· 0.000000: The chance of transitioning to first_visit or staying in the scroll state is zero.
5. Row 5: From click:
What Can We Understand from This?
Highly Likely Transitions:
· Users are very likely to move from session_start to page_view (95.96% chance), meaning users typically view a page soon after starting a session.
· After scrolling, users often go back to session_start (46.15%) or view another page (38.46%).
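The highly likely transitions listed above can also be picked out programmatically. The sketch below uses only the three rows whose values are spelled out in this article; the function name top_transitions and the 0.3 threshold are assumptions for illustration.

```python
states = ["session_start", "page_view", "first_visit", "scroll", "click"]

# Normalized rows quoted in the table interpretation, keyed by the current state.
rows = {
    "session_start": [0.000000, 0.959596, 0.000000, 0.030303, 0.010101],
    "page_view": [0.173077, 0.000000, 0.769231, 0.057692, 0.000000],
    "scroll": [0.461538, 0.384615, 0.000000, 0.000000, 0.153846],
}

def top_transitions(states, rows, threshold=0.3):
    """List (from, to, probability) triples at or above the threshold, strongest first."""
    found = [
        (source, states[j], probability)
        for source, row in rows.items()
        for j, probability in enumerate(row)
        if probability >= threshold
    ]
    return sorted(found, key=lambda item: item[2], reverse=True)

likely = top_transitions(states, rows)
for source, target, probability in likely:
    print(f"{source} -> {target}: {probability:.2%}")
```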
Unlikely Transitions:
Equal Probabilities in first_visit:
Simulated User Journey Output:
What does this output represent?
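This output represents one possible path, or sequence of actions, that a user might follow on the website, based on the data provided. Because each step is drawn at random from the transition probabilities, likely transitions appear often, but the path is not guaranteed to be the single most likely one. Each step in the sequence (such as session_start, page_view, click, etc.) corresponds to a specific action the user performs while browsing the website.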
Let’s break down each term:
session_start:
page_view:
first_visit:
click:
Step-by-Step Explanation of the Output:
session_start:
page_view:
first_visit:
session_start (again):
page_view:
session_start (repeated):
page_view (repeated):
first_visit (repeated):
click:
session_start -> page_view:
What does this mean for the website owner?
How should the website owner interpret the steps?
1. Multiple Sessions:
2. Page Views:
Users frequently view pages, but the site owner should check which pages are being viewed the most. If important pages (like product or service pages) aren’t being viewed enough, the owner might need to improve navigation or add more engaging content on those pages.
3. First-Time Visitors:
4. User Clicks:
What actions should the website owner take?
Analyze and Optimize Key Pages:
Improve the User Experience for First-Time Visitors:
Encourage More Clicks:
Session Duration:
Example Recommendations for the Website Owner:
1. Make the homepage more engaging: Add clear calls to action so users know what to do next.
2. Ensure the website is mobile-friendly: Many users may be coming from mobile devices, so it’s essential that the site looks good and is easy to navigate on mobile.
3. Optimize key landing pages: Improve the content on high-traffic pages like product or service pages to encourage users to click and explore further.
4. Use Google Analytics: Continue tracking user behavior to monitor improvements and understand how users engage with the site after making changes.
Read Full Article: https://thatware.co/markov-chain-based-web-user-journey-prediction/