Strategic Website Planning: Using Python and NetworkX to Visualize & Compare Sitemap Trees
A piece of cake: creating sitemap trees in Python and NetworkX.


Website maintenance, frequent updates, or even a partial relaunch can sometimes lead to a drift from the original vision. However, in website planning, the sitemap and keywords are critical components that must remain a priority—especially if you're aiming to excel in SEO.

In the following example, I'll demonstrate how Python can be used not only to visualize a website's sitemap but also to compare it with an earlier version of the same site. This comparison reveals changes in the site structure, highlighting what has been removed and what has been added. To make this practical, I'll use a website I'm currently working on. The objective is to represent the site structure, including its parent-child relationships, as a sitemap tree for both the older and the newer version.

Our Objectives

  • Different categories will be distinguished by various shades of blue, adding a touch of style to the visualization.
  • Categories that have been removed will be highlighted in red in the older sitemap.
  • New categories introduced in the latest sitemap will be marked in green.
  • We’re excluding /de/ and /blog/ paths to avoid redundancy and because of their substantial volume—they're not the focus here.
  • Visualizing the depth of our site tree—aiming for a maximum of two or three levels for the most critical pages (SEO).

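So, let’s get down to business. We’ll be using the standard-library modules xml, urllib, and collections, plus the third-party libraries matplotlib, pydot, numpy, and, most important, networkx (https://networkx.org/). If you’re missing any of the third-party libraries, just go ahead and install them using pip install .... Note that pydot only works if the Graphviz software itself is also installed on your system.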

We’ll refer to our sitemaps as sitemap1.xml and sitemap2.xml. You can get these from your web host, or download them directly in your web browser.

Alright, let's start coding...

Step 1: Importing Required Libraries & Parsing Sitemaps

We begin by parsing the XML files of the two sitemaps using Python's xml.etree.ElementTree module. The ET.parse() function reads the XML files and constructs an ElementTree object, allowing us to traverse and manipulate the XML data.

For each sitemap, we extract the URLs and the associated paths. The paths are split into components (e.g., /category/subcategory/page becomes ['category', 'subcategory', 'page']).

import xml.etree.ElementTree as ET
import networkx as nx
import matplotlib.pyplot as plt
from urllib.parse import urlparse, unquote
from collections import defaultdict
import pydot
from networkx.drawing.nx_pydot import graphviz_layout
import numpy as np
import matplotlib.colors as mcolors

# Parse your first sitemap ... here I took the latest sitemap
tree = ET.parse(r'PATH/YOURSITEMAP')
root = tree.getroot()

# Parse the second sitemap ... here I took an old sitemap from my development environment
tree2 = ET.parse('/mnt/data/sitemap2.xml')
root2 = tree2.getroot()

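# Define the sitemap namespace. The sitemap protocol specifies this URI with http, not https;
# adjust it if your file declares a different xmlns, otherwise findall() returns nothing.
ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}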
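
To make the extraction step concrete, here is a minimal, self-contained sketch that parses a small inline sitemap instead of a file. The example URLs and the helper name extract_paths are made up for illustration, and the sketch assumes the sample declares the namespace http://www.sitemaps.org/schemas/sitemap/0.9.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical two-URL sitemap, inlined so the sketch runs without a file
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/category/subcategory/page/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# The prefix 'ns' is arbitrary; it only has to match the prefix used in the queries
NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_paths(root):
    """Return the path of every <loc> URL as a list of segments."""
    paths = []
    for url in root.findall('ns:url', NS):
        loc = url.find('ns:loc', NS).text.strip()
        path = urlparse(loc).path.strip('/')
        paths.append(path.split('/'))
    return paths

sample_root = ET.fromstring(SAMPLE_SITEMAP)
sample_paths = extract_paths(sample_root)
print(sample_paths)  # [['category', 'subcategory', 'page'], ['about']]
```

If findall() returns an empty list on your own file, the namespace URI in the dictionary most likely does not match the xmlns attribute of the urlset element.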

Step 2: Filtering the Paths

Next, we filter out paths that contain the segments "blog" and "de" (as mentioned in the beginning) from both sitemaps. This ensures that the visualization focuses only on relevant sections of the websites, excluding any paths related to blogs or German-language pages.

# Rebuild the tree structure excluding paths that contain a "blog" or "de" segment for sitemap 1
filtered_tree_dict = lambda: defaultdict(filtered_tree_dict)
filtered_url_tree = filtered_tree_dict()

for url in root.findall('ns:url', ns):
    loc = url.find('ns:loc', ns).text
    path = urlparse(loc).path.strip('/')
    parts = path.split('/')
    
    # Only add to the tree if neither "blog" nor "de" is in the path
    if "blog" not in parts and "de" not in parts:
        current_level = filtered_url_tree
        for part in parts:
            current_level = current_level[part]

# Rebuild the tree structure excluding paths that contain a "blog" or "de" segment for sitemap 2
filtered_tree_dict2 = lambda: defaultdict(filtered_tree_dict2)
filtered_url_tree2 = filtered_tree_dict2()

for url in root2.findall('ns:url', ns):
    loc = url.find('ns:loc', ns).text
    path = urlparse(loc).path.strip('/')
    parts = path.split('/')
    
    # Only add to the tree if neither "blog" nor "de" is in the path
    if "blog" not in parts and "de" not in parts:
        current_level = filtered_url_tree2
        for part in parts:
            current_level = current_level[part]        
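
Since the two loops above are identical apart from their input, they could be wrapped in one helper. The following is only a sketch of that idea: the names make_tree and build_filtered_tree, the excluded parameter, and the sample URLs are assumptions, and it works on a plain list of URL strings so that it runs on its own.

```python
from collections import defaultdict
from urllib.parse import urlparse

def make_tree():
    """Nested defaultdict: every missing key becomes a new, empty subtree."""
    return defaultdict(make_tree)

def build_filtered_tree(urls, excluded=("blog", "de")):
    """Build a nested dict of path segments, skipping excluded paths."""
    tree = make_tree()
    for loc in urls:
        # Unlike the loops above, empty segments are dropped here,
        # so the homepage ("/") adds no empty-string key to the tree
        parts = [p for p in urlparse(loc).path.split('/') if p]
        # Skip the URL if any of its segments is in the excluded list
        if any(part in excluded for part in parts):
            continue
        level = tree
        for part in parts:
            level = level[part]  # descending creates missing levels
    return tree

# Hypothetical URLs for illustration only
sample_urls = [
    "https://example.com/products/labels",
    "https://example.com/products/stickers",
    "https://example.com/blog/some-post",
    "https://example.com/de/produkte",
]
sample_tree = build_filtered_tree(sample_urls)
print(sorted(sample_tree))              # ['products']
print(sorted(sample_tree["products"]))  # ['labels', 'stickers']
```

Because the check compares whole segments, a path such as /design/ is kept even though it starts with the letters "de".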

Step 3: Constructing the Tree Structures

We then construct a hierarchical tree structure from the filtered paths using nested dictionaries. Each level of the tree corresponds to a segment of the path. For example, if a URL path is /category/subcategory/page, the tree structure would represent category as the parent, with subcategory as a child, and page as a child of subcategory.

We use a recursive function, add_edges_filtered(), to traverse the nested dictionaries and add edges to a directed graph (DiGraph) from the NetworkX library, representing the hierarchical relationships between the nodes.

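# Create one directed graph per filtered tree
filtered_G = nx.DiGraph()   # sitemap 1 (latest)
filtered_G2 = nx.DiGraph()  # sitemap 2 (older)

def add_edges_filtered(graph, d, parent=None):
    # Walk the nested dict and add a parent -> child edge for every level
    for k, v in d.items():
        if parent:
            graph.add_edge(parent, k)
        add_edges_filtered(graph, v, k)

add_edges_filtered(filtered_G, filtered_url_tree)
add_edges_filtered(filtered_G2, filtered_url_tree2)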
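
One thing to keep in mind: in the graph above, nodes are named after a single path segment, so two pages that share a segment name under different parents (say /products/faq and /services/faq) end up as the same node. A possible variation, shown here only as a sketch and not as the method used in this article, is to use the full path as the node id and keep the segment as a label. The function name tree_to_graph and the sample data are hypothetical.

```python
import networkx as nx

def tree_to_graph(tree, graph=None, parent_path=""):
    """Recursively add parent -> child edges, using full paths as node ids."""
    if graph is None:
        graph = nx.DiGraph()
    for segment, subtree in tree.items():
        node_id = f"{parent_path}/{segment}"
        graph.add_node(node_id, label=segment)  # short label for drawing
        if parent_path:
            graph.add_edge(parent_path, node_id)
        tree_to_graph(subtree, graph, node_id)
    return graph

# Hypothetical nested dict, shaped like the filtered trees above
sample_tree = {
    "products": {"faq": {}, "labels": {}},
    "services": {"faq": {}},
}
sample_graph = tree_to_graph(sample_tree)
print(sorted(sample_graph.nodes))
print(sorted(sample_graph.edges))
```

When drawing such a graph, the short names can be shown by passing labels=nx.get_node_attributes(sample_graph, "label") to nx.draw().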

Step 4: Assigning Colors & Highlighting New and Deleted Sites

We assign colors to the top-level categories (the first segment of the path) to visually distinguish them. Shades of teal blue are used for categories that exist in both sitemaps.

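To create the color map, we iterate through the nodes of the tree and assign a color to each category. The colors are taken from a list of ten shades sampled from Matplotlib's Blues colormap. Note that the graph nodes are single path segments, so the split on '/' in the code never finds a slash and every node is effectively treated as its own category.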

Next, we compare the categories between the two sitemaps: Green is used to highlight categories that are new: they exist in the first sitemap but are missing from the second.

Red is used to highlight categories that are present in the second sitemap but not in the first. This is done by comparing the category lists from both sitemaps. For each category, if it is missing in the other sitemap, we update the corresponding nodes' color in the graph.

# Generate teal blue shades for categories
teal_shades = [mcolors.to_hex(c) for c in plt.cm.Blues(np.linspace(0.3, 1, 10))]

# Generate a map of categories for both sitemaps
def get_category_colors(filtered_G, categories, teal_shades):
    color_map = []
    category_color_map = {}
    for node in filtered_G.nodes:
        top_category = node.split('/')[0] if '/' in node else node
        if top_category not in category_color_map:
            category_color_map[top_category] = teal_shades[len(category_color_map) % len(teal_shades)]
        color_map.append(category_color_map[top_category])
    return color_map, category_color_map

color_map1, category_color_map1 = get_category_colors(filtered_G, {}, teal_shades)
color_map2, category_color_map2 = get_category_colors(filtered_G2, {}, teal_shades)

# Categories in sitemap 1 that are not in sitemap 2 are new: mark their nodes green
new_categories = [c for c in category_color_map1 if c not in category_color_map2]
for idx, node in enumerate(filtered_G.nodes):
    if any(node.startswith(c) for c in new_categories):
        color_map1[idx] = 'green'

# Categories in sitemap 2 that are not in sitemap 1 were removed: mark their nodes red
removed_categories = [c for c in category_color_map2 if c not in category_color_map1]
for idx, node in enumerate(filtered_G2.nodes):
    if any(node.startswith(c) for c in removed_categories):
        color_map2[idx] = 'red'
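
At its core, the comparison is just two set differences. The sketch below shows that idea in isolation; the category names, the helper status_color, and the default hex color are made up for illustration. It matches names exactly, whereas the startswith() check above would also color a node called "products" when the new category is "product".

```python
# Hypothetical top-level categories from a new and an old sitemap
new_categories = {"products", "services", "pricing"}
old_categories = {"products", "services", "legacy-shop"}

added = new_categories - old_categories    # only in the new sitemap -> green
removed = old_categories - new_categories  # only in the old sitemap -> red
shared = new_categories & old_categories   # in both -> shade of blue

def status_color(category, highlight, highlight_color, default="#6baed6"):
    """Return the highlight color for flagged categories, else a default blue."""
    return highlight_color if category in highlight else default

new_colors = {c: status_color(c, added, "green") for c in sorted(new_categories)}
old_colors = {c: status_color(c, removed, "red") for c in sorted(old_categories)}
print(new_colors)
print(old_colors)
```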

Step 5: Visualizing the Sitemap Trees

We create a side-by-side visualization of both sitemaps using Matplotlib. The first sitemap is displayed on the left, and the second is on the right.

The trees are drawn using NetworkX's graphviz_layout() function (imported from nx_pydot), which calls Graphviz's dot program to arrange the nodes hierarchically. The layout is then rotated by 90 degrees so that each tree reads from left to right, which is easier to follow.

Each tree is plotted with the assigned colors, making it easy to identify shared, unique, and distinct categories between the two sitemaps.

# Plot both graphs side by side
fig, axes = plt.subplots(1, 2, figsize=(30, 15))

# First sitemap with unique categories in green
pos1 = graphviz_layout(filtered_G, prog='dot')
pos1 = {k: (-v[1], v[0]) for k, v in pos1.items()}  # Swap x and y coordinates and invert x for the opposite rotation
nx.draw(filtered_G, pos1, with_labels=True, node_size=1000, node_color=color_map1, font_size=9, font_weight="bold", edge_color="black", arrows=False, ax=axes[0])
axes[0].set_title('First Sitemap Tree Structure')

# Second sitemap with unique categories in red
pos2 = graphviz_layout(filtered_G2, prog='dot')
pos2 = {k: (-v[1], v[0]) for k, v in pos2.items()}  # Swap x and y coordinates and invert x for the opposite rotation
nx.draw(filtered_G2, pos2, with_labels=True, node_size=1000, node_color=color_map2, font_size=9, font_weight="bold", edge_color="black", arrows=False, ax=axes[1])
axes[1].set_title('Second Sitemap Tree Structure')

plt.show()        


Left: the new sitemap displayed as a tree. Right: the old sitemap. Green: new nodes (sites). Red: deleted nodes.

The resulting visualizations clearly show the structure of both sitemaps, with unique categories in green and red, and shared categories in shades of teal blue (looks like a nice, neutral color to me). This visual comparison allows us to quickly understand the differences and overlaps between the two website structures.
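
To check the depth objective from the beginning (two or three levels for the most important pages) with numbers rather than by eye, the depth of each URL can be counted directly. This is a rough sketch; the function name path_depth, the threshold of 3, and the sample URLs are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def path_depth(url):
    """Number of non-empty path segments; the homepage has depth 0."""
    return len([p for p in urlparse(url).path.split('/') if p])

# Hypothetical URLs for illustration only
sample_urls = [
    "https://example.com/",
    "https://example.com/products",
    "https://example.com/products/labels",
    "https://example.com/products/labels/custom/die-cut",
]

# How many pages sit at each depth, and which ones exceed three levels
depth_counts = Counter(path_depth(u) for u in sample_urls)
too_deep = [u for u in sample_urls if path_depth(u) > 3]
print(dict(depth_counts))
print(too_deep)
```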

Conclusion

Website and SEO managers, along with marketing directors, should always keep a close eye on how a website's sitemap evolves over time. You don't always need expensive tools to do this. I’ll admit, creating this visualization took some time and effort, but with this code—and thanks to the amazing NetworkX library—you can create practical and insightful visualizations.

In my next article, I’ll dive into how you can interpret these visualizations with important KPIs to make your sitemap measurable and actionable.

Remark: The pydot library (https://github.com/pydot/pydot) used in this practical example can cause problems, for instance when Graphviz is not installed. You can replace it with a custom layout function, such as the community-written hierarchy_pos helper (which is not part of NetworkX itself), combined with matplotlib.
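
As a rough idea of what such a Graphviz-free layout could look like, here is a small sketch. It is not the hierarchy_pos helper itself; the function name simple_tree_pos and its parameters are hypothetical, and it assumes the graph is a forest, meaning every node has at most one parent and there are no cycles.

```python
import networkx as nx

def simple_tree_pos(graph, x_gap=1.0, y_gap=1.0):
    """Place each node at x = depth; leaves get consecutive rows (left-to-right tree)."""
    pos = {}
    next_row = [0]  # mutable counter shared with the nested function

    def place(node, depth):
        children = list(graph.successors(node))
        if not children:
            y = next_row[0] * y_gap  # each leaf takes the next free row
            next_row[0] += 1
        else:
            child_ys = [place(child, depth + 1) for child in children]
            y = sum(child_ys) / len(child_ys)  # center the parent on its children
        pos[node] = (depth * x_gap, -y)
        return y

    # Roots are the nodes without incoming edges
    roots = [n for n, deg in graph.in_degree() if deg == 0]
    for root in roots:
        place(root, 0)
    return pos

# Hypothetical graph for illustration only
sample_graph = nx.DiGraph([
    ("products", "labels"),
    ("products", "stickers"),
    ("services", "printing"),
])
sample_pos = simple_tree_pos(sample_graph)
print(sample_pos)
```

The returned dictionary can be passed to nx.draw() as the pos argument in place of the graphviz_layout() result. Because x already encodes the depth, the coordinate swap used for the rotation is not needed.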

