Strategic Website Planning: Using Python and NetworkX to Visualize & Compare Sitemap Trees
Björn Thomsen
Marketing Lead at meshcloud.io | Accelerating B2B Market Growth | Professional in Performance Marketing & Web Analytics
Website maintenance, frequent updates, or even a partial relaunch can sometimes lead to a drift from the original vision. However, in website planning, the sitemap and keywords are critical components that must remain a priority—especially if you're aiming to excel in SEO.
In the following example, I'll demonstrate how Python can be used not only to visualize a website's sitemap but also to compare it with an earlier version of the same site. This comparison will reveal changes in the site structure, highlighting what has been removed and what has been added. To make this practical, I'll use a website I'm currently working on. The objective is to represent the site structure, showing parent-child relationships, for both the older and newer versions as a sitemap tree.
Our Objectives
So, let’s get down to business. We’ll be using Python libraries like xml, matplotlib, urllib, collections, pydot, numpy, and most importantly, networkx (https://networkx.org/). If you’re missing any of these libraries, just go ahead and install them using pip install ....
We’ll refer to our sitemaps as sitemap1.xml and sitemap2.xml. You can grab these from your hosting provider, or even directly from your web browser.
Alright, let's start coding...
Step 1: Importing Required Libraries & Parsing Sitemaps
We begin by parsing the XML files of the two sitemaps using Python's xml.etree.ElementTree module. The ET.parse() function reads the XML files and constructs an ElementTree object, allowing us to traverse and manipulate the XML data.
For each sitemap, we extract the URLs and the associated paths. The paths are split into components (e.g., /category/subcategory/page becomes ['category', 'subcategory', 'page']).
import xml.etree.ElementTree as ET
import networkx as nx
import matplotlib.pyplot as plt
from urllib.parse import urlparse, unquote
from collections import defaultdict
import pydot
from networkx.drawing.nx_pydot import graphviz_layout
import numpy as np
import matplotlib.colors as mcolors
# Parse the first sitemap (here: the latest version of the site)
tree = ET.parse(r'PATH/YOURSITEMAP')
root = tree.getroot()
# Parse the second sitemap (here: an older sitemap from my development environment)
tree2 = ET.parse('/mnt/data/sitemap2.xml')
root2 = tree2.getroot()
# Namespace of the sitemap elements; it must match the xmlns attribute of <urlset> exactly.
# The sitemaps.org protocol declares it with http://, not https://
ns = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
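To make the parsing step easier to try out, here is a minimal sketch that runs without any files. The inlined sitemap, the example.com URLs, and the helper name extract_path_parts are assumptions made for illustration only; they are not part of the original code. The namespace mapping is the one declared in the sample's own urlset element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical two-entry sitemap, inlined so the example needs no file on disk
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/category/subcategory/page/</loc></url>
</urlset>"""

# Must match the xmlns attribute of the <urlset> element above
NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_path_parts(xml_text):
    """Return one list of path segments per <loc> entry."""
    root = ET.fromstring(xml_text)
    all_parts = []
    for url in root.findall('ns:url', NS):
        loc = url.find('ns:loc', NS).text
        path = urlparse(loc).path.strip('/')
        all_parts.append(path.split('/'))
    return all_parts

path_parts = extract_path_parts(SAMPLE_SITEMAP)
print(path_parts)
```

Note that the homepage yields a single empty segment, because splitting an empty path still returns a list with one empty string.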
Step 2: Filtering the Paths
Next, we filter out every path that contains the segment "blog" or "de" from both sitemaps. Segments are compared as whole path components, so a path like /design is kept while /de/... is dropped. This ensures that the visualization focuses only on relevant sections of the websites, excluding any paths related to blogs or German-language pages.
# Rebuild the tree structure for sitemap 1, excluding paths that contain a "blog" or "de" segment
filtered_tree_dict = lambda: defaultdict(filtered_tree_dict)
filtered_url_tree = filtered_tree_dict()
for url in root.findall('ns:url', ns):
    loc = url.find('ns:loc', ns).text
    path = urlparse(loc).path.strip('/')
    parts = path.split('/')
    # Only add to the tree if neither "blog" nor "de" is in the path
    if "blog" not in parts and "de" not in parts:
        current_level = filtered_url_tree
        for part in parts:
            current_level = current_level[part]
# Rebuild the tree structure for sitemap 2, excluding paths that contain a "blog" or "de" segment
filtered_tree_dict2 = lambda: defaultdict(filtered_tree_dict2)
filtered_url_tree2 = filtered_tree_dict2()
for url in root2.findall('ns:url', ns):
    loc = url.find('ns:loc', ns).text
    path = urlparse(loc).path.strip('/')
    parts = path.split('/')
    # Only add to the tree if neither "blog" nor "de" is in the path
    if "blog" not in parts and "de" not in parts:
        current_level = filtered_url_tree2
        for part in parts:
            current_level = current_level[part]
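Since the two loops above are identical apart from their input, they could be folded into one helper. The following is only a sketch under assumptions: the names make_tree, build_filtered_tree, and to_plain_dict are hypothetical, and the helper takes a plain list of URL strings rather than a parsed XML root so that it runs on its own.

```python
from collections import defaultdict
from urllib.parse import urlparse

def make_tree():
    """Nested defaultdict: accessing a missing key creates a new subtree."""
    return defaultdict(make_tree)

def build_filtered_tree(locs, excluded=("blog", "de")):
    """Build a nested dict of path segments, skipping excluded segments."""
    tree = make_tree()
    for loc in locs:
        parts = urlparse(loc).path.strip('/').split('/')
        # Whole-segment comparison: "design" does not match "de"
        if any(segment in parts for segment in excluded):
            continue
        level = tree
        for part in parts:
            level = level[part]
    return tree

def to_plain_dict(tree):
    """Convert nested defaultdicts to plain dicts for easier inspection."""
    return {key: to_plain_dict(value) for key, value in tree.items()}

# Hypothetical URLs for illustration
sample_locs = [
    "https://example.com/products/widget",
    "https://example.com/products/gadget",
    "https://example.com/blog/some-post",
    "https://example.com/de/produkte",
    "https://example.com/design",
]
sample_tree = to_plain_dict(build_filtered_tree(sample_locs))
print(sample_tree)
```

With the parsed sitemaps, the same helper could be fed the text of each ns:loc element, once per sitemap.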
Step 3: Constructing the Tree Structures
We then construct a hierarchical tree structure from the filtered paths using nested dictionaries. Each level of the tree corresponds to a segment of the path. For example, if a URL path is /category/subcategory/page, the tree structure would represent category as the parent, with subcategory as a child, and page as a child of subcategory.
We use a recursive function, add_edges_filtered(), to traverse the nested dictionaries and add edges to a directed graph (DiGraph) from the NetworkX library, representing the hierarchical relationships between the nodes.
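# Create a directed graph for each filtered tree
filtered_G = nx.DiGraph()
filtered_G2 = nx.DiGraph()
def add_edges_filtered(graph, d, parent=None):
    # Walk the nested dictionary and add one edge per parent-child pair
    for k, v in d.items():
        if parent:
            graph.add_edge(parent, k)
        add_edges_filtered(graph, v, k)
add_edges_filtered(filtered_G, filtered_url_tree)
add_edges_filtered(filtered_G2, filtered_url_tree2)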
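One thing to be aware of: the graph nodes above are named after single path segments, so two pages such as /products/overview and /services/overview would end up sharing one "overview" node. If that matters for your site, one possible variation is to use the full path as the node ID. The sketch below is an assumption-laden illustration, not the approach used in this article; path_edges is a hypothetical helper that only returns an edge list, in plain Python.

```python
def path_edges(tree, parent_path=""):
    """Return (parent, child) edges with full paths as node IDs."""
    edges = []
    for segment, children in tree.items():
        node_id = f"{parent_path}/{segment}"
        if parent_path:
            edges.append((parent_path, node_id))
        # Recurse into the subtree, passing the full path down
        edges.extend(path_edges(children, node_id))
    return edges

# Hypothetical tree in which "overview" appears under two different parents
sample_tree = {
    "products": {"overview": {}},
    "services": {"overview": {}},
}
edges = path_edges(sample_tree)
print(edges)
```

The resulting list can be passed to a NetworkX DiGraph via add_edges_from(). For readable labels, you would then show only the last segment of each node ID.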
Step 4: Assigning Colors & Highlighting New and Deleted Sites
We assign colors to the top-level categories (the first segment of the path) to visually distinguish them. Shades of teal blue are used for categories that exist in both sitemaps.
To create the color map, we iterate through the nodes of the tree and assign a color to each category. The colors are chosen from a predefined list of teal shades (Blues colormap).
Next, we compare the categories between the two sitemaps. Green highlights new categories: those that exist in the first sitemap but are missing from the second.
Red highlights removed categories: those present in the second sitemap but not in the first. The comparison works on the category lists from both sitemaps; whenever a category is missing from the other sitemap, the colors of the corresponding nodes are updated in the graph.
# Generate teal blue shades for categories
teal_shades = [mcolors.to_hex(c) for c in plt.cm.Blues(np.linspace(0.3, 1, 10))]
# Generate a map of categories for both sitemaps
def get_category_colors(filtered_G, categories, teal_shades):
    color_map = []
    category_color_map = {}
    for node in filtered_G.nodes:
        # Node names are single path segments, so this normally resolves to the node itself
        top_category = node.split('/')[0] if '/' in node else node
        if top_category not in category_color_map:
            category_color_map[top_category] = teal_shades[len(category_color_map) % len(teal_shades)]
        color_map.append(category_color_map[top_category])
    return color_map, category_color_map
color_map1, category_color_map1 = get_category_colors(filtered_G, {}, teal_shades)
color_map2, category_color_map2 = get_category_colors(filtered_G2, {}, teal_shades)
# Mark categories in sitemap 1 that are not in sitemap 2 as green
for category in category_color_map1:
    if category not in category_color_map2:
        # color_map1 follows the node order of filtered_G, so the index lines up
        for idx, node in enumerate(filtered_G.nodes):
            if node.startswith(category):
                color_map1[idx] = 'green'
# Mark categories in sitemap 2 that are not in sitemap 1 as red
for category in category_color_map2:
    if category not in category_color_map1:
        # color_map2 follows the node order of filtered_G2, so the index lines up
        for idx, node in enumerate(filtered_G2.nodes):
            if node.startswith(category):
                color_map2[idx] = 'red'
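The comparison itself boils down to set differences between the node names of the two graphs. Here is a minimal sketch of that idea in plain Python. The helper names diff_nodes and node_colors, the sample node lists, and the fallback hex color are assumptions for illustration; unlike the code above, this variant matches node names exactly instead of by prefix.

```python
def diff_nodes(nodes_new, nodes_old):
    """Classify node names as added, removed, or shared."""
    new, old = set(nodes_new), set(nodes_old)
    return {
        "added": sorted(new - old),    # only in the newer sitemap
        "removed": sorted(old - new),  # only in the older sitemap
        "shared": sorted(new & old),   # present in both
    }

def node_colors(nodes, other_nodes, highlight, shared="#6baed6"):
    """One color per node, in node order: highlight if missing from the other sitemap."""
    other = set(other_nodes)
    return [shared if node in other else highlight for node in nodes]

# Hypothetical node lists; with NetworkX these could be list(filtered_G.nodes)
nodes_latest = ["products", "pricing", "platform"]
nodes_older = ["products", "pricing", "solutions"]

diff = diff_nodes(nodes_latest, nodes_older)
colors_latest = node_colors(nodes_latest, nodes_older, highlight="green")
colors_older = node_colors(nodes_older, nodes_latest, highlight="red")
print(diff)
```

The color lists keep the order of the input nodes, which is what the node_color argument of nx.draw expects when the nodes are passed in graph order.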
Step 5: Visualizing the Sitemap Trees
We create a side-by-side visualization of both sitemaps using Matplotlib. The first sitemap is displayed on the left, and the second is on the right.
The trees are drawn using the graphviz_layout() function from NetworkX, which generates a layout that visually organizes the nodes hierarchically. The layout is rotated by 90 degrees for better readability.
Each tree is plotted with the assigned colors, making it easy to identify shared, unique, and distinct categories between the two sitemaps.
# Plot both graphs side by side
fig, axes = plt.subplots(1, 2, figsize=(30, 15))
# First sitemap with unique categories in green
pos1 = graphviz_layout(filtered_G, prog='dot')
pos1 = {k: (-v[1], v[0]) for k, v in pos1.items()} # Swap x and y coordinates and invert x for the opposite rotation
nx.draw(filtered_G, pos1, with_labels=True, node_size=1000, node_color=color_map1, font_size=9, font_weight="bold", edge_color="black", arrows=False, ax=axes[0])
axes[0].set_title('First Sitemap Tree Structure')
# Second sitemap with unique categories in red
pos2 = graphviz_layout(filtered_G2, prog='dot')
pos2 = {k: (-v[1], v[0]) for k, v in pos2.items()} # Swap x and y coordinates and invert x for the opposite rotation
nx.draw(filtered_G2, pos2, with_labels=True, node_size=1000, node_color=color_map2, font_size=9, font_weight="bold", edge_color="black", arrows=False, ax=axes[1])
axes[1].set_title('Second Sitemap Tree Structure')
plt.show()
The resulting visualizations clearly show the structure of both sitemaps, with unique categories in green and red, and shared categories in shades of teal blue (looks like a nice, neutral color to me). This visual comparison allows us to quickly understand the differences and overlaps between the two website structures.
Conclusion
Website and SEO managers, along with marketing directors, should always keep a close eye on how a website's sitemap evolves over time. You don't always need expensive tools to do this. I’ll admit, creating this visualization took some time and effort, but with this code—and thanks to the amazing NetworkX library—you can create practical and insightful visualizations.
In my next article, I’ll dive into how you can interpret these visualizations with important KPIs to make your sitemap measurable and actionable.
Remark: The pydot library (https://github.com/pydot/pydot) used in this practical example can cause problems. You can replace it with a custom layout function, such as the widely shared hierarchy_pos helper (a community snippet, not part of NetworkX itself), combined with matplotlib.
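If you want to avoid pydot entirely, a simple hierarchical layout can also be computed directly from the nested dictionary. The following is a rough sketch, not a drop-in replacement for graphviz_layout: tree_layout is a hypothetical helper, it assumes that every segment name is unique across the tree, and it places nodes left to right by depth with leaves stacked from top to bottom.

```python
def tree_layout(tree, depth=0, pos=None, next_leaf=None):
    """Return {node: (x, y)} where x is the depth and leaves get consecutive y slots."""
    if pos is None:
        pos, next_leaf = {}, [0]
    for node, children in tree.items():
        if children:
            # Place the children first, then center the parent between them
            tree_layout(children, depth + 1, pos, next_leaf)
            child_ys = [pos[child][1] for child in children]
            y = sum(child_ys) / len(child_ys)
        else:
            # Each leaf takes the next free slot, counting downwards
            y = -next_leaf[0]
            next_leaf[0] += 1
        pos[node] = (depth, y)
    return pos

# Hypothetical tree with unique segment names
sample_tree = {
    "products": {"widget": {}, "gadget": {}},
    "pricing": {},
}
positions = tree_layout(sample_tree)
print(positions)
```

The returned dictionary has the same shape as the pos argument of nx.draw, so it should be usable there as long as every node in the graph has an entry.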