Regex for SEO cheatsheet: normalize URLs, remove parameters, extract domains, subdomains, and more.
This quick-reference shows regex snippets
SQL & regex experts can scroll to the relevant section to find snippets ready for copy&paste.
This article is structured as follows:
When working with URL data, especially from Google Analytics or various website analytics tools, each URL might have several variations due to parameters and slashes and marketing campaigns, sub tags, etc. Oftentimes we need to clean the URLs
For example:
Our data:
URL Clicks
https://www.example.com/blog/seo-data?campaign_id=123 1,000
https://www.example.com/blog/seo-data 7,000
https://www.example.com/blog/seo-data/ 33
But we want:
Domain Path Clicks
example.com /blog/seo-data 8,033
This article reviews such use-cases in depth and provides the regex & SQL snippets you need to achieve your data goals.?
EXTRACT THE DOMAIN OF A URL
This sounds like an easy task, right? Unfortunately it can be difficult to find a working regex snippet online to do this, but we will provide a few variations for you here.
Regex that removes https, http, and www
Example: https://www.example.com/path --> example.com
^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)
Regex to include subdomain
Example: https://br.falaportugues.io/path --> br.falaportugues.io
^(?:.*://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)
Regex to handle any protocol and remove any subdomain
Example: ftp://br.falaportugues.io/path --> falaportugues.io
^(?:.*://)?(?:[^@/\n]+@)?(?:[^.]*\.)?([^:/?\n]+)
Google Sheet - Extract domain from URL with REGEXEXTRACT
Try this formula in your sheet, where A2 is the cell containing the URL:
=REGEXEXTRACT(A2,"^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)")
Screenshot example:
Notice how row6 shows that ftp protocol breaks the regex, so we modified the regex on row7 to handle any protocol and also allow subdomains, and then row8 is modified again to remove any subdomains.
Big Query - Extract domain from URL with REGEXP_EXTRACT
Try this snippet in your SQL:
regexp_extract(url, r'^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)')
Here is a screenshot example:
This example demonstrates how protocols other than http & https can break this regex, and subdomains other than www can still come through, so for those use-cases you can try the other 2 regex snippets provided at the top of this section.
EXTRACT THE PATH OF A URL
Instead of extracting the domain, sometimes we just want the path after the domain. Here are a few variations to help you:
领英推荐
Extract everything after .com?
Example: https://www.aa.com/en/how-to-regex?id=123 --> /en/how-to-regex?id=123
.com(.*)
Extract everything after .com but before the parameters
Example: https://www.aa.com/en/how-to-regex?id=123 --> /en/how-to-regex
.com(.*?)\?
Collect everything after the first encountered forward-slash with Regex Extract
Example: www.aa.io/en/how-to-regex?id=123 --> /en/how-to-regex
(\/.+)(.*?)\?
Note that this regex solution requires the https:// to already be removed. This can be achieved with nested functions, such as:
=REGEXEXTRACT(REGEXREPLACE(A18,"^.*//",""),"(\/.+)(.*?)\?")
In the Google Sheet example above, the regex-replace function captures the "https://" using ^.*// and then replaces it with an empty string.
Extract only the final slug, but remove any parameters
Example: www.aa.com/how-to-guides/how-to-regex?id=123 --> /how-to-regex
.+(\/.+)(.*?)\?
Google Sheet - Extract the path from a URL with regex
I believe this nested function is the best example, because it handles more use-cases than only .com URLs...
=REGEXEXTRACT(REGEXREPLACE(A2,"^.*//",""),"(\/.+)(.*?)\?")
...where A2 is the cell containing the URL.
Big Query - Extract the path from a URL with regex
If you copy&paste the snippets from above into Big Query, you might receive an error about having multiple capturing groups in your regular expression (regex).
I've highlighted the second capturing group (wrapped in parentheses) which is causing the error. We can fix this by merging the two groups into one group:
regexp_extract(regexp_replace(url, r'^.*//', ""), r'(\/.+.*?)\?')
NORMALIZE YOUR URL DATA IN SQL / BIG QUERY
Let's say you have a database table and you need a page-level report, but each page is getting split into multiple rows due to the variations on each URL:
Our data:
URL Clicks
https://www.example.com/blog/seo-data?campaign_id=123 1,000
https://www.example.com/blog/seo-data 4,000
https://www.example.com/blog/seo-data/ 40
But we want:
URL Clicks
example.com/blog/seo-data 5,040
We can fix this problem using both Regex and CASE WHEN statements.
regexp_replace(regexp_replace(CASE
when substr(regexp_extract(url, r'[^?]*'), -1) not like '/'
then concat(regexp_extract(url, r'[^?]*'), '/')
else regexp_extract(url, r'[^?]*')
END, r'^.*//', ''), 'www.', '') as url_scrubbed
Notice I decided to normalize all URLs by giving them a trailing slash "/" because I found it easier and more intuitive to concat() a trailing slash, than to remove a trailing slash with a nested function like substr(url, 1, length(url)-1).
Wish you luck in your regex adventures!