登录查看更多内容

Regex for SEO cheatsheet: normalize URLs, remove parameters, extract domains, subdomains, and more.

Shan Newton

Data Developer Lead for Google at RedCloud Consulting

发布日期: 2022年9月13日

+ 关注

This quick-reference shows regex snippets in both Google Big Query and Google Sheets.

SQL & regex experts can scroll to the relevant section to find snippets ready for copy&paste.

This article is structured as follows:

Extract the domain from the URL
Extract the path from the URL
Normalize URLs in your database for accurate reporting

When working with URL data, especially from Google Analytics or various website analytics tools, each URL might have several variations due to parameters and slashes and marketing campaigns, sub tags, etc. Oftentimes we need to clean the URLs and collapse them into one row (group by, or pivot).

For example:

Our data:

URL                                                    Clicks
https://www.example.com/blog/seo-data?campaign_id=123   1,000
https://www.example.com/blog/seo-data                   7,000
https://www.example.com/blog/seo-data/                     33

But we want:

Domain             Path                                Clicks
example.com        /blog/seo-data                       8,033

This article reviews such use-cases in depth and provides the regex & SQL snippets you need to achieve your data goals.?

EXTRACT THE DOMAIN OF A URL

This sounds like an easy task, right? Unfortunately it can be difficult to find a working regex snippet online to do this, but we will provide a few variations for you here.

Regex that removes https, http, and www

Example: https://www.example.com/path --> example.com

^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)

Regex to include subdomain

Example: https://br.falaportugues.io/path --> br.falaportugues.io

^(?:.*://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)

Regex to handle any protocol and remove any subdomain

Example: ftp://br.falaportugues.io/path --> falaportugues.io

^(?:.*://)?(?:[^@/\n]+@)?(?:[^.]*\.)?([^:/?\n]+)

Google Sheet - Extract domain from URL with REGEXEXTRACT

Try this formula in your sheet, where A2 is the cell containing the URL:

=REGEXEXTRACT(A2,"^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)")

Screenshot example:

Google Sheet regex to extract domain from url link

Notice how row6 shows that ftp protocol breaks the regex, so we modified the regex on row7 to handle any protocol and also allow subdomains, and then row8 is modified again to remove any subdomains.

Big Query - Extract domain from URL with REGEXP_EXTRACT

Try this snippet in your SQL:

regexp_extract(url, r'^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)')

Here is a screenshot example:

This example demonstrates how protocols other than http & https can break this regex, and subdomains other than www can still come through, so for those use-cases you can try the other 2 regex snippets provided at the top of this section.

EXTRACT THE PATH OF A URL

Instead of extracting the domain, sometimes we just want the path after the domain. Here are a few variations to help you:

领英推荐

Weekly SEO News August 23-29

Digimetri 6 个月前

How We Used Big Data To Gain A Competitive Edge For…

Local SEO Guide 2 年前

Weekly SEO News November 15 - 22

Digimetri 3 个月前

Extract everything after .com?

Example: https://www.aa.com/en/how-to-regex?id=123 --> /en/how-to-regex?id=123

.com(.*)

Extract everything after .com but before the parameters

Example: https://www.aa.com/en/how-to-regex?id=123 --> /en/how-to-regex

.com(.*?)\?

Collect everything after the first encountered forward-slash with Regex Extract

Example: www.aa.io/en/how-to-regex?id=123 --> /en/how-to-regex

(\/.+)(.*?)\?

Note that this regex solution requires the https:// to already be removed. This can be achieved with nested functions, such as:

=REGEXEXTRACT(REGEXREPLACE(A18,"^.*//",""),"(\/.+)(.*?)\?")

In the Google Sheet example above, the regex-replace function captures the "https://" using ^.*// and then replaces it with an empty string.

Extract only the final slug, but remove any parameters

Example: www.aa.com/how-to-guides/how-to-regex?id=123 --> /how-to-regex

.+(\/.+)(.*?)\?

Google Sheet - Extract the path from a URL with regex

I believe this nested function is the best example, because it handles more use-cases than only .com URLs...

=REGEXEXTRACT(REGEXREPLACE(A2,"^.*//",""),"(\/.+)(.*?)\?")

...where A2 is the cell containing the URL.

Big Query - Extract the path from a URL with regex

If you copy&paste the snippets from above into Big Query, you might receive an error about having multiple capturing groups in your regular expression (regex).

I've highlighted the second capturing group (wrapped in parentheses) which is causing the error. We can fix this by merging the two groups into one group:

regexp_extract(regexp_replace(url, r'^.*//', ""), r'(\/.+.*?)\?')

NORMALIZE YOUR URL DATA IN SQL / BIG QUERY

Let's say you have a database table and you need a page-level report, but each page is getting split into multiple rows due to the variations on each URL:

Our data:

URL                                                    Clicks
https://www.example.com/blog/seo-data?campaign_id=123   1,000
https://www.example.com/blog/seo-data                   4,000
https://www.example.com/blog/seo-data/                     40

But we want:

URL                                                    Clicks
example.com/blog/seo-data                               5,040

We can fix this problem using both Regex and CASE WHEN statements.

regexp_replace(regexp_replace(CASE 
  when substr(regexp_extract(url, r'[^?]*'), -1) not like '/'
  then concat(regexp_extract(url, r'[^?]*'), '/') 
  else regexp_extract(url, r'[^?]*') 
END, r'^.*//', ''), 'www.', '') as url_scrubbed

Notice I decided to normalize all URLs by giving them a trailing slash "/" because I found it easier and more intuitive to concat() a trailing slash, than to remove a trailing slash with a nested function like substr(url, 1, length(url)-1).

Wish you luck in your regex adventures!

带有此图标的链接由领英创建，不带此图标的链接由作者添加。

要查看或添加评论，请登录

Shan Newton的更多文章

Coffee: the data shows it's a radical superfood & perhaps an essential vitamin

2022年11月29日

Coffee: the data shows it's a radical superfood & perhaps an essential vitamin

Hey team caffeine -- ever done a deep dive on coffee data? Let's go. "Thinking about how caffeine can be beneficial…

2 条评论

Regex for SEO cheatsheet: normalize URLs, remove parameters, extract domains, subdomains, and more.

Shan Newton

Data Developer Lead for Google at RedCloud Consulting

EXTRACT THE DOMAIN OF A URL

Google Sheet - Extract domain from URL with REGEXEXTRACT

Big Query - Extract domain from URL with REGEXP_EXTRACT

EXTRACT THE PATH OF A URL

领英推荐

Google Sheet - Extract the path from a URL with regex

Big Query - Extract the path from a URL with regex

NORMALIZE YOUR URL DATA IN SQL / BIG QUERY

Shan Newton的更多文章

社区洞察

其他会员也浏览了

A Beginner’s Guide to Structured Data & How it Impacts SEO

Structured Data and Rich Snippets: What's the difference ???

Useful tips and resources for setting up and taking action from Google Analytics.

What is Structured Data? And Why Should You Implement It?

Going beyond the basics of data analytics and SEO

My Experience with Structured Data Carousels (Beta)

8 Tips to Rank Higher on Google Maps in 2024

The 'Top Ten' Headline Lists in Google Analytics Suck

Digital Marketers See Schema Structured Data Shifting Beyond SEO

Structured Data and SEO: What You Need to Know in 2025

EXTRACT THE DOMAIN OF A URL

Google Sheet - Extract domain from URL with REGEXEXTRACT

Big Query - Extract domain from URL with REGEXP_EXTRACT

EXTRACT THE PATH OF A URL

领英推荐

Google Sheet - Extract the path from a URL with regex

Big Query - Extract the path from a URL with regex

NORMALIZE YOUR URL DATA IN SQL / BIG QUERY

Shan Newton的更多文章

Coffee: the data shows it's a radical superfood & perhaps an essential vitamin

社区洞察

其他会员也浏览了

A Beginner’s Guide to Structured Data & How it Impacts SEO

Structured Data and Rich Snippets: What's the difference ???

Useful tips and resources for setting up and taking action from Google Analytics.

What is Structured Data? And Why Should You Implement It?

Going beyond the basics of data analytics and SEO

My Experience with Structured Data Carousels (Beta)

8 Tips to Rank Higher on Google Maps in 2024

The 'Top Ten' Headline Lists in Google Analytics Suck

Digital Marketers See Schema Structured Data Shifting Beyond SEO

Structured Data and SEO: What You Need to Know in 2025