Regex for SEO cheatsheet: normalize URLs, remove parameters, extract domains, subdomains, and more.

Regex for SEO cheatsheet: normalize URLs, remove parameters, extract domains, subdomains, and more.

This quick-reference shows regex snippets in both Google Big Query and Google Sheets.

SQL & regex experts can scroll to the relevant section to find snippets ready for copy&paste.

This article is structured as follows:

When working with URL data, especially from Google Analytics or various website analytics tools, each URL might have several variations due to parameters and slashes and marketing campaigns, sub tags, etc. Oftentimes we need to clean the URLs and collapse them into one row (group by, or pivot).

For example:

Our data:

URL                                                    Clicks
https://www.example.com/blog/seo-data?campaign_id=123   1,000
https://www.example.com/blog/seo-data                   7,000
https://www.example.com/blog/seo-data/                     33

But we want:

Domain             Path                                Clicks
example.com        /blog/seo-data                       8,033        

This article reviews such use-cases in depth and provides the regex & SQL snippets you need to achieve your data goals.?


EXTRACT THE DOMAIN OF A URL

This sounds like an easy task, right? Unfortunately it can be difficult to find a working regex snippet online to do this, but we will provide a few variations for you here.

Regex that removes https, http, and www

Example: https://www.example.com/path --> example.com

^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)        

Regex to include subdomain

Example: https://br.falaportugues.io/path --> br.falaportugues.io

^(?:.*://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)        

Regex to handle any protocol and remove any subdomain

Example: ftp://br.falaportugues.io/path --> falaportugues.io

^(?:.*://)?(?:[^@/\n]+@)?(?:[^.]*\.)?([^:/?\n]+)        


Google Sheet - Extract domain from URL with REGEXEXTRACT

Try this formula in your sheet, where A2 is the cell containing the URL:

=REGEXEXTRACT(A2,"^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)")        

Screenshot example:

Google Sheet regex to extract domain from url link

Notice how row6 shows that ftp protocol breaks the regex, so we modified the regex on row7 to handle any protocol and also allow subdomains, and then row8 is modified again to remove any subdomains.

Big Query - Extract domain from URL with REGEXP_EXTRACT

Try this snippet in your SQL:

regexp_extract(url, r'^(?:https?://)?(?:[^@/\n]+@)?(?:www.)?([^:/?\n]+)')        

Here is a screenshot example:

No alt text provided for this image

This example demonstrates how protocols other than http & https can break this regex, and subdomains other than www can still come through, so for those use-cases you can try the other 2 regex snippets provided at the top of this section.


EXTRACT THE PATH OF A URL

Instead of extracting the domain, sometimes we just want the path after the domain. Here are a few variations to help you:

Extract everything after .com?

Example: https://www.aa.com/en/how-to-regex?id=123 --> /en/how-to-regex?id=123

.com(.*)        

Extract everything after .com but before the parameters

Example: https://www.aa.com/en/how-to-regex?id=123 --> /en/how-to-regex

.com(.*?)\?        

Collect everything after the first encountered forward-slash with Regex Extract

Example: www.aa.io/en/how-to-regex?id=123 --> /en/how-to-regex

(\/.+)(.*?)\?        

Note that this regex solution requires the https:// to already be removed. This can be achieved with nested functions, such as:

=REGEXEXTRACT(REGEXREPLACE(A18,"^.*//",""),"(\/.+)(.*?)\?")        

In the Google Sheet example above, the regex-replace function captures the "https://" using ^.*// and then replaces it with an empty string.

Extract only the final slug, but remove any parameters

Example: www.aa.com/how-to-guides/how-to-regex?id=123 --> /how-to-regex

.+(\/.+)(.*?)\?        

Google Sheet - Extract the path from a URL with regex

I believe this nested function is the best example, because it handles more use-cases than only .com URLs...

=REGEXEXTRACT(REGEXREPLACE(A2,"^.*//",""),"(\/.+)(.*?)\?")        

...where A2 is the cell containing the URL.

No alt text provided for this image

Big Query - Extract the path from a URL with regex

If you copy&paste the snippets from above into Big Query, you might receive an error about having multiple capturing groups in your regular expression (regex).

No alt text provided for this image

I've highlighted the second capturing group (wrapped in parentheses) which is causing the error. We can fix this by merging the two groups into one group:

regexp_extract(regexp_replace(url, r'^.*//', ""), r'(\/.+.*?)\?')        
No alt text provided for this image


NORMALIZE YOUR URL DATA IN SQL / BIG QUERY

Let's say you have a database table and you need a page-level report, but each page is getting split into multiple rows due to the variations on each URL:

Our data:

URL                                                    Clicks
https://www.example.com/blog/seo-data?campaign_id=123   1,000
https://www.example.com/blog/seo-data                   4,000
https://www.example.com/blog/seo-data/                     40

But we want:

URL                                                    Clicks
example.com/blog/seo-data                               5,040        

We can fix this problem using both Regex and CASE WHEN statements.

regexp_replace(regexp_replace(CASE 
  when substr(regexp_extract(url, r'[^?]*'), -1) not like '/'
  then concat(regexp_extract(url, r'[^?]*'), '/') 
  else regexp_extract(url, r'[^?]*') 
END, r'^.*//', ''), 'www.', '') as url_scrubbed         
No alt text provided for this image

Notice I decided to normalize all URLs by giving them a trailing slash "/" because I found it easier and more intuitive to concat() a trailing slash, than to remove a trailing slash with a nested function like substr(url, 1, length(url)-1).

Wish you luck in your regex adventures!

要查看或添加评论,请登录

Shan Newton的更多文章

社区洞察

其他会员也浏览了