ç™»å½•æŸ¥çœ‹æ›´å¤šå†…å®¹

The Union Join (SQL)

Thomas Handorf

?? Data for Growth | Data Analyst, Data Scientist, Web Analyst and Data Engineer | Founder & CEO of 9fwr.com Data Consultancy

å‘å¸ƒæ—¥æœŸ: 2022å¹´10æœˆ4æ—¥

In SQL, Joins are the premier way to connect different data sets in data modeling. However, Joins are difficult to handle if you need to deal with missing values, null values, different granularities or incomplete knowledge. These often leads to filtering out important data, duplication of data (fan-out) or both. The Union Join helps mitigating these problems by deferring the join to the BI tool and even allows dynamically adjusting the joining granularity. (tl;dr)

The Classic JOIN: getting a full marketing funnel view

As an example we take two data sets. One contains ad statistics from a platform like facebook holding dimensions like campaign, ad set and ad_id as well as metrics like spend, impressions and clicks.

The second data set contains data from web analytics showing how many sessions and purchases where coming from certain campaigns identified by UTM parameters.

Obviously, this requires to attach UTM parameters in the campaign link-outs such that the campaign name is transmitted in utm_campaign, the ad set name is transmitted in utm_content and the ad ID is put into utm_term. With the above data set it is possible to join the two data sets and get a full funnel view on the ad performance.

SELECT
  campaign, adset, ad_id, 
  spend, impressions, clicks, sessions, purchases
FROM facebook_ads
JOIN web_analytics 
ON campaign=utm_campaign and adset=utm_content and ad_id=utm_term

This resulting table will allow you to calculate further metrics like cost per order (CPO) in the BI tool which is only possible with the combined data.

Note: you would generally add more dimensions into the join conditions, in particular date, source and medium, especially when joining more than one ad source. For keeping the example simple they are left out.

Missing, incomplete or erroneous data

The above example works well if sessions in the web analytics data match exactly with the ads in the ad data. This will basically never be the case. First, web analytics data will contain sessions not only from the ad source but also from other e.g. organic sources. Furthermore, typos in the UTM parameters or missing parameters - for example if the ad_id is not available for the UTM parameters - can impact the join.

Unfortunately, the join will fail if only one of the parameters does not match. So if the ad_id is not transmitted it would also not join on campaign even if both campaign and utm_campaign would be provided and would be equal.

Furthermore data may be intrinsically incomplete. For example a campaign may run for X days, but users from these ads may still be active on the website after the X days. When also joining on days you will miss out traffic and conversions after day X because there are no corresponding ads in the ad source for these days. In the same way an ad may not have generated any sessions on a day Y while it had spend and impressions.

The outer join

One way to deal with missing and incomplete data is to use an outer join. While this would still not match the rows it would make sure that the non-matching data is not lost. It will make the resulting table a bit messy and requires using coalesce and ifnull in the query.

é¢†è‹±æŽ¨è

3 reasons to use views instead of tables in Power BI!

Nikola Ilic 2 å¹´å‰

How to find duplicates in a table using SQL?

LearnSQL.com 2 å¹´å‰

How to Find the Minimum Value in SQL (with Examples)

StrataScratch 1 ä¸ªæœˆå‰

SELECT
  COALESCE(campaign,utm_campaign) campaign, 
  COALESCE(adset,utm_content) adset, 
  COALESCE(ad_id,utm_term) ad_id, 
  ifnull(spend,0) spend, 
  ifnull(impressions,0) impressions, 
  ifnull(clicks,0) clicks, 
  ifnull(sessions,0) sessions, 
  ifnull(purchases,0) purchases
FROM facebook_ads
FULL OUTER JOIN web_analytics 
ON campaign=utm_campaign and adset=utm_content and ad_id=utm_term

If all dimension match, it will generate the same result as above. Otherwise it will not join but provide two separate rows.

In this example the ad_id could not be provided in utm_term. The metrics from the ad source and the web analytics metrics show up on different rows. This is still useful, though, because if you load it into your BI tool, you can still group it by campaign or adset and get the right result as it will sum up all corresponding rows.

Fan-outs

It is super important that the join will always link exactly one row from the ad source with one row from the web analytics source. Otherwise you will get a fan-out which will duplicate rows from one source and inflate its metrics. For example if the web analytics source contains 2 rows for the same combination of utm_campaign, utm_content and utm_term, it will result in two rows in the result table, where each row has the metrics from the ad source. When you eventually build report in the BI tool it will report double spend, impressions and clicks for this ad.

Basically this can be avoided by doing a group by on all columns used in the join condition for each of the source tables. This will ensure that each table consist of at most one row per combination of campaign, adset and ad_id. It means that both source tables have the same granularity. However for more real life situation this may not be so easy to achieve. You may have different granularities per source, e.g. you may get facebook data on ad_id level while in LinkedIn you can only link on campaign level. This would require to group LinkedIn sessions differently than facebook sessions in the web analytics data. While this is in principle possible it is error prone and rather confusing.

Incomplete knowledge

As a special case of having different granularities we assume that we gain further knowledge about the user on the website through a survey: the user will be classified as professional or consumer. At the time the user is seeing and clicking the ad, this knowledge is still missing. While the data we receive from the ad source is as before, the web analytics is delivering the following data:

Clearly, both tables have different granularity and joining them will create two lines with inflated ad metrics. When grouping the web analytics table first on the join parameters (campaign, ad set, ad ID) the fan-out can be prevented, but we would loose the information on user_type. Doing so would prevent us from seeing important metrics like how much each user_type contributes to for example the ROAS (return on ad spend) (see contribution metrics)

The Union Join

While we have seen that incomplete matches due to erroneous or incomplete data can be mitigated by an outer join, still the issue of fan-outs persists. We have also seen that for the rows not matching in the outer join, the analysis is still possible as the BI tool will group the data eventually on the desired level. So why joining at all? This is what the union join is doing. Instead of joining we will just union both tables, putting values of the columns in the join condition into the same result columns while introducing new columns for metrics and dimensions that are different in the source tables.

SELECT
  campaign,
  adset,
  ad_id,
  null user_type,
  spend,
  impressions,
  clicks,
  0 sessions,
  0 purchases
FROM facebook_ads
UNION ALL
SELECT
  utm_campaign,
  utm_content,
  utm_term,
  user_type,
  0 spend,
  0 impressions,
  0 clicks,
  sessions,
  purchases
FROM web_analytics

The resulting table will look as follows:

As with the outer join, the BI tool will eventually group by any of the dimensions and thereby bringing the metrics from the different sources (in different rows) together. The result will be the same as if they were joined in first place. You don't even have to decide on which level you want to join before hand (i.e. on ad_id, adset or campaign level) as this can be decided in the BI tool based on the same result table.

Takeaway

In real life data situations joins often lead to missing data or data duplications. The union join is a straight forward approach to skip the joining in SQL and let the BI tool do the magic when it does the group by. You don't even have to decide on which granularity you want to join before hand.

è¦æŸ¥çœ‹æˆ–æ·»åŠ è¯„è®ºï¼Œè¯·ç™»å½•

Thomas Handorfçš„æ›´å¤šæ–‡ç«

17 types of metrics to know

2023å¹´9æœˆ20æ—¥

17 types of metrics to know

tl;dr All metrics are numbers but that doesn't mean they behave all equal when analyzing them. Knowing these types ofâ€¦

10 æ¡è¯„è®º
Multi Source Attribution (MSA): weighting online and offline channels for a 360Â° Marketing Performance view

2023å¹´5æœˆ16æ—¥

Multi Source Attribution (MSA): weighting online and offline channels for a 360Â° Marketing Performance view

Marketing Attribution today is generally based on tracking data. This approach is limited in that it can only considerâ€¦

9 æ¡è¯„è®º
Data & Analytics Predictions for 2023

2022å¹´12æœˆ29æ—¥

Data & Analytics Predictions for 2023

We at 9 friendly white rabbits consult customers in data & analytics questions and exchange with like-minded people. Asâ€¦

3 æ¡è¯„è®º
Cookieless Tracking - what works, what doesn't?

2022å¹´12æœˆ5æ—¥

Cookieless Tracking - what works, what doesn't?

Cookie-less tracking is seen as a salvation by some after cookie-based tracking is more and more banned by laws, courtâ€¦

4 æ¡è¯„è®º
Roles in Data - and whom to hire first

2022å¹´10æœˆ27æ—¥

Roles in Data - and whom to hire first

There are a few different roles in data that are dealing with different parts of the data stack. It's important toâ€¦
The ONE thing to know to understand Business Intelligence

2022å¹´10æœˆ12æ—¥

The ONE thing to know to understand Business Intelligence

Data and Analytics can be quite daunting, but if you're just starting with BI there this ONE thing, this oneâ€¦

3 æ¡è¯„è®º
How Data Contracts can solve your organizational data bottlenecks

2022å¹´9æœˆ28æ—¥

How Data Contracts can solve your organizational data bottlenecks

The central data warehouse often becomes a bottleneck in larger data driven organizations. Absurdly, this is even moreâ€¦

2 æ¡è¯„è®º
13 types of metrics to know

2022å¹´9æœˆ21æ—¥

13 types of metrics to know

tl;dr All metrics are numbers but that doesn't mean they behave all equal when analyzing them. Knowing these types ofâ€¦

10 æ¡è¯„è®º
2022 Data Analytics Predictions

2022å¹´1æœˆ18æ—¥

2022 Data Analytics Predictions

We at 9 friendly white rabbits consult customers in data & analytics questions and exchange with like-minded people. Asâ€¦

1 æ¡è¯„è®º

See all articles

The Union Join (SQL)

Thomas Handorf

?? Data for Growth | Data Analyst, Data Scientist, Web Analyst and Data Engineer | Founder & CEO of 9fwr.com Data Consultancy

The Classic JOIN: getting a full marketing funnel view

Missing, incomplete or erroneous data

The outer join

é¢†è‹±æŽ¨è

Fan-outs

Incomplete knowledge

The Union Join

Takeaway

Thomas Handorfçš„æ›´å¤šæ–‡ç«

ç¤¾åŒºæ´žå¯Ÿ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†

Writing SQL Like a Pro: Advanced Queries Explained

Tech & Data Diary - Entry #012: Understanding Joins Through BigQuery

Unlocking Excel's Full Potential: Power Query, Pivot Tables, and Macros for Advanced Data Analysis

Utilising Power BI to connect to Open Data sources: COVID-19 Reporting

Syncari's New SQL Editor: Fast-Tracking Data to Decisions for a Competitive Edge

Clustered vs Non-Clustered Indexes

Tableau Supercharged My Data Analysis Skills

Unleashing the Power of SQL: Top 10 Projects with Datasets for Skill Enhancement

Day 20: Import vs DirectQuery vs Live Connection in Power BI

The Classic JOIN: getting a full marketing funnel view

Missing, incomplete or erroneous data

The outer join

é¢†è‹±æŽ¨è

Fan-outs

Incomplete knowledge

The Union Join

Takeaway

Thomas Handorfçš„æ›´å¤šæ–‡ç«

17 types of metrics to know

Multi Source Attribution (MSA): weighting online and offline channels for a 360Â° Marketing Performance view

Data & Analytics Predictions for 2023

Cookieless Tracking - what works, what doesn't?

Roles in Data - and whom to hire first

The ONE thing to know to understand Business Intelligence

How Data Contracts can solve your organizational data bottlenecks

13 types of metrics to know

2022 Data Analytics Predictions

ç¤¾åŒºæ´žå¯Ÿ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†

Writing SQL Like a Pro: Advanced Queries Explained

Tech & Data Diary - Entry #012: Understanding Joins Through BigQuery

Unlocking Excel's Full Potential: Power Query, Pivot Tables, and Macros for Advanced Data Analysis

Utilising Power BI to connect to Open Data sources: COVID-19 Reporting

Syncari's New SQL Editor: Fast-Tracking Data to Decisions for a Competitive Edge

Clustered vs Non-Clustered Indexes

Tableau Supercharged My Data Analysis Skills

Unleashing the Power of SQL: Top 10 Projects with Datasets for Skill Enhancement

Day 20: Import vs DirectQuery vs Live Connection in Power BI

é¢†è‹±æŽ¨è

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†