The Union Join (SQL)

The Union Join (SQL)

In SQL, Joins are the premier way to connect different data sets in data modeling. However, Joins are difficult to handle if you need to deal with missing values, null values, different granularities or incomplete knowledge. These often leads to filtering out important data, duplication of data (fan-out) or both. The Union Join helps mitigating these problems by deferring the join to the BI tool and even allows dynamically adjusting the joining granularity. (tl;dr)

The Classic JOIN: getting a full marketing funnel view

As an example we take two data sets. One contains ad statistics from a platform like facebook holding dimensions like campaign, ad set and ad_id as well as metrics like spend, impressions and clicks.

No alt text provided for this image

The second data set contains data from web analytics showing how many sessions and purchases where coming from certain campaigns identified by UTM parameters.

No alt text provided for this image

Obviously, this requires to attach UTM parameters in the campaign link-outs such that the campaign name is transmitted in utm_campaign, the ad set name is transmitted in utm_content and the ad ID is put into utm_term. With the above data set it is possible to join the two data sets and get a full funnel view on the ad performance.

SELECT
  campaign, adset, ad_id, 
  spend, impressions, clicks, sessions, purchases
FROM facebook_ads
JOIN web_analytics 
ON campaign=utm_campaign and adset=utm_content and ad_id=utm_term
        
No alt text provided for this image

This resulting table will allow you to calculate further metrics like cost per order (CPO) in the BI tool which is only possible with the combined data.

Note: you would generally add more dimensions into the join conditions, in particular date, source and medium, especially when joining more than one ad source. For keeping the example simple they are left out.

Missing, incomplete or erroneous data

The above example works well if sessions in the web analytics data match exactly with the ads in the ad data. This will basically never be the case. First, web analytics data will contain sessions not only from the ad source but also from other e.g. organic sources. Furthermore, typos in the UTM parameters or missing parameters - for example if the ad_id is not available for the UTM parameters - can impact the join.

Unfortunately, the join will fail if only one of the parameters does not match. So if the ad_id is not transmitted it would also not join on campaign even if both campaign and utm_campaign would be provided and would be equal.

Furthermore data may be intrinsically incomplete. For example a campaign may run for X days, but users from these ads may still be active on the website after the X days. When also joining on days you will miss out traffic and conversions after day X because there are no corresponding ads in the ad source for these days. In the same way an ad may not have generated any sessions on a day Y while it had spend and impressions.

The outer join

One way to deal with missing and incomplete data is to use an outer join. While this would still not match the rows it would make sure that the non-matching data is not lost. It will make the resulting table a bit messy and requires using coalesce and ifnull in the query.

SELECT
  COALESCE(campaign,utm_campaign) campaign, 
  COALESCE(adset,utm_content) adset, 
  COALESCE(ad_id,utm_term) ad_id, 
  ifnull(spend,0) spend, 
  ifnull(impressions,0) impressions, 
  ifnull(clicks,0) clicks, 
  ifnull(sessions,0) sessions, 
  ifnull(purchases,0) purchases
FROM facebook_ads
FULL OUTER JOIN web_analytics 
ON campaign=utm_campaign and adset=utm_content and ad_id=utm_term         

If all dimension match, it will generate the same result as above. Otherwise it will not join but provide two separate rows.

No alt text provided for this image

In this example the ad_id could not be provided in utm_term. The metrics from the ad source and the web analytics metrics show up on different rows. This is still useful, though, because if you load it into your BI tool, you can still group it by campaign or adset and get the right result as it will sum up all corresponding rows.

Fan-outs

It is super important that the join will always link exactly one row from the ad source with one row from the web analytics source. Otherwise you will get a fan-out which will duplicate rows from one source and inflate its metrics. For example if the web analytics source contains 2 rows for the same combination of utm_campaign, utm_content and utm_term, it will result in two rows in the result table, where each row has the metrics from the ad source. When you eventually build report in the BI tool it will report double spend, impressions and clicks for this ad.

Basically this can be avoided by doing a group by on all columns used in the join condition for each of the source tables. This will ensure that each table consist of at most one row per combination of campaign, adset and ad_id. It means that both source tables have the same granularity. However for more real life situation this may not be so easy to achieve. You may have different granularities per source, e.g. you may get facebook data on ad_id level while in LinkedIn you can only link on campaign level. This would require to group LinkedIn sessions differently than facebook sessions in the web analytics data. While this is in principle possible it is error prone and rather confusing.

Incomplete knowledge

As a special case of having different granularities we assume that we gain further knowledge about the user on the website through a survey: the user will be classified as professional or consumer. At the time the user is seeing and clicking the ad, this knowledge is still missing. While the data we receive from the ad source is as before, the web analytics is delivering the following data:

No alt text provided for this image

Clearly, both tables have different granularity and joining them will create two lines with inflated ad metrics. When grouping the web analytics table first on the join parameters (campaign, ad set, ad ID) the fan-out can be prevented, but we would loose the information on user_type. Doing so would prevent us from seeing important metrics like how much each user_type contributes to for example the ROAS (return on ad spend) (see contribution metrics)

The Union Join

While we have seen that incomplete matches due to erroneous or incomplete data can be mitigated by an outer join, still the issue of fan-outs persists. We have also seen that for the rows not matching in the outer join, the analysis is still possible as the BI tool will group the data eventually on the desired level. So why joining at all? This is what the union join is doing. Instead of joining we will just union both tables, putting values of the columns in the join condition into the same result columns while introducing new columns for metrics and dimensions that are different in the source tables.

SELECT
  campaign,
  adset,
  ad_id,
  null user_type,
  spend,
  impressions,
  clicks,
  0 sessions,
  0 purchases
FROM facebook_ads
UNION ALL
SELECT
  utm_campaign,
  utm_content,
  utm_term,
  user_type,
  0 spend,
  0 impressions,
  0 clicks,
  sessions,
  purchases
FROM web_analytics        

The resulting table will look as follows:

No alt text provided for this image

As with the outer join, the BI tool will eventually group by any of the dimensions and thereby bringing the metrics from the different sources (in different rows) together. The result will be the same as if they were joined in first place. You don't even have to decide on which level you want to join before hand (i.e. on ad_id, adset or campaign level) as this can be decided in the BI tool based on the same result table.

Takeaway

In real life data situations joins often lead to missing data or data duplications. The union join is a straight forward approach to skip the joining in SQL and let the BI tool do the magic when it does the group by. You don't even have to decide on which granularity you want to join before hand.

要查看或添加评论,请登录

Thomas Handorf的更多文章

  • 17 types of metrics to know

    17 types of metrics to know

    tl;dr All metrics are numbers but that doesn't mean they behave all equal when analyzing them. Knowing these types of…

    10 条评论
  • Multi Source Attribution (MSA): weighting online and offline channels for a 360° Marketing Performance view

    Multi Source Attribution (MSA): weighting online and offline channels for a 360° Marketing Performance view

    Marketing Attribution today is generally based on tracking data. This approach is limited in that it can only consider…

    9 条评论
  • Data & Analytics Predictions for 2023

    Data & Analytics Predictions for 2023

    We at 9 friendly white rabbits consult customers in data & analytics questions and exchange with like-minded people. As…

    3 条评论
  • Cookieless Tracking - what works, what doesn't?

    Cookieless Tracking - what works, what doesn't?

    Cookie-less tracking is seen as a salvation by some after cookie-based tracking is more and more banned by laws, court…

    4 条评论
  • Roles in Data - and whom to hire first

    Roles in Data - and whom to hire first

    There are a few different roles in data that are dealing with different parts of the data stack. It's important to…

  • The ONE thing to know to understand Business Intelligence

    The ONE thing to know to understand Business Intelligence

    Data and Analytics can be quite daunting, but if you're just starting with BI there this ONE thing, this one…

    3 条评论
  • How Data Contracts can solve your organizational data bottlenecks

    How Data Contracts can solve your organizational data bottlenecks

    The central data warehouse often becomes a bottleneck in larger data driven organizations. Absurdly, this is even more…

    2 条评论
  • 13 types of metrics to know

    13 types of metrics to know

    tl;dr All metrics are numbers but that doesn't mean they behave all equal when analyzing them. Knowing these types of…

    10 条评论
  • 2022 Data Analytics Predictions

    2022 Data Analytics Predictions

    We at 9 friendly white rabbits consult customers in data & analytics questions and exchange with like-minded people. As…

    1 条评论

社区洞察

其他会员也浏览了