The Union Join (SQL)
Thomas Handorf
?? Data for Growth | Data Analyst, Data Scientist, Web Analyst and Data Engineer | Founder & CEO of 9fwr.com Data Consultancy
In SQL, Joins are the premier way to connect different data sets in data modeling. However, Joins are difficult to handle if you need to deal with missing values, null values, different granularities or incomplete knowledge. These often leads to filtering out important data, duplication of data (fan-out) or both. The Union Join helps mitigating these problems by deferring the join to the BI tool and even allows dynamically adjusting the joining granularity. (tl;dr)
The Classic JOIN: getting a full marketing funnel view
As an example we take two data sets. One contains ad statistics from a platform like facebook holding dimensions like campaign, ad set and ad_id as well as metrics like spend, impressions and clicks.
The second data set contains data from web analytics showing how many sessions and purchases where coming from certain campaigns identified by UTM parameters.
Obviously, this requires to attach UTM parameters in the campaign link-outs such that the campaign name is transmitted in utm_campaign, the ad set name is transmitted in utm_content and the ad ID is put into utm_term. With the above data set it is possible to join the two data sets and get a full funnel view on the ad performance.
SELECT
campaign, adset, ad_id,
spend, impressions, clicks, sessions, purchases
FROM facebook_ads
JOIN web_analytics
ON campaign=utm_campaign and adset=utm_content and ad_id=utm_term
This resulting table will allow you to calculate further metrics like cost per order (CPO) in the BI tool which is only possible with the combined data.
Note: you would generally add more dimensions into the join conditions, in particular date, source and medium, especially when joining more than one ad source. For keeping the example simple they are left out.
Missing, incomplete or erroneous data
The above example works well if sessions in the web analytics data match exactly with the ads in the ad data. This will basically never be the case. First, web analytics data will contain sessions not only from the ad source but also from other e.g. organic sources. Furthermore, typos in the UTM parameters or missing parameters - for example if the ad_id is not available for the UTM parameters - can impact the join.
Unfortunately, the join will fail if only one of the parameters does not match. So if the ad_id is not transmitted it would also not join on campaign even if both campaign and utm_campaign would be provided and would be equal.
Furthermore data may be intrinsically incomplete. For example a campaign may run for X days, but users from these ads may still be active on the website after the X days. When also joining on days you will miss out traffic and conversions after day X because there are no corresponding ads in the ad source for these days. In the same way an ad may not have generated any sessions on a day Y while it had spend and impressions.
The outer join
One way to deal with missing and incomplete data is to use an outer join. While this would still not match the rows it would make sure that the non-matching data is not lost. It will make the resulting table a bit messy and requires using coalesce and ifnull in the query.
领英推è
SELECT
COALESCE(campaign,utm_campaign) campaign,
COALESCE(adset,utm_content) adset,
COALESCE(ad_id,utm_term) ad_id,
ifnull(spend,0) spend,
ifnull(impressions,0) impressions,
ifnull(clicks,0) clicks,
ifnull(sessions,0) sessions,
ifnull(purchases,0) purchases
FROM facebook_ads
FULL OUTER JOIN web_analytics
ON campaign=utm_campaign and adset=utm_content and ad_id=utm_term
If all dimension match, it will generate the same result as above. Otherwise it will not join but provide two separate rows.
In this example the ad_id could not be provided in utm_term. The metrics from the ad source and the web analytics metrics show up on different rows. This is still useful, though, because if you load it into your BI tool, you can still group it by campaign or adset and get the right result as it will sum up all corresponding rows.
Fan-outs
It is super important that the join will always link exactly one row from the ad source with one row from the web analytics source. Otherwise you will get a fan-out which will duplicate rows from one source and inflate its metrics. For example if the web analytics source contains 2 rows for the same combination of utm_campaign, utm_content and utm_term, it will result in two rows in the result table, where each row has the metrics from the ad source. When you eventually build report in the BI tool it will report double spend, impressions and clicks for this ad.
Basically this can be avoided by doing a group by on all columns used in the join condition for each of the source tables. This will ensure that each table consist of at most one row per combination of campaign, adset and ad_id. It means that both source tables have the same granularity. However for more real life situation this may not be so easy to achieve. You may have different granularities per source, e.g. you may get facebook data on ad_id level while in LinkedIn you can only link on campaign level. This would require to group LinkedIn sessions differently than facebook sessions in the web analytics data. While this is in principle possible it is error prone and rather confusing.
Incomplete knowledge
As a special case of having different granularities we assume that we gain further knowledge about the user on the website through a survey: the user will be classified as professional or consumer. At the time the user is seeing and clicking the ad, this knowledge is still missing. While the data we receive from the ad source is as before, the web analytics is delivering the following data:
Clearly, both tables have different granularity and joining them will create two lines with inflated ad metrics. When grouping the web analytics table first on the join parameters (campaign, ad set, ad ID) the fan-out can be prevented, but we would loose the information on user_type. Doing so would prevent us from seeing important metrics like how much each user_type contributes to for example the ROAS (return on ad spend) (see contribution metrics)
The Union Join
While we have seen that incomplete matches due to erroneous or incomplete data can be mitigated by an outer join, still the issue of fan-outs persists. We have also seen that for the rows not matching in the outer join, the analysis is still possible as the BI tool will group the data eventually on the desired level. So why joining at all? This is what the union join is doing. Instead of joining we will just union both tables, putting values of the columns in the join condition into the same result columns while introducing new columns for metrics and dimensions that are different in the source tables.
SELECT
campaign,
adset,
ad_id,
null user_type,
spend,
impressions,
clicks,
0 sessions,
0 purchases
FROM facebook_ads
UNION ALL
SELECT
utm_campaign,
utm_content,
utm_term,
user_type,
0 spend,
0 impressions,
0 clicks,
sessions,
purchases
FROM web_analytics
The resulting table will look as follows:
As with the outer join, the BI tool will eventually group by any of the dimensions and thereby bringing the metrics from the different sources (in different rows) together. The result will be the same as if they were joined in first place. You don't even have to decide on which level you want to join before hand (i.e. on ad_id, adset or campaign level) as this can be decided in the BI tool based on the same result table.
Takeaway
In real life data situations joins often lead to missing data or data duplications. The union join is a straight forward approach to skip the joining in SQL and let the BI tool do the magic when it does the group by. You don't even have to decide on which granularity you want to join before hand.