Remove PII from Google Analytics - before it's collected
Brian Clifton
Author; Founder Verified-Data.com; Former Head of Web Analytics Google (EMEA); Data Privacy Expert; PhD; Specialising in enterprise Google Analytics, GTM, Consent Management; Piwik PRO.
Essentially, I had been looking for a way to block Personally Identifiable Information (PII) hits at the collection level, i.e. using GTM, before the hit is sent to Google Analytics. Why do this? Putting the obvious requirement not to gather PII to one side, if you are adding filters to your GA Views in order to delete PII, it is too late: the problem has already occurred. That is, if you have already sent the personal data to Google, then you have already breached the GDPR!
Previously, using GTM, I would simply drop any hits containing page URLs with an @ symbol, i.e. in case the URL contained an email address. Apart from being quite blunt (not all URLs with an @ symbol contain an email address), this approach did not tackle email addresses present in other hit types, e.g. events, e-commerce data etc. It also did not tackle other PII types, such as telephone numbers, zip codes, usernames etc.
In my latest post, I extend the original method developed by Simo Ahava (using GTM's new customTask feature) by building out the regex further, both for more sophisticated email detection and to capture other PII types…
https://bit.ly/2fdKG16
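To make the idea concrete, here is a minimal sketch of redacting at the point of collection, rather than dropping every hit that contains an @. It is not the code from the linked post: the names `PII_PATTERNS`, `redactPii`, `piiCustomTask` and the `HitModel` interface are hypothetical, the single email regex is a deliberately simple assumption, and `HitModel` is only a stand-in for the analytics.js model object (with its `sendHitTask` and `hitPayload` fields) that a customTask receives.

```typescript
// Hypothetical pattern list: an illustrative email regex only.
// A real deployment would add patterns for phone numbers, postcodes etc.
interface PiiPattern {
  name: string;
  regex: RegExp;
}

const PII_PATTERNS: PiiPattern[] = [
  { name: "EMAIL", regex: /[A-Z0-9._+-]+@[A-Z0-9-]+(?:\.[A-Z0-9-]+)+/gi },
];

// Redact PII from a hit payload of the form key=value&key=value.
// Every parameter is checked, so events and e-commerce fields are
// covered as well as page URLs.
function redactPii(payload: string): string {
  return payload
    .split("&")
    .map((pair) => {
      const idx = pair.indexOf("=");
      if (idx === -1) return pair;
      const key = pair.slice(0, idx);
      let value: string;
      try {
        value = decodeURIComponent(pair.slice(idx + 1));
      } catch (_err) {
        return pair; // leave malformed encodings untouched
      }
      let redacted = value;
      for (const { name, regex } of PII_PATTERNS) {
        redacted = redacted.replace(regex, `[REDACTED ${name}]`);
      }
      // Only re-encode parameters that actually changed.
      return redacted === value ? pair : `${key}=${encodeURIComponent(redacted)}`;
    })
    .join("&");
}

// Minimal stand-in (assumption) for the model object passed to a customTask.
interface HitModel {
  get(key: string): unknown;
  set(key: string, value: unknown, temporary?: boolean): void;
}

// Sketch of a customTask: wrap the original send task so the payload
// is cleaned before it leaves the browser.
function piiCustomTask(model: HitModel): void {
  const originalSend = model.get("sendHitTask") as (m: HitModel) => void;
  model.set("sendHitTask", (sendModel: HitModel) => {
    const payload = String(sendModel.get("hitPayload"));
    sendModel.set("hitPayload", redactPii(payload), true);
    originalSend(sendModel);
  });
}
```

In GTM this logic would typically be returned from a Custom JavaScript variable and assigned to the tag's `customTask` field. Because the match is replaced with a `[REDACTED …]` marker instead of the hit being dropped, the traffic is still recorded and the marker shows where the PII is leaking from.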
CEO at Commanders Act
7 years ago
The #GDPR puts pressure on both the brand and the solution. A brand must avoid collecting PII without informed consent; a software vendor like Google should block PII data collection and at least anonymize all PII data stored in its systems.
Third sector digital, data and technology transformation
7 years ago
Useful solution also for situations where the PII is getting into URLs via a packaged application's admin functions (e.g. managing users in a CMS), where it isn't possible to address the problem at source, but there are still benefits in recording the traffic in GA.
Global Analytics Product Owner @ PUMA | Digital Analytics, GA4, & Privacy-First Strategies | Partnering with Teams Across 25+ Regions
7 years ago
These are still good solutions for big companies with slow dev cycles. You have to start somewhere. These things don't happen overnight, and this makes a good interim solution until the module(s) are fixed for good. It is easy to say "just don't have it in the URL" :) Sometimes it is what it is.
Founder @ UseItBetter, Analytics & Conversion Rate Optimisation
7 years ago
Hey Brian - I can only echo your observations. Still, alerting IT about PII in URLs based on GA doesn't solve the problem. By the time you see anything in GA, the unmasked PII data has already been logged by third-party servers (incl. GA servers). Why not use your script proactively to prevent loading URLs with PII? It's better to take down part of a website than to risk huge fines for unauthorised processing or leaking of PII data.
Author; Founder Verified-Data.com; Former Head of Web Analytics Google (EMEA); Data Privacy Expert; PhD; Specialising in enterprise Google Analytics, GTM, Consent Management; Piwik PRO.
7 years ago
Hello Łukasz Twardowski - you should know I am a privacy advocate and write about this a great deal. The reality is that within a large organisation there are many people working on the corporate website(s), both internal and external, and PII does end up in URLs. This is usually by mistake or oversight. In fact, I have found the problem affects 1 in 4 sites - more info here: https://brianclifton.com/blog/2018/08/10/auditing-google-analytics-an-enterprise-study/ Data Governance is a huge task which until recently was simply being overlooked. Therefore, automatically redacting PII at the point of collection is a way to monitor whether a site has PII issues and, most importantly, WHERE they are happening. See my comment about sending this info to the IT/dev team.