Big data, bad data; right analytics, right actions
Bigger is better, right? But bigger data is not necessarily better. In most cases, it's actually worse, because it's not being used correctly or interpreted correctly. Also, users of the data may not know how accurate the data is or whether it's been tampered with. What the hell am I getting at? Over the years, many advertisers that I have helped have been advised to get log-level data as a way to solve ad fraud. That's a nice idea in theory, but I'm here to tell you why that's not necessary and what to do instead.
Big data could simply be bad data
Log-level data is big data. Truly enormous data. Uber came to me in 2016 to ask for help reviewing terabytes of data they had already collected, to see if I could help them identify the fraud. I turned them down instantly, because there was no way to know whether the data was accurate or had been tampered with. Years later, I was proven right, when court documents from one of the fraud cases Uber won showed the ad tech vendor had fabricated the "log-level" data in its entirety -- "let's spin up more BS to Uber." Some vendors didn't even run any ads; they just took Uber's money and sent "transparency reports" made up out of thin air -- falsely claiming that ads ran on mainstream sites and mobile apps.
So, having lots of data is entirely meaningless unless you have a way to know whether the data is real, let alone accurate. Having a ton of data also presents many other challenges that most advertisers are not equipped to handle in the first place. How do you transport and store terabytes of data? How do you clean and standardize the data so even the most basic charts can be made from it? Which data should be used to generate insights that are pertinent to the business outcomes of the digital ad campaigns, or even just to optimize the campaigns themselves? All of the above require not only specialized tech but also specialized people -- data scientists and analytics team members.
Sadly, over the last 20 years, I've witnessed many gaps and insufficiencies. For example, the data scientists of a now-defunct fraud detection firm rightly identified hundreds of billions of fraudulent bid requests that appeared to come from sports domains like nfl.com, mlb.com, espn.com, dallascowboys.com, etc. But because they didn't understand how ad tech worked, they published a press release incorrectly claiming they had caught a giant botnet they dubbed "Sports Bot." That botnet didn't exist, and major sites like ESPN and MLB were not overrun with bots. In fact, no bots were visiting those sites at all. The enormous quantities of faked bid requests were generated by Python scripts on servers; no bots were needed to visit those websites. As a result of the inaccurate press release, the publishers whose sites were named got frantic calls from advertisers and agencies asking about their "exposure" to Sports Bot. This was but one example of analytics teams looking at the big data correctly, but not understanding the tech or digital advertising sufficiently -- i.e. a "gap."
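To make concrete how little effort this takes, here's a hypothetical Python sketch of the technique: a server-side script fabricating OpenRTB-style bid requests that merely claim to come from premium sports sites. The domains are the real site names from the story above, but the field names and values are illustrative; the point is that no bot ever visits the sites.

```python
import json
import random

# Hypothetical sketch: fabricating bid requests with spoofed publisher
# domains. Field names loosely follow OpenRTB; values are made up.
SPOOFED_DOMAINS = ["nfl.com", "mlb.com", "espn.com", "dallascowboys.com"]

def fake_bid_request(request_id: int) -> dict:
    """Return a fabricated bid request claiming to come from a premium site."""
    return {
        "id": f"req-{request_id}",
        "site": {"domain": random.choice(SPOOFED_DOMAINS)},  # spoofed; never visited
        "device": {"ua": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},  # copied string, not a real device
        "imp": [{"banner": {"w": 300, "h": 250}}],
    }

# A loop like this, running on a handful of servers, can emit billions of
# requests -- which is why the "Sports Bot" analysts saw huge volumes
# without any bots ever touching the named sites.
batch = [fake_bid_request(i) for i in range(5)]
print(json.dumps(batch[0], indent=2))
```

The takeaway: volume in the bid stream tells you nothing about what actually happened on a website, which is exactly the gap the analysts fell into.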
Big data is not accurate or actionable
Assuming you've stuck with me thus far, let's make this concrete with actual examples from ad tech and ad fraud. Take, for example, the two fraud vendors' reports below. Vendor A's spreadsheet shows the four largest buckets of ad impressions labeled "mobile in-app" where the app is not identified. Vendor B's report shows two of the largest rows marked as [tail aggregate] where the sites are not listed. What do you do with this data? Which fraudulent sites or apps do you add to your block list if they were not disclosed in your fraud vendors' reports? That's just how they like it: you keep buying their fraud detection services, but you don't have a way to optimize your campaigns by blocking bad sites and apps.
In another "makes no sense whatsoever" example, Integral Ad Science sends tag sheets with tags for every single ad creative. An advertiser that has 500 different ad creatives must have their media agency painstakingly add an IAS tag to each of the 500 creatives instead of one tag at the campaign level. Insanity, because bots don't care about the creative message in your ad; there won't be different levels of bot activity per ad creative. And even if there were, what does that mean, and what action would you take if creative 1 had 0.9% fraud versus creative 2 at 0.6% fraud? Utterly useless work. But media agencies love it because it generates more billable hours, and IAS loves it because it creates the appearance of more data -- data that is utterly useless and not actionable for the client.
When you buy programmatic ads, you pay for targeting, right? Most advertisers believe that programmatic ads can be targeted down to the individual cookie level -- in theory, the right ad to the right person at the right time. But what if I told you and showed you that the data is inaccurate or entirely missing most of the time? Academic studies over the years have shown that basic targeting on one parameter -- gender -- is less accurate than random; and on two parameters -- gender + age -- it is only accurate 12% of the time (roughly 1 in 8). The Excel table to the right shows that about half of the bid requests in the bid firehose have "unknown" gender and age. If these two most basic targeting parameters are unknown half the time, what are the chances that the hundreds of other targeting parameters and audience segments you are paying extra for are accurate? Right -- just like above, these are made up out of thin air, to separate you from your money as efficiently as possible. You're not targeting the right audiences, let alone the right individual cookie. You may not even be targeting people (because bots easily pretend to be any lucrative audience segment that advertisers want to pay more for).
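If you wanted to check this on your own slice of the bid firehose, the arithmetic is trivial. Here's a minimal sketch with made-up records; real bid request data would be tallied the same way.

```python
# Hypothetical sample of bid requests -- the records are made up for
# illustration; None means the field was missing/"unknown" in the request.
sample = [
    {"gender": "M",  "age": 34},
    {"gender": None, "age": None},
    {"gender": "F",  "age": None},
    {"gender": None, "age": 51},
    {"gender": None, "age": None},
    {"gender": "M",  "age": 29},
]

def unknown_share(records, field):
    """Fraction of records where a targeting field is missing/unknown."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

print(f"unknown gender: {unknown_share(sample, 'gender'):.0%}")  # 50%
print(f"unknown age:    {unknown_share(sample, 'age'):.0%}")     # 50%
```

Run that over your own log sample and you'll see the denominator problem: if half the rows have no gender or age at all, the fancy audience segments layered on top can't be better than the raw fields they're built from.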
Right analytics leads to right actions
Let me bring this home. If bigger data, like log-level data, doesn't help and is not actionable, what should advertisers do instead? Let me show you what I do with a toolset that was originally built for my own workflow -- FouAnalytics. When I started building FouAnalytics in 2012, I did so because I didn't trust anyone else's tech or anyone else's data. FouAnalytics was built with data accuracy top of mind, and with anti-tampering features. You can read more about it here: Cybersecurity measures built into FouAnalytics. I opened up the platform for others to use in 2020, so they too can look at the analytics and take specific actions.
For example, site owners that already use Google Analytics or Adobe Analytics can add FouAnalytics tags to the site to troubleshoot what GA and Adobe cannot resolve for them. In GA, you may see large spikes in traffic, but you can't see where it's coming from or what's causing it. In the FouAnalytics chart below, we color-code the spikes in dark red (bad bots) and provide supporting data so you can understand why the traffic was marked as a bot. Note that the same fingerprint (a unique device+browser combo) is repeatedly hitting the page, the platform is "Linux x86_64" (a server operating system, not Windows, Mac, iOS, or Android), and the user agent shows Chrome/41 (more than 60 versions out of date). Once you understand why we marked it as a bot, you can take action, like blocking the bot or subtracting the traffic from your analytics to make them more accurate. For more examples, see How Site-Owners Use FouAnalytics to Troubleshoot Bot Traffic.
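To show why those three signals are damning when they appear together, here's a hypothetical sketch of that style of red-flag check. This is NOT FouAnalytics' actual detection logic -- the threshold, the "current" Chrome version, and the field names are all assumptions for illustration.

```python
from collections import Counter

# Assumed "current" major version for this example only.
CURRENT_CHROME_MAJOR = 120

def bot_red_flags(hit: dict, fingerprint_counts: Counter) -> list:
    """Return the red flags for a single pageview hit (illustrative only)."""
    flags = []
    # 1) Same device+browser fingerprint hammering the page over and over.
    if fingerprint_counts[hit["fingerprint"]] > 100:
        flags.append("repeated fingerprint")
    # 2) Linux x86_64 is a server OS, not a consumer phone or laptop.
    if hit["platform"] == "Linux x86_64":
        flags.append("server operating system")
    # 3) A browser 60+ major versions out of date (e.g. Chrome/41).
    chrome_major = int(hit["ua"].split("Chrome/")[1].split(".")[0])
    if CURRENT_CHROME_MAJOR - chrome_major > 60:
        flags.append("ancient browser version")
    return flags

counts = Counter({"abc123": 500})  # this fingerprint hit the page 500 times
hit = {"fingerprint": "abc123", "platform": "Linux x86_64",
       "ua": "Mozilla/5.0 (X11; Linux x86_64) Chrome/41.0.2272.96"}
print(bot_red_flags(hit, counts))
# ['repeated fingerprint', 'server operating system', 'ancient browser version']
```

Any one of these flags might have an innocent explanation; all three together on the same hits, at spike volume, is what lets you label the traffic with confidence instead of guessing.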
Further, advertisers that already use DoubleVerify and IAS for fraud detection can upgrade their tools so they can see more and do more with FouAnalytics -- 6 of 6 advertisers are already opting to do so. Getting Excel spreadsheets with [tail aggregate] and low fraud numbers is fun and all, but it's not actionable, because you can't see which sites and apps were fraudulently eating up your ad impressions and budgets. With FouAnalytics in-ad tags monitoring your ad impressions, you get a summary of the most fraudulent sites and apps, sorted by largest volume first. These are the sites and apps with the largest negative impact on your campaigns, so you should review them first and decide whether to add them to your blocklist. If you agree with our labeling of various sites and apps, you check the checkbox and a list is compiled for you (like the screenshot below). You copy and paste these sites and apps into your block list, or send the list to your agency to do so. Like I said, this was created for my own workflow in auditing and optimizing digital campaigns for clients. Now you can use it too. For more details, please see How to Use the Domain App Report in FouAnalytics to Review Sites and Apps.
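The "sorted by largest volume first" step is the part that makes the data actionable, and it's simple to sketch. Here's a hypothetical version with made-up impression records and domain names; the real report is built from FouAnalytics in-ad tag measurements, not from a list like this.

```python
from collections import Counter

# Hypothetical per-impression labels -- domains and flags are made up.
impressions = [
    {"domain": "shady-app-001",     "fraud": True},
    {"domain": "shady-app-001",     "fraud": True},
    {"domain": "goodnews.example",  "fraud": False},
    {"domain": "fakesite.example",  "fraud": True},
    {"domain": "shady-app-001",     "fraud": True},
    {"domain": "fakesite.example",  "fraud": True},
]

# Aggregate fraudulent impressions by site/app, largest volume first,
# so the biggest budget-eaters surface at the top for review.
fraud_volume = Counter(i["domain"] for i in impressions if i["fraud"])
blocklist = [domain for domain, count in fraud_volume.most_common()]
print(blocklist)  # ['shady-app-001', 'fakesite.example']
```

Notice the contrast with a [tail aggregate] row: because every impression carries its actual domain or app, the output is a concrete, reviewable list you can paste straight into a block list.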
Hopefully this shows you that having bigger data (e.g. log-level data) is not always better. But having the right analytics, which shows you 1) why something is marked as a bot, and 2) which sites and apps are fraudulent and problematic, enables you to understand and act upon the data.
As always, if you have any questions, email or DM me. If you think the above is useful and others may benefit from reading it, feel free to share out and post.
Happy (fraud) hunting and optimizing (digital) campaigns.