Predictably Inaccurate: Big Data Brokers
John Lucker
Insurance Company Executive | Board Member | Senior Strategic and Executive Advisor | Insurtech & Business Innovator
On average, people should be more skeptical when they see numbers. They should be more willing to play around with the data themselves. – Nate Silver
If you could see all the information that companies, from retailers to marketers, have collected about you, would you want to see it? In the digital world, we all leave footprints: details of where we go, what we do and how we do it. Every time someone accesses the internet through a computer, smartphone, or iPad, or even uses a credit card, another breadcrumb is added to the trail. Recently, there has been much societal discussion and debate about this kind of data collection, as people question what corporations know about them and what they are doing with this information. With companies able to predict whether someone is pregnant based solely on shopping habits, some concern may be warranted. Concerns about privacy and about methods of data collection and usage have been so front-and-center that in 2012 the Federal Trade Commission began an investigation of nine data brokerage companies to better understand how their data is mined and used by their customers (*1). In response to this public attention, some data brokers provided consumers limited access to a subset of their data. It was this new public availability of data that led us to explore what a typical data broker knew about us, and what the implications of what we found might be for the Big Data analytics that businesses build on top of such data.
Marketers know less about us consumers than we would have thought
Curious about the accuracy of the consumer data being collected, we requested our own data from one of the major consumer marketing data brokers that provides limited public access to it. We asked over 80 work colleagues from a variety of backgrounds and demographics to do the same. We then aggregated our colleagues' assessments of their data via a survey to better understand the accuracy of the personal information they obtained.
From the outset, one individual reported that the data broker could not locate any data under his current address and had to search using a prior state of residence; even then, his age, marital status, and education were incorrect. Another person found that the data broker incorrectly listed him as a parent. Two other individuals were listed as smokers despite never having touched a cigarette.
Big Data is often touted as a panacea for modern marketing. The volume of data available has exploded in recent years, and companies are clamoring to be on the cutting edge of data-driven insights. When internal data collection is limited on marketing prospects, many companies turn to data brokers in hopes of painting a more accurate picture of consumer affinities, propensities and aversions. Since these data come with a price tag, marketers have lofty expectations of improved customer insights and bottom lines.
With all the investment, infrastructure and promise for Big Data, there is one question that surprisingly has not been asked enough:
How accurate is some of this Big Data?
Our study revealed numerous issues that could affect both the companies using this data and the consumers whose digital identities it describes. The data is not as accurate or complete as we hoped or expected. Thirteen of the 80 participants (about 16 percent) reported that no information was available at all, even after trying multiple addresses. At a minimum, the data collection methods of this particular broker may be less than comprehensive.
Upon further investigation, we found that the participants who were unable to access their data were predominantly foreign-born. While it is conceivable that people who have recently come to the United States may not leave the same digital exhaust, most of those surveyed have resided in the United States for many years and actively participate in the consumer economy: owning homes and cars, raising families, and shopping online.
What is the impact of incomplete consumer data? Any marketing strategy or insights based on the data risks missing a significant fraction of the population. Not only do these individuals escape the marketing net, but their absence may skew the overall picture of the potential consumer base.
While these data absences are the first issue, they are not the only concern that arose from our study. Those who did successfully obtain a view of their data found the accuracy to be mediocre. Examples of findings from our survey:
“It said I am a high school graduate (I have a PhD) and that I'm married (I am not).”
Survey participants were asked to score their data for accuracy. The average rating was approximately 50 percent, indicating that roughly half of the data points were accurate. Even more surprising, some of the inaccuracies involved the most basic demographic information: marital status, gender, and age. Take age, for example, which we expected to be highly accurate given that date of birth was required as a key to retrieve the data in the first place. Unfortunately, this was not the case. One participant born in 1987 was listed as 69 years old. Another had a date of birth that was correct in the lookup process but showed the wrong month and a day of “00” in the main demographics section of the report.
“I have owned the same home for over 9 years; they show no record of it”
Another concern was inaccuracy in data elements, such as home purchases or vehicle ownership, for which a clearly documented public paper trail exists. Several respondents pointed out that the data on their home, their car, or both was completely wrong or out of date by several years. Certainly a car purchased in the last month may not have made it into the database, but the absence of cars purchased several years ago suggests that the data is far from current. When data tied to transactions with such a lengthy paper trail is missing or outdated, the comprehensiveness of the entire data collection process comes further into question. Useful marketing campaigns require current and accurate information; otherwise marketers run the risk of interacting with consumers in ways that are not relevant or timely, and are potentially adverse.
“I wish this was all I spent in a year!”
Consumer expenditure data can help a company assess the amount of disposable income available for each consumer. Where and how consumers have spent in the past is indicative of where and how they may spend in the future, so marketers rely on this information to assess potential customer value and how consumers can be cross-sold or up-sold. Given the importance of this data to marketers, we expected higher quality information from the data broker. In our survey sample we found this data quite inaccurate. Respondents indicated that the broker-reported income was often substantially underestimated or outdated. Similarly, many respondents found their annual expenditures to be understated. In one case, the data report listed an individual’s yearly expenses as 458 dollars, likely capturing a few online transactions and reporting them as total annual expenditures. Given these inaccuracies in modeled income and observed spending patterns, if marketers overly rely on this data to target customers, they risk aiming at the wrong targets or drawing inappropriate conclusions.
“I appear to be interested in absolutely every category...sewing...really??”
Data brokers offer information grouped into several categories: the aforementioned housing, vehicle, and spending data, as well as consumer preferences and interests. The most accurate category, according to our survey, was consumer interests. Most respondents indicated this section was accurate, but that perceived accuracy may be clouded by the nature of some of the data points. In some cases the listed interests were so broad or vague that they could apply to almost anyone. In fact, nearly one in five of those surveyed reported that almost every interest category was marked as applicable to them, while they believed such generality applied more to their household than to them individually. Listing only general interests may make the data appear more accurate, but the specificity marketers actually need is lost. One common inaccuracy was an interest that simply did not resonate with the individual; in some cases the listing could plausibly be traced to a single purchase, and in others it did not tie to the individual at all.
Despite the inaccuracies at the individual level, these datasets still have significant predictive value at the group level.
The Law of Large Numbers tells us that as the sample size increases, the sample average tends toward the mean of the population being sampled. But that does not tell the whole story. As our survey results show, the data might be directionally correct yet still inaccurate at the individual level. If data elements such as age, homeownership, and annual spending were generated at random, they should have little predictive value for marketers. The paradox is that many companies that buy data from brokers successfully use these types of data to predict purchasing behavior, offer responsiveness, and other consumer behaviors.
How can the data be inaccurate and valuable at the same time? We propose a potential explanation: the data is semi-accurate, with the inaccuracies distributed more or less at random. Big Data analytics built on such data can show predictive power overall, even though many individual predictions are highly inaccurate; enough individual records are accurate to drive lift (predictive power) in a marketing campaign. In other words, there is a reasonable correlation between the actual underlying characteristics and those maintained by the data broker. Companies using these data elements for marketing or other purposes might be getting about half the value, if the data is roughly half accurate.
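To make the semi-accurate argument concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than data from our study: the field names, the purchase-propensity formula, and the 50 percent corruption rate are invented solely to show the mechanism. The point is that attributes that are wrong for roughly half of individual records can still rank prospects better than chance, while delivering well below the lift that accurate data would support.

# Illustrative simulation (assumed parameters, not study data): broker-style
# attributes that are wrong for roughly half of individuals can still
# produce aggregate lift when used to rank marketing prospects.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 50_000

# "True" consumer attributes (hypothetical units).
income = rng.normal(70, 20, n)               # household income, in $000s
age = rng.integers(20, 80, n).astype(float)  # age in years

# True purchase propensity depends on the true attributes.
logit = 0.05 * (income - 70) - 0.02 * (age - 50) - 1.0
purchased = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Broker's copy: about half of the records are replaced with random values,
# mimicking inaccuracies distributed roughly at random.
corrupt = rng.random(n) < 0.5
broker_income = np.where(corrupt, rng.normal(70, 20, n), income)
broker_age = np.where(corrupt, rng.integers(20, 80, n), age)

def holdout_auc(features):
    """Train on the first half, score the second half, and report AUC
    (0.5 means no better than random targeting)."""
    model = LogisticRegression(max_iter=1000).fit(features[: n // 2], purchased[: n // 2])
    scores = model.predict_proba(features[n // 2 :])[:, 1]
    return roc_auc_score(purchased[n // 2 :], scores)

print("AUC with accurate attributes:        ", round(holdout_auc(np.column_stack([income, age])), 3))
print("AUC with half-corrupted broker copy: ", round(holdout_auc(np.column_stack([broker_income, broker_age])), 3))

Under these assumptions, the corrupted copy scores above chance but well below the accurate attributes, mirroring the "about half the value" pattern described above.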
It is tantalizing how close the data brokers are to the big prize. If they could improve the quality of the data, they could increase its value to marketers by orders of magnitude. The ability to gather and leverage detailed, accurate information about current and potential customers would allow marketers to better tailor advertisements and offers to individual consumers, match the right offers to the right customers, and reduce offers to disinterested ones. As an example, if a company knew a person moved from Florida to Connecticut in September, it might offer that individual cold-weather products. In today's fast-paced, competitive marketplace, micro-marketing and micro-segmentation at the consumer level can provide a competitive advantage, and more accurate data would allow companies to fully exploit this tool rather than settling for a “better than nothing” improvement or a “spray and pray” mass-marketing approach. While the current quality of broker data allows for some directional insights, particularly as the sample size increases, the inaccuracies inhibit accurate targeting at the individual consumer or household level, forcing continued reliance on less precise consumer segments and categories.
So what can users of Data Brokerage products do while the data quality remains suspect?
Some have defined Big Data by the three V's: Volume, Velocity and Variety. We propose that a fourth V, for Value, be added to the discussion. Marketers and analysts can have a large variety of frequently updated data available for marketing analytics, but the three original V's alone will not lead to informative insights and bottom-line improvements if the data itself is suspect. Without Value, data brokers and collectors are merely selling directionally correct consumer segment data. We suggest a framework for customers of data brokers to consider when attempting to establish the value of consumer data.
1. Know the sources - Don't blindly accept data. Without a thorough understanding of where the data originated and of any subsequent transformations applied to it, it is impossible to judge its true validity. Intimate knowledge of the sources helps highlight potential issues and limitations. How was the data collected? How much external validation has been performed? At what level is the data collected? When was it collected, and how often is it updated? We have seen that data available from brokers is not guaranteed to be current; it may have been updated last week, last month, or last year. Our experience found that the data did not reflect more recent life changes: marriages, home purchases and sales, car purchases and sales, births of children, or new degrees. These are important life events that should be on any marketer's radar, and if the data for them is not accurate, marketers will not have a complete picture of their customers.
2. Explore the data - Exploratory data analysis is a key component of properly understanding and using a dataset in any business context. Beyond the documentation, create histograms, look at correlations or heat maps, and visualize values and basic statistics (see the exploratory-analysis sketch after this list). There is no substitute for digging into a dataset with both hands to truly understand its details, and external datasets are no exception. Keep in mind that if there are inaccuracies in the data, additional exploratory analysis may be necessary. It is easy to be intimidated by thousands of columns and potentially huge numbers of records, but beyond knowing the sources, one has to get one's hands dirty to begin to comprehend any new data source.
3. Be practical - The famous statistician George Box said, “All models are wrong, but some are useful.” Analysis in the real world often means employing a model that predicts well even though it may not be theoretically perfect. Is a complex model required, or can a simpler model arrive at the same answer in a fraction of the time? As datasets grow larger, a simpler model that performs nearly as well may become the optimal choice (see the model-comparison sketch after this list).
4. Be rational - Real-world data analysis does not occur in a vacuum; we use data to inform business decisions. Despite innovations in data and analytics, experience still matters. Marketing departments should know the intricacies of their consumer base, and data should supplement that understanding. If the data suggests something completely counterintuitive, the cause may be data quality. Be critical. Gather more data, conduct data quality checks, and supplement quantitative data with qualitative experience. We are not advising anyone to react reflexively to the data, nor to simply ignore it; we are advocating the conscious consumption of data and the insights it provides.
5. Adult supervision - Hiring data scientists who understand the nuances of statistics, data, and the limitations inherent in Big Data sources can greatly enhance the value of any dataset. With the progress made in analytics software, it is tempting for companies to invest heavily in tools in the hope that the most advanced software will deliver the best return on investment. We would argue that it is not the tool that will overcome these data challenges but the person using it. A computer can only “see” so much: it can identify a pattern in the data, but it cannot explain why the pattern exists. As we have said, no model is perfect. Companies need to focus on hiring data scientists with a deep knowledge of statistical theory and the intricacies of data analysis, as well as the curiosity and entrepreneurial spirit to push their thinking toward new innovations. Companies would reap greater benefits from this consumer marketing data if they invested in talent that understands the strengths and limitations of the data and can identify “suspect” data and insights. In essence, with the appropriate “adult supervision” of practitioners who understand statistical theory and practice, much of the dirt in the data can be washed away.
If a company cannot hire new personnel, it should consider leveraging business experts: people who know both the industry and the nuances of data analytics. Experts who understand the business problem are critical to providing insight into the datasets we use. Data scientists have experience with the tools and techniques, but we often lose track of the subtle business questions that can make or break an analysis. A business expert can often review and validate (or invalidate!) data-driven conclusions with a brief “smell test.”
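As a companion to point 2 above, here is a minimal exploratory sketch in Python. The file and column names (broker_extract.csv, annual_spend, year_of_birth, listed_age) are hypothetical stand-ins for whatever a broker extract actually contains; the habit it illustrates is profiling missingness, distributions, and internal consistency before trusting any field.

# Exploratory profiling of a hypothetical broker extract
# (file and column names are illustrative only).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("broker_extract.csv")  # hypothetical file

# 1. How complete is each field? High missing-value rates flag thin coverage.
print(df.isna().mean().sort_values(ascending=False).head(20))

# 2. Do basic distributions look plausible? Outliers such as a $458
#    "annual spend" stand out immediately in a histogram.
df["annual_spend"].plot.hist(bins=50, title="Reported annual spend")
plt.show()

# 3. Are related fields internally consistent (e.g., listed age vs. year of birth)?
implied_age = pd.Timestamp.today().year - df["year_of_birth"]
mismatches = (df["listed_age"] - implied_age).abs() > 1
print("Records where listed age and year of birth disagree:", int(mismatches.sum()))

# 4. Quick correlation scan across numeric fields (a heat map works as well).
print(df.select_dtypes("number").corr().round(2))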
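And as a companion to point 3, a small sketch of the practicality check: fit a simple baseline and a more complex model on the same response data and compare holdout performance before committing to the heavier approach. The synthetic dataset here is purely illustrative.

# Compare a simple and a more complex model on the same (synthetic)
# marketing-response task; if the simple one is close, it may be the
# more practical choice.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a response dataset (illustrative only).
X, y = make_classification(n_samples=20_000, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression (simple)": LogisticRegression(max_iter=1000),
    "gradient boosting (complex)": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: holdout AUC = {auc:.3f}")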
What should Data Brokerage companies consider to improve their products?
We have suggested how analytic users of broker data can better understand the limitations and strengths of typical consumer and household marketing data. We also have several suggestions for data brokers who are looking to expand the utility of their data products, provide greater consumer transparency, and foster more harmonious relationships with regulators, politicians, the public at large, and end customers.
1. Increase transparency - Data brokers should consider publishing the source, the frequency of updates, the basic process for updating, and any special considerations for all data fields within their data products. The data should be traceable in some form back to the original source of information. It would be useful and important for consumers to understand exactly where their information comes from and, potentially, how to make corrections at the source. Data brokers should explore ways to convey this information without disclosing their intellectual property or trade secrets.
2. Proactively inform and provide correction advice - Most data brokers hold a significant number of email addresses, as well as less technological ways to contact consumers. A process could therefore be constructed at reasonable cost to proactively reach out to consumers with a snapshot of their individual or household data record. This information could be conveyed in a user-friendly format with explanations and links that let consumers fix data in error. If consumers were incentivized to review their data profiles and given the opportunity to correct their data proactively in the system or file of record, it would be harder for them to complain about incorrect data in a data broker's file. The opportunity to review and correct personal or household data could be pushed to consumers, instead of the current approach of requiring them to visit data broker websites or request reports by mail if they happen to have the inclination and curiosity.
3. Master Data Management (MDM) - Master Data Management is a discipline for ensuring that data is properly related to other data, consistent in form and content, and logically traceable back to a single source. Our research revealed some obvious MDM issues in which a field has one value in one part of the data broker's system and a different value elsewhere (the date-of-birth discrepancy discussed above, for example); a minimal version of the kind of consistency check we have in mind is sketched after this list. While we are sure there is a logical technical reason for such inconsistencies, the mere fact that they exist suggests a deeper, holistic problem that needs attention.
4. Assess the holistic data management process in light of changing demographics - In the same way that virtually every area of the business world is assessing how to better serve a changing demographic landscape, data brokers should consider how they too might improve their products given these population shifts. For example, the problems we saw in our research with data on US-based consumers who are foreign-born but well established in the US seem to require significant attention. The patterns in our results indicate core MDM and data strategy issues worthy of priority focus.
5. Are the data sources used the most accurate or complete? - We previously discussed several categories of broker data that appeared to have systemic accuracy issues. Data brokers should holistically assess whether they are obtaining their data from the most timely and accurate root sources. We know that some of the problematic data is maintained by government entities that make the information available through Freedom of Information laws or subscription mechanisms. Inconsistencies between official government-maintained data and the data that has made its way into data broker repositories seem preventable.
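To make the MDM point (item 3 above) concrete, here is the kind of cross-field consistency check we have in mind, sketched in Python against a deliberately simplified, hypothetical record layout in which the date of birth is stored in two places.

# A minimal MDM-style consistency check (hypothetical record layout):
# the same attribute stored in two places should agree, and placeholder
# values such as a day of "00" should be flagged rather than published.
import pandas as pd

records = pd.DataFrame({
    "consumer_id":  [1, 2, 3],
    "dob_key":      ["1987-06-14", "1975-01-02", "1990-11-30"],  # DOB used to key the lookup
    "dob_reported": ["1987-06-14", "1975-03-00", "1952-11-30"],  # DOB shown in the demographics section
})

key = pd.to_datetime(records["dob_key"], errors="coerce")
reported = pd.to_datetime(records["dob_reported"], errors="coerce")  # a "00" day becomes NaT

records["invalid_reported_dob"] = reported.isna()
records["dob_mismatch"] = key.notna() & reported.notna() & (key != reported)

# Any flagged record should be reconciled before it reaches a customer.
print(records[records["invalid_reported_dob"] | records["dob_mismatch"]])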
Greater Attention to Detail Is a Win-Win for All
With data privacy, data accuracy, and the fairness of data usage continuing to be front and center in the public and regulatory dialogue, it is important for purveyors of consumer and household marketing data to reassess whether their business processes are as effective as the marketplace demands. The data brokerage industry has a significant opportunity to enhance its role and reputation in the information economy through investment in new processes, systems modernization, and innovative data acquisition and aggregation. Once that is done, greater synergy seems possible between consumers and businesses, as more accurate marketing data creates more bulls-eye value across the entire marketing ecosystem. As Henry Wadsworth Longfellow famously said, “It takes less time to do a thing right than to explain why you did it wrong.”
Follow me on Twitter (@johnlucker) and email me at [email protected]
Reference
*1 - https://www.ftc.gov/news-events/press-releases/2012/12/ftc-study-data-broker-industrys-collection-use-consumer-data
Authors
John Lucker is a principal and practice leader with Deloitte Consulting LLP and is the Global Advanced Analytics & Modeling Market Leader.
Ashley Daily, PhD is a consultant with Deloitte Consulting LLP’s Advanced Analytics & Modeling practice where she is an expert on how demographics influence business processes.
Adam Hirsch, FCAS, MAAA is a manager with Deloitte Consulting LLP’s Actuarial, Rewards and Analytics practice where he is a Property & Casualty actuary specializing in forecasting and trending analyses.
Michael Greene is a senior manager with Deloitte Consulting LLP’s Advanced Analytics & Modeling practice focusing on innovating new ways to improve business operations and results using data science.