The Temple of Data.
The world seems to have accepted without question that the more data you have, the more you understand and the better decisions you can make. If you are part of the business community and not “data driven” then your chances of attracting investors are very remote. Somehow, we have got to a point in the business world where, even if you are running at a loss, hauling in massive troves of customer data is a badge of honor, all made acceptable by the belief that someday you will make sense of it all and find a way to make money from those customer data points. The consensus seems to have become that data solves all the world’s problems.
It all starts with a question.
Every company that signs up for that “data driven” badge of honor is going to start by asking their customers a few questions. Often it all starts with something as simple as “Please rate your experience of doing business with us on a scale of 1 to 10”. Then some clever person in the office goes online and sets up a survey, sends an email to your fifty customers and sits back feeling good about a job well done.
Days later, everyone gathers in the front conference room and looks at a wonderful-looking chart of all the tabulated responses. Your inspired conclusion is that your customers love your company, which then becomes the second slide in every investor pitch for the next decade.
At this point, we should stop for a moment and think about the numbers a bit. We went off and asked fifty people to choose from ten discrete options. This means that, on average, five people will have selected each option. In our example we can see that more people selected option seven than any other, giving us a level of confidence that our company is on the right track.
One important observation when running a survey of this kind is that we should always try to have far more people responding than there are possible responses on it: if you do not then there will simply never be enough results to spot any trends. If our survey had just asked two customers for their input, then it would simply not be possible to conclude whether the company was doing well or not.
The ball starts rolling.
To graduate to “data-driven, level 2”, you soon learn that it would be much better to ask all your customers more than one question. What you clearly should now do is follow up with an amazing question like “Please rate how accurate you found our website on a scale of 1 to 10”.
As you know by now, I am not actually that interested in the results of the survey itself. What is far more interesting is to consider that if both the first and second questions had ten discrete options each, then there would be one hundred possible outcomes from which any customer could choose. Unfortunately, in our survey we only asked fifty customers, meaning that on average you do not even have one response for each possible outcome. While you might spot that there are more responses in one area than another, it should become clear that even with just two questions it is getting increasingly hard to come to firm conclusions from the survey data.
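The arithmetic above can be written down as a small sketch. The function name `responses_per_outcome` is made up for illustration, and the calculation assumes every question offers the same number of options.

```python
def responses_per_outcome(num_customers: int, num_questions: int,
                          options_per_question: int = 10) -> float:
    """Average number of responses per possible combination of answers."""
    # Each extra question multiplies the number of possible outcomes.
    possible_outcomes = options_per_question ** num_questions
    return num_customers / possible_outcomes


# Fifty customers, one question: ten possible outcomes.
print(responses_per_outcome(50, 1))
# Fifty customers, two questions: one hundred possible outcomes.
print(responses_per_outcome(50, 2))
```

With one question there are five responses per outcome on average; with two questions that drops to half a response per outcome.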
When you show this data to your marketing team, someone will almost certainly point out that you can solve all of these problems by simply asking more people to respond to your survey; and thanks to your new “data-driven” outlook on life, combined with the new investment that it brought into the company, you have more customers than before. So, you go out into the world and ask five hundred customers to answer your survey. When all the results come in, a firm conclusion becomes obvious: people who “find your web site clear” are more likely to find “doing business with you easier.” You now have another great slide for that investor pitch!
The curse of dimensionality.
About now, your company has grown up enough that the marketing team has made the bold decision to hire an external consultant to take your surveys to the next level. In collaboration with their "expert focus group", you produce twenty questions to put to your customers; since you are now paying a consultant real money for this survey, it is worth asking more questions!
Your amazing new consultant who does the data analysis will not tell you this, but if each of the twenty questions has a response range from 1 to 10, then the set of answers any individual gives is one of one hundred quintillion possible combinations (that is a one followed by twenty zeros!). Even if you had every person on planet Earth as a customer, you would not have enough responses to come remotely close to filling every box in a grid of all the possible options, let alone have enough responses in each box to draw any firm conclusions.
To help illustrate the scale of this problem: if you give your questions to one million customers, then you would have only a single customer response for every 100,000,000,000,000 possible options. This is much like the emptiness of space: while you see stars in the night sky, they are so far away from each other that light can take millions of years to travel between them. Your data is much the same; the space of all options is essentially empty.
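For readers who want to check these figures, here is a minimal sketch of the same calculation using exact integers. The world population figure is a rough assumption of about eight billion, and the variable names are chosen only for this example.

```python
OPTIONS_PER_QUESTION = 10
NUM_QUESTIONS = 20
SURVEYED_CUSTOMERS = 1_000_000
WORLD_POPULATION = 8_000_000_000  # rough assumption, not a precise figure

# Every question multiplies the number of boxes in the grid by ten.
total_cells = OPTIONS_PER_QUESTION ** NUM_QUESTIONS

# How many possible answer combinations exist for each response received.
cells_per_response = total_cells // SURVEYED_CUSTOMERS

# The largest share of the grid that could be touched if everyone on
# Earth answered and no two people gave identical answers.
max_fraction_filled = WORLD_POPULATION / total_cells

print(f"{total_cells:,} possible combinations")
print(f"{cells_per_response:,} combinations per response")
print(f"{max_fraction_filled:.1e} of the grid filled at best")
```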
The most common measure of how similar two data points are is simply the distance between them. If the response from one customer is very close to the response from another then the views of those customers are similar, and if there are many customers close to each other then it is indicative of some level of agreement amongst a number of customers.
While it is tempting to think that the points might group together like galaxies do in space, as the dimension of your data (the number of questions you ask each customer) increases, mathematics has a very nasty surprise for us: the points start to become almost uniformly spread out in space and so can no longer be distinguished from each other. If you asked a vast number of questions (and vast might only mean about one hundred), then on average every star would be almost exactly the same distance from every other star, and you would no longer have any discernible way to look for clusters of similar customers at all; every customer would look equally different from every other.
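A small simulation can make this effect visible. The sketch below is illustrative only: it assumes every answer is an independent, uniformly random score from 1 to 10 (real survey answers will not be), and the function name `relative_spread` is made up for this example. It measures how much the distances between customers vary compared with their average.

```python
import math
import random
import statistics


def relative_spread(num_questions: int, num_customers: int = 60,
                    seed: int = 0) -> float:
    """Standard deviation of pairwise distances divided by their mean."""
    rng = random.Random(seed)
    # Assumption: answers are independent and uniformly random from 1 to 10.
    customers = [
        [rng.randint(1, 10) for _ in range(num_questions)]
        for _ in range(num_customers)
    ]
    # Euclidean distance between every pair of customers.
    distances = [
        math.dist(a, b)
        for i, a in enumerate(customers)
        for b in customers[i + 1:]
    ]
    return statistics.pstdev(distances) / statistics.mean(distances)


for n in (2, 20, 100, 1000):
    print(n, round(relative_spread(n), 3))
```

Under this independence assumption the ratio keeps shrinking as questions are added, roughly in proportion to one over the square root of the number of questions: the distances between customers become nearly identical, which is why clusters get hard to see.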
The meaning of this is truly profound and unexpected: as you seek more information by asking more questions, your ability to deduce any clear conclusions from the results decreases to almost nothing. Our intuitions about data in high dimensional spaces are very often wrong.
Seeking answers in the noise.
You probably started reading this article thinking that it would give you some great insights into understanding data; about now a crushing disappointment is starting to set in. I have made the case that the more data you collect, the harder it is to make any sense of it. This runs totally counter to what we do every day when we do things as basic as reading the results of a customer survey.
In the real world, the practical way we work around all of this is by simply not looking at all responses to all questions at once; instead we work on small subsets at a time. A common choice is to select two questions, for instance “how much do you like our company?” and “how likely are you to buy a product from us?”, and look at how they correlate. In mathematical terms, what we are doing here is taking our “twenty question space”, ignoring 18 of the questions, and projecting it down onto a “two question space” that is far more manageable.
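As a rough illustration of that projection, the sketch below keeps two columns of a response table and computes the Pearson correlation between them. The response values are invented, only four questions are shown instead of twenty to keep it short, and the helper names `project` and `pearson` are made up for this example.

```python
import math


def project(responses, question_a, question_b):
    """Keep only two questions from every response, ignoring the rest."""
    xs = [row[question_a] for row in responses]
    ys = [row[question_b] for row in responses]
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented answers: one row per customer, one column per question.
# Column 0: "how much do you like our company?"
# Column 1: "how likely are you to buy a product from us?"
responses = [
    [7, 8, 3, 5],
    [4, 5, 9, 5],
    [9, 9, 1, 6],
    [2, 3, 6, 4],
    [6, 6, 5, 5],
]

likes, buys = project(responses, 0, 1)
print(round(pearson(likes, buys), 3))
```

A value near 1 means the two answers rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship is visible in this two-question view.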
Our excitement at having found a solution to our data woes is short-lived, however, because there is a new problem hiding in plain sight: why did we choose these two specific questions? The answer of course is that it seemed logical to choose these two, because if “a customer likes our company” then they are also “more likely to buy one of our products”. But when you are choosing the questions based on what you think the data should show you, are you really learning anything at all?
The unbiased way to look for better meaning in our survey would be to look at all 190 pairs of questions (twenty questions taken two at a time) and observe which pair gives us the most decisive insight. This is how you find those unexpected discoveries in data that are not what you might have guessed: who might have thought that the strongest correlation would be between “how much do you like our company?” and “how many years of education do you have?”
Astute readers are now going to ask: if we want to learn unbiased information about the data, then why select question pairs and not combinations of three, or four, or even all twenty questions? The rather obvious but inconvenient answer is that there was no good reason at all. If we are genuinely interested in our data then we should not be considering just the 190 possible question pairs, but rather every combination of any number of questions, of which there are more than a million, and each and every one is going to give you a different view on the data.
If you take ten dice and roll them hundreds of millions of times, then eventually by chance you are going to roll all sixes (the odds are roughly one in sixty million on each roll). As much as this seems an amazing coincidence when it happens, it is nothing more than the laws of probability and large numbers: do something enough times and there will be an illusion that something special happened. In much the same way, if you consider all the possible question combinations in our survey, there are so many that by pure chance you will always find some that seem to provide amazing insight, but just like rolling all sixes on the dice, this insight is nothing more than an illusion.
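The same illusion can be reproduced with a short experiment. The sketch below fills a twenty-question survey with purely random answers, so by construction no question is related to any other, and then searches every pair of questions for the strongest correlation. It counts unordered pairs, of which there are 190 (380 if each pair is counted in both orders). The data, the seed and the helper name `pearson` are all assumptions made for this example.

```python
import itertools
import math
import random

NUM_QUESTIONS = 20
NUM_CUSTOMERS = 50


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
# One list of answers per question; every answer is pure noise.
answers = [
    [rng.randint(1, 10) for _ in range(NUM_CUSTOMERS)]
    for _ in range(NUM_QUESTIONS)
]

# Every unordered pair of questions.
pairs = list(itertools.combinations(range(NUM_QUESTIONS), 2))
# Every non-empty subset of questions, counted rather than listed.
num_subsets = 2 ** NUM_QUESTIONS - 1

strength = {pair: abs(pearson(answers[pair[0]], answers[pair[1]]))
            for pair in pairs}
best_pair = max(strength, key=strength.get)
strongest = strength[best_pair]

print(len(pairs), "pairs,", num_subsets, "subsets")
print("strongest 'insight':", best_pair, round(strongest, 2))
```

Even though the answers are noise, the best of the 190 pairs tends to show a correlation that would look meaningful if it were the only pair you had examined. Searching more combinations only makes such accidental findings more likely.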
The curse of dimensionality haunts us again, and each time we try to find some meaning by considering the full dimensionality of data we simply find a different version of the same basic problem.
Meaning in the noise.
As much as we all want there to be easy solutions to these problems, there are none. As the dimension of data increases it gets increasingly hard to truly understand what it tells us. As more and more of the decisions that impact our lives are made using data, it should be a wake-up call that many of the intuitions about it that we all seem to take for granted start falling away very quickly when we look at them closely.
Each time I hear about “data driven” companies and decisions, I cannot help wondering whether people truly understand the depth of what is in their data, or whether all they are seeing is a reflection in a mirror showing them exactly what they went looking for. All too often, we use data to try to measure things when we would be much better served taking the time to truly understand ourselves.