Data Illusions - Summarizing War and Peace

I - Summarizing War and Peace

War and Peace has 361 chapters, 587,287 words, and 559 characters! It would take an average person about 33 hours to read the entire book, and some scholars have spent their entire lives dissecting the masterpiece. How would you briefly explain the book to someone who has never read it?

The shortest "summary" could be something like: "War and Peace broadly focuses on Napoleon's invasion of Russia in 1812."

A slightly more elaborate summary, one that hints that the book has some romance as well, could be: "The book follows three of the most well-known characters in literature: Pierre Bezukhov, the illegitimate son of a count who is fighting for his inheritance and yearning for spiritual fulfillment; Prince Andrei Bolkonsky, who leaves his family behind to fight in the war against Napoleon; and Natasha Rostov, the beautiful young daughter of a nobleman who intrigues both men. As Napoleon’s army invades, Tolstoy brilliantly follows characters from diverse backgrounds—peasants and nobility, civilians and soldiers—as they struggle with the problems unique to their era, their history, and their culture. And as the novel progresses, these characters transcend their specificity, becoming some of the most moving—and human—figures in world literature."1

These summaries are at best "better than nothing," but none of them does justice to Tolstoy's masterpiece. In fact, the only real advice for your friend would be to read the entire book slowly and meticulously, noting every single character in a notebook so they don't get confused by the intertwining plots.

And yet, it's on similarly imperfect summaries of data that many business decisions are made: shops are closed because their quarterly results are below expectations, salespeople are promoted or fired because of encouraging or disappointing yearly performances, product lines are discontinued because they are less profitable overall than other product lines ... While these decisions are sometimes the right call, I will show in the following article why it is often wrong to make them without a proper understanding of the larger data context.

So are we supposed to “read” all the data?

Imagine that you want to make sense of a data set roughly as sizable as War and Peace (say ~500,000 data points). What's the best way to understand it, describe it, draw insights, and act on it? With the above parallel in mind, you should probably spend a few months looking at the entire data set, observing and making sense of every little data point and how it connects with the others. This, of course, would be impractical: not only would it take a very long time to observe all the data, but more importantly, our human mind does not have the capacity to assimilate or make sense of thousands of disparate data points. When reading a novel, for instance, our brain has an innate ability, trained since childhood, to build a coherent narrative connecting concepts together, even when we encounter those concepts weeks or months apart. Our brain has a much more limited capacity to build a coherent picture from disparate data. To some extent, understanding 500,000 data points the way we understand the 500,000 words of War and Peace will, for now, remain a distant dream.

As a workaround, we analyze and summarize data using statistical descriptors such as averages, medians, and standard deviations, and we use tools such as correlations, clustering, and pattern analysis. We also use technology, and increasingly artificial intelligence, to dramatically increase the speed at which we can analyze and summarize the data.2 These approaches are designed to build a "bridge" between the complexity of the data and the limited understanding of our human minds by simplifying the message, but without proper usage, they can also create an incomplete, even erroneous, picture.

Hypothetically, if there were a super-human mind that could make sense of raw data in the form of millions of data rows, that would be a way to actually achieve a perfect understanding of every single data point. (You may argue that artificial intelligence is some version of this. I do think there are opportunities for A.I. to enhance this understanding, but that should be the topic of a separate article.) This super-human mind would of course view any attempt to "summarize" the data as a downgrade from its perfect understanding. And that's how we should view statistical descriptors of data, in particular averages and variances: as convenient but imperfect ways to look at the data. Quite often they carry inherent biases and can lead to outright wrong conclusions. They are perfect examples of where a "half-understanding" can actually be worse than "no understanding at all."

In the following series, I will illustrate, with a few practical examples, some of the judgment errors and biases we make when we "jump to the conclusion" using these summaries, and explain why the "obvious answer" is not as obvious as it seems. Each chapter will cover one topic, such as: why averages can be misleading, averages vs. weighted averages, perception errors of small samples, cognitive biases, the illusion of "non-randomness" ...

Who is this for?

This series is helpful for both data authors (people preparing and presenting the data, like consultants) and data readers (recipients of the data, or people being lectured by consultants).

For data authors, the goal is NOT to provide advanced data analysis techniques. These are covered today in a large variety of books. In fact, the key challenge for data authors today is what I call the "Ferrari illusion": the illusion that you are a better driver because you're driving a Ferrari, when in fact, if you want to drive a Ferrari, you have to learn to become a better driver. Otherwise, all you are doing is increasing the chances of crashing when you step on that gas pedal. It can take less than a minute to create a pivot table in Excel with averages by category, and I am still amazed by how fast and easily tools like Tableau can summarize and create beautiful visuals of hundreds of thousands of data points with simple drag and drops. Drag left, drag right, and magically you have the fancy chart with the averages that you need. The tools are becoming increasingly powerful, fast, and fancy, but for the data author it can be tempting to confuse fanciness and power with accuracy. My goal in this context is actually to provide a reminder of some of the basic principles that even advanced data scientists sometimes seem to forget, just as experienced climbers often stumble on the easiest paths when they're not paying attention. And of course, the tools can then become extremely handy with the right focus in mind.

For "data readers" (people being presented the data), the goal is to help them ask the right questions and not fall into the traps set out for them, often intentionally, by the data authors. These "tricks" can be used deliberately, especially by expensive consultants, to substantiate the erroneous conclusions their senior partner already gave the client before anyone had the faintest clue what the data would say. They still get to sleep well at night because sometimes "the end justifies the means."

The examples below will serve to illustrate that we need to be skeptical and ask the right questions, especially when being lectured by consultants projecting immaculate slides. My general rule of thumb is: the more immaculate the formatting, the more polished the presenter, and the glossier the slides, the more suspicious you should be and the more questions you should ask.

In the following chapter, the first of the series, I will illustrate the concepts above for the most popular of the "summary" statistics: the average. I will show, with concrete examples, several situations where averages can be misleading:

  1. The data having more underlying layers than the layer at which it is being averaged
  2. Averages based on insufficient data points
  3. Averages conveying a misleading message when the data layers have different levels of dispersion

II - Why averages can be misleading - Never cross a river that’s 4 feet deep on average

A few years ago, I co-founded a catering business where we hire dozens of refugee and immigrant chefs and serve their delicious meals at team lunches and corporate events. The company's chefs have developed dozens of dishes drawn from their culinary heritage; while all exceptional in my mind, their cuisine can be more or less successful with the typical American palate. In our continuous effort to provide the best to our customers, we wanted to understand who our best chefs are, so we sent out a survey where customers could rate every chef's cooking. At each event we cater, each chef typically prepares one dish.

The average of the results for every chef is represented below:


The immediate conclusions are obvious:

  1. Faven has much lower ratings than her fellow chefs. We should probably fire her, or at least ask her to help in the kitchen instead of torturing our customers with unknown and apparently unappreciated flavors.
  2. Fatima is our best chef! Let's promote her and advertise her cuisine some more!
  3. Farid, Batsa, and Azar are equally great cooks with similar ratings. Nidal is a bit lower but still OK. Kosala is only at 70%; still OK, but we should probably feature her less.

A) How many layers does the data have?

Simple, right? Not so fast. It turns out that all of these conclusions were wrong, or at least misleading. Our survey did indeed ask each customer to rate the chef, and we know that each chef prepared a different dish at each event. What if, instead of averaging the results per chef, we averaged the results per dish? The results are below:

As we can clearly see, the most popular dish is the Egyptian Moussaka, and it is the creation of ... chef Faven! By relying on our per-chef average alone, we were about to fire the person responsible for our most popular dish! Chef Faven has two other dishes that were bringing her average down; those are manifestly not so popular with New Yorkers and should be eliminated. But she clearly has talent, and we should encourage her to create some new dishes.
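To make the layering concrete, here is a minimal sketch in plain Python. The ratings below are made up for illustration (only the Egyptian Moussaka and the chef names come from the story; the other dish names and all the numbers are hypothetical). The point is that the same records produce very different rankings depending on the grouping key:

```python
from statistics import mean

# Hypothetical survey records: (chef, dish, rating out of 100).
ratings = [
    ("Faven", "Egyptian Moussaka", 95), ("Faven", "Egyptian Moussaka", 92),
    ("Faven", "Dish B", 40), ("Faven", "Dish C", 35),
    ("Fatima", "Dish D", 90), ("Fatima", "Dish D", 88),
]

def group_mean(records, key):
    """Average rating per group, where `key` extracts the grouping field(s)."""
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec[2])
    return {k: mean(v) for k, v in groups.items()}

by_chef = group_mean(ratings, key=lambda r: r[0])          # one layer
by_dish = group_mean(ratings, key=lambda r: (r[0], r[1]))  # two layers

# Per chef, Faven looks weak (two unpopular dishes drag her down);
# per dish, her Moussaka is the top performer.
print(by_chef)
print(by_dish)
```

The same idea extends to any extra dimension (vegetarian vs. not, event, customer type): just change the grouping key.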

And here's another illustration of the same concept. Our customer service representative Alex had the foresight to ask each survey respondent whether they were vegetarian. When we average the responses introducing that dimension, the results are as follows:

The second wrong conclusion here concerns Kosala: her cuisine is actually the most popular with vegetarians! So once again, the per-chef average alone ranked Kosala as quite average and missed this critical insight (especially since many of our customers are vegetarians).

What does this mean?

In our example above, thinking about rating dishes instead of chefs, or about introducing the type of customer, can seem obvious; it serves essentially to illustrate the point. But you can see how this can become misleading in the following similar example:

"A company wants to shut down underperforming stores, so it looks at the average revenue per store." But what if the store with the lowest average revenue is the one that happens to sell most of the company's highest-margin product, or the one with the most potential for growth? This is a very similar situation to the chefs example, and I am sure many of you have encountered it in one shape or another.

B) Do I have enough independent data points? 

Let’s now have a look at the number of reviews each chef / dish received. 


A noticeable fact from the table above is that Fatima got significantly fewer reviews than her peers, actually about 100 times fewer! It turns out Fatima is our newest chef: while the other chefs have each catered typically more than 40 events, she has only catered one event, on 9/12/18!

Let's now zoom in on the event Fatima catered and look at the specific ratings:

Hmmm, the people at that event were particularly generous with their ratings, not just with Fatima but with everyone! So what say we now? Still certain Fatima is the best chef ever? Not so sure, huh? If you already promoted Fatima, why don't you call her back and say: "Hey, I'm really sorry, but this promotion was a bit rash! Your cooking is promising, but we really don't know yet whether you're a good chef; you still have many mountains to climb before we can say anything for sure. Keep up the good work for now, and let's reconvene in a few months."

It's not just the number of observations ...

You could call this "statistical significance," but it goes beyond the mere number of observations. In this particular example there was yet another dimension to the data (the event) which made the data points not independent: not only did Fatima have very few data points, they also all related to a single event.
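A toy simulation (not our actual survey data; all numbers are assumptions) shows why non-independent reviews are dangerous. Each event has a shared "mood" that shifts every rating given there, so ten reviews from one event carry far less information than ten reviews from ten events:

```python
import random
from statistics import mean

random.seed(42)

def simulate_chef(true_quality, n_events, reviews_per_event):
    """Rating = chef's true quality + an event-level 'mood' shared by all
    reviewers at that event + individual noise. Reviews from a single event
    are therefore NOT independent observations of the chef's quality."""
    ratings = []
    for _ in range(n_events):
        event_mood = random.gauss(0, 8)   # shared by everyone at the event
        for _ in range(reviews_per_event):
            ratings.append(true_quality + event_mood + random.gauss(0, 4))
    return ratings

# Veteran chef: 40 events. Newcomer: a single (possibly generous) event.
veteran = simulate_chef(true_quality=80, n_events=40, reviews_per_event=10)
newcomer = simulate_chef(true_quality=80, n_events=1, reviews_per_event=10)

# Same true quality, but the newcomer's average can land far from 80,
# because all 10 of her reviews share one event's mood.
print(round(mean(veteran), 1), round(mean(newcomer), 1))
```

Averaging over many events cancels the event moods; averaging within one event does not.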

C) How dispersed is the data?

OK, but how about Farid, Batsa, and Azar? They seem quite similar: similar average ratings overall, similar average ratings per individual dish, and all with a significant number of data points. The averages in that case must have given a fair representation, right? Let's have a look at the individual ratings and see if we missed something. The chart below shows the number of ratings in each bin for every dish (e.g., a rating between 70 and 80 is classified in the 80 bucket).

Batsa's ratings look different, don't they? They are all in the 70s and 80s buckets, whereas Farid, Azar, and Kosala all have more extremes (either lower or higher). Said differently, pretty much everyone thinks Batsa's dishes are very good; nobody thinks they're average, and nobody thinks they're exceptional. For the other chefs, some customers think their cuisine is exceptional and others think it's merely average! Here's another opportunity to slice again across all the dimensions of the data (per event, per type of customer, ...) to understand more. I will spare you the details, but essentially it turns out that Farid's and Azar's dishes get exceptional ratings from people who label themselves "adventurous eaters" and like to try different and quite exotic things, while sometimes getting low ratings from people who prefer to play it safe (you know who I'm talking about: those who only ever order a Pad Thai in a Thai restaurant and will never try anything else). Batsa's dishes, in contrast, are the "vanilla" flavor of ice cream: everyone likes them, but nobody is crazy about them! This turns out to be very valuable information, as it helps us tag our dishes "adventurous" or "crowd pleaser."

More technically, this is measured by the variance or standard deviation across each dimension (a smaller standard deviation indicating less dispersion). However, I find that visual representations like the chart above are often more compelling than standard deviation figures alone.
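Here is a small sketch of that idea, with made-up rating lists (not the real survey data): two chefs can share the exact same average while having completely different dispersion, which the average alone hides:

```python
from statistics import mean, stdev

# Hypothetical individual ratings for illustration only.
batsa = [75, 78, 80, 82, 79, 77, 81, 76]  # consistent "crowd pleaser"
farid = [96, 97, 55, 98, 52, 95, 58, 77]  # polarizing "adventurous" chef

# Nearly identical averages ...
print(round(mean(batsa), 1), round(mean(farid), 1))
# ... but very different standard deviations.
print(round(stdev(batsa), 1), round(stdev(farid), 1))
```

Both lists average 78.5, yet Farid's ratings swing between the 50s and the high 90s while Batsa's stay within a 10-point band; only the dispersion measure (or the histogram) reveals this.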


"Never cross a river that's 4 feet deep on average" ... when and how can we use averages?

One of my favorite authors, Nassim Nicholas Taleb, put it nicely when he said: "Never cross a river that is on average 4 feet deep." Essentially, he is saying that data variability can make averages useless and often lead to mistakes, as we have illustrated above.

But does that mean we should always discard the conclusions that averages give us? Let's not be that radical, shall we? Averages can be helpful when one is aware of the context. For example, if I told you that the average selling price of a 2018 Honda Odyssey EX in NY is $29,978, whereas the average selling price of the 2018 Honda Pilot EX is $39,760, that is actually useful information: it lets you conclude that the Odyssey will be cheaper and gives you an idea of what you will spend. The context of this data is:

  1. It is based on hundreds of thousands of independent observations across points of sale in the US.
  2. It is fairly specific: we have the model and the geography, and other dimensions (like color) are not likely to affect the price significantly.
  3. The data is not that dispersed: there is of course the very talented salesman who was able to sell it 5% higher, or your Iraqi friend with an ex-career trading carpets in the Baghdad market who will get it 7% cheaper for you. But overall, the data looks like a clean and fairly narrow bell curve.

But let's take another example where average prices will not be helpful. Suppose you have to advise your crazy rich Asian friend on how much to bid for a Picasso painting and a Van Gogh painting going on auction at Sotheby's. You pull out average data saying that a Picasso sold on average for ~$7M at auction and a Van Gogh for ~$13M. This, as you have guessed, is almost useless information for your friend, because:

  1. There are not that many observations of sales for Van Gogh or Picasso, and prices tend to change over the years.
  2. The artist is not the only predictor of a painting's price (a bit like the chef and the dishes above). There are other dimensions, like the condition of the painting.
  3. Most importantly, there is significant dispersion in the data, with significant extremes (>$100M for Picasso and >$80M for Van Gogh). It is far from a clean, narrow bell curve.

Another everyday example that I personally find very interesting is the restaurant 5-star rating system on Yelp or Google. With this system, every restaurant is essentially reduced to a single grade (i.e., 2.5, 3, or 4.5 on a 5-star scale). This is the perfect example of averages over-simplifying the message and erasing the idiosyncrasies of each restaurant. With this rating, "crowd-pleasers" naturally rise to the top, a bit like vanilla ice cream being more popular than green tea ice cream. But if, like my wife, you are a big fan of green tea ice cream, then you may be missing out by relying on the star system alone. The "wisdom of the crowd" is sometimes the "tyranny of the masses," and in many ways, the old-fashioned restaurant review (when written by an unbiased professional) can give a much more colored and flavorful picture of a restaurant.

That's not to say that star ratings and crowd wisdom should be neglected, but they are more appropriate for things that are homogeneous in nature: for example, business hotels, where the ratings of hygiene and comfort, which are critical, are quite universal and don't depend that much on "taste" the way restaurant food does.

As a side note, I should highlight one notable exception I have come across where the average is actually more informative than the collection of data points: the "wisdom of the crowd." If you ask people to guess how many jelly beans are in a jar, you will notice that the average of their answers is the best approximation (there will be "experts" who do better than the average, but they will not be as consistent as the average: if you repeat the experiment many times, the average will beat every individual expert). This "wisdom of the crowd" effect is a fascinating phenomenon that deserves its own separate chapter.
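A quick Monte Carlo sketch of the jelly-bean experiment (all numbers here, the jar size, crowd size, and noise model, are assumptions for illustration) shows the effect: averaging cancels individual noise, so the crowd's estimate is consistently closer to the truth than a typical individual's:

```python
import random
from statistics import mean

random.seed(0)
TRUE_COUNT = 1000            # hypothetical number of jelly beans in the jar
N_GUESSERS, N_TRIALS = 200, 500

crowd_errors, individual_errors = [], []
for _ in range(N_TRIALS):
    # Each guess is noisy but, across the crowd, unbiased on average.
    guesses = [TRUE_COUNT * random.uniform(0.5, 1.5) for _ in range(N_GUESSERS)]
    crowd_errors.append(abs(mean(guesses) - TRUE_COUNT))
    individual_errors.append(abs(guesses[0] - TRUE_COUNT))  # one person per trial

# The crowd's average error is a small fraction of an individual's.
print(round(mean(crowd_errors)), round(mean(individual_errors)))
```

The design choice worth noting: this only works because the individual errors are roughly unbiased and independent; if everyone shared the same bias (like the generous event in the Fatima example), averaging would not wash it out.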

What does this mean for data authors and data readers?

Overall, you need to adopt a skeptical attitude and get the right context. Data is tricky, so remember that any synthesis is imperfect. That can be OK, and even necessary, but awareness of it helps immensely:

  • Ask yourself (or the author) the right questions:
    • What other dimensions can I slice the data across? Is there hidden variability that I am not seeing?
    • How many observations are underneath each average, and are they independent?
    • How dispersed is the data?
  • Spend time observing the data: use the tools to slice and dice across many dimensions before calculating any averages.
  • Get some basic metrics: the number of dimensions, the number of observations per dimension, and the variance across each of these.
  • Ask yourself whether the person presenting has an implicit interest in manipulating the message.
  1. Penguin Random House's description.
  2. Data analytics tools, machine learning, and artificial intelligence can certainly dramatically increase the rate at which we can analyze and summarize data, but the question is to what extent they convey a picture that increases our understanding of it.





J Marchino, CFA

Director, Educe Analytics

6y

A very good insight on the Why and What behind the Data. In this world of Data Analytics, wonder how do we keep our common sense judgement intact, to not get fooled by the data. Explained with a very good example by Wissam.

Ismael GHOZAEL

Product Leader, Pacesetter, Freediving coach. Decathlon, PayPal, Zong, Safran, Adobe, Berkeley Haas

6y

Very nicely written Wissam. Thanks for sharing. Reminds of a quote from my former Operations Prof: “If you make decisions on averages, on average you’ll be wrong”

Frederik Bay

SVP, Enterprise Strategy

6y

Great article on the pitfalls of data analysis and tips for how to avoid them!
