Are you respecting the story in your data?
One of my fundamental philosophies is set out in my LinkedIn headline: data always tells a story.
By this, I don’t mean that you should manipulate your data to tell the story you want—although there are plenty of people prepared to do that. This is also not about how you present data science findings so that your insights are clear to others, although this may be an important use of storytelling. Instead, I mean that if we look properly at any set of data, we will find that it has its own story.
Normally, when we think about data or look at a table of raw data, the table is all we consider. We do not think about the context in which it was created, or in which it exists. However, thinking about where the data came from, and how they interact with their environment and with other variables, helps us to understand them on another level.
It is easy to focus on calculating defined KPIs, or examining an established relationship. However, the real story in data often lies in how they are connected to other, new data. We need to examine the detail of the data, but we also need to look at the bigger picture of how the data fit into their context. This new story only emerges by asking different questions.
Asking the right questions
Like many other people, I grew up reading Douglas Adams’s The Hitchhiker’s Guide to the Galaxy. I particularly enjoyed the idea of building a computer to provide the answer to the ultimate question, and then realizing that you also needed to know what the question was. However, it took me until adulthood, and work in data science, to fully understand the importance of asking the right question.
“Forty-two,” said Deep Thought, with infinite majesty and calm. - Douglas Adams, The Hitchhiker’s Guide to the Galaxy
In data science, I think there are certain questions that must be asked of any dataset, and which help it to tell its story. These include:
What are the possible biases in this information?
We need to be able to set the data in context to understand whether anything in it might bias the answer. This sounds easy on the face of it. However, we already know from a number of mishaps with computing that it is anything but simple in practice. We all have inbuilt, and often unconscious, biases that affect what we see in data, and that may mean we don’t notice an obvious problem such as a gender or racial imbalance. More diverse teams can help to overcome some of this, but it is not easy without asking the right questions.
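One simple way to make an imbalance visible is to count how each group is represented before doing anything else with the data. The sketch below is only an illustration: the records, the `gender` attribute and the 30% threshold are all made-up assumptions, and a real check would need categories and thresholds chosen for the question at hand.

```python
from collections import Counter


def group_shares(records, attribute):
    """Return the share of records falling into each value of `attribute`."""
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(shares, min_share):
    """List the groups whose share falls below `min_share`."""
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical sample: 2 records for one group, 8 for the other.
sample = [{"gender": "female"}] * 2 + [{"gender": "male"}] * 8

shares = group_shares(sample, "gender")
# The 0.3 threshold is an arbitrary choice for illustration.
flagged = underrepresented(shares, min_share=0.3)

print(shares)
print("Groups below the threshold:", flagged)
```

A count like this will not tell you why a group is missing, but it does prompt the question.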
What are the constraints on using these data?
It is not always possible to use all data in every way. Constraints such as the EU’s General Data Protection Regulation (GDPR), or other privacy regulations elsewhere in the world, may mean that we simply cannot use some data for some purposes. It is important to be aware of this before you start.
How can we enrich these data by drawing on their context?
This is perhaps, for me, the most important question. Above all, we need to set data into their context, and consider what else that tells us, or might tell us, about what we need to know. For example, we might consider a dataset on agricultural yields. We might have information about fertilizer input, seed type and irrigation. However, what about weather? If we can add weather data, that will enrich our information because it has a huge impact on crop yields. In fact, without weather data, irrigation information might actually mislead us.
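The agricultural example can be sketched in a few lines of code. Everything here is hypothetical: the field records, the rainfall figures and the column names are invented purely to show the mechanism, on the assumption that farmers irrigate more in dry seasons. Pooled across seasons, irrigation appears to go with lower yields. Once rainfall is joined in and fields are compared within the same season, the relationship turns positive.

```python
from collections import defaultdict

# Hypothetical field records: illustrative numbers, not real measurements.
field_records = [
    {"field": "A", "season": 2019, "irrigation_mm": 20, "yield_t_ha": 8.0},
    {"field": "B", "season": 2019, "irrigation_mm": 60, "yield_t_ha": 8.6},
    {"field": "A", "season": 2020, "irrigation_mm": 120, "yield_t_ha": 5.0},
    {"field": "B", "season": 2020, "irrigation_mm": 180, "yield_t_ha": 5.9},
]

# Hypothetical context data: seasonal rainfall from a separate weather source.
rainfall_by_season = {2019: 650, 2020: 250}


def covariance(xs, ys):
    """Population covariance; its sign gives the direction of the relationship."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def enrich(records, rainfall):
    """Attach rainfall to each record by joining on season."""
    return [{**r, "rainfall_mm": rainfall[r["season"]]} for r in records]


def irrigation_yield_covariance(records):
    """Covariance between irrigation and yield for a set of records."""
    return covariance(
        [r["irrigation_mm"] for r in records],
        [r["yield_t_ha"] for r in records],
    )


# Without context: all seasons pooled together.
pooled = irrigation_yield_covariance(field_records)

# With context: compare only fields that saw the same rainfall.
groups = defaultdict(list)
for record in enrich(field_records, rainfall_by_season):
    groups[record["rainfall_mm"]].append(record)
within = {rain: irrigation_yield_covariance(rs) for rain, rs in groups.items()}

print("Pooled covariance:", pooled)
print("Covariance within each rainfall group:", within)
```

With these made-up numbers the pooled covariance is negative while the within-season covariances are positive, which is the sense in which irrigation data alone could mislead. Four records prove nothing, of course. The point is that the join on season is what allows the question to be asked at all.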
Making links and gaining insight
One of my favorite examples of using this kind of context is a map that hangs on my wall. It looks a bit like the schematic maps of the London Underground, but maps the big trends and sub-trends in the world. The idea is to show how they might relate, across sectors, industries and countries. For example, one link is between health and self-tracking, because many of us wear something like an Apple Watch or a Fitbit, which helps us to track our own health. However, there are many other trends linked to self-tracking, such as security or safety for older people or people with dementia. This might help us to ask how security and health might be connected.
Seeing trends mapped in this way makes it easier to visualize where ideas might run in parallel or intersect. This, in turn, makes it easier to ask the right questions, and understand the story in the data.
https://www.zukunftsinstitut.de/documents/downloads/MegatrendMapZukunftsinstitut_120918.pdf
Principal Data Scientist, Customer Advisory, SAS Institute
3 years ago: Great post, David! We always forget these rules in our daily work under time pressure.
Advanced Analytics and Artificial Intelligence Advisor
3 years ago: Great read. Thanks, David. A holistic approach to data is so much more valuable.
Solutions Engineering Manager at Forethought
3 years ago: Thanks for your insightful thoughts, David Weik. Loved the reference to Deep Thought. :)
AI Innovation Lead & Senior Solution Engineer, Salesforce | Board Member at BI & Analytics, Dataforeningen
3 years ago: Love the reference to The Hitchhiker’s Guide to the Galaxy and the focus on bias in our data. Interpretability is an important factor here too. Once we make use of the data, we need to be able to interpret and explain what we have found.
Analytics at AWS
3 years ago: Loved the idea that each dataset comes with a unique story! Nice read, David - thanks for sharing.