Are you respecting the story in your data?

David Weik

Data always tells a Story | Sr Software Development Engineer @ SAS R&D

Published: April 27, 2021

One of my fundamental philosophies is set out in my LinkedIn headline: data always tells a story.

By this, I don’t mean that you should manipulate your data to tell the story you want—although there are plenty of people prepared to do that. This is also not about how you present data science findings so that your insights are clear to others, although this may be an important use of storytelling. Instead, I mean that if we look properly at any set of data, we will find that it has its own story.

Normally when we think about data, or look at a table of raw data, that is all we use. We do not think about the table in the context in which it was created, or in which it exists. However, thinking about where the data came from, and how they interact with their environment and with other variables, helps us to understand the data on another level.

It is easy to focus on calculating defined KPIs, or examining an established relationship. However, the real story in data often lies in how they are connected to other, new data. We need to examine the detail of the data, but we also need to look at the bigger picture of how the data fit into their context. This new story only emerges by asking different questions.

Asking the right questions

Like many other people, I grew up reading Douglas Adams’s The Hitchhiker’s Guide to the Galaxy. I particularly enjoyed the idea of building a computer to provide the answer to the ultimate question of life, the universe and everything, and then realizing that you also needed to know the question. However, it took me until adulthood, and work in data science, to fully understand the importance of asking the right question.

“Forty-two,” said Deep Thought, with infinite majesty and calm. (Douglas Adams, The Hitchhiker’s Guide to the Galaxy)

In data science, I think there are certain questions that must be asked of any data set, and which help it to tell its story. These include:

What are the possible biases in this information?

We need to be able to set the data in context to understand whether anything in them might bias the answer. This sounds easy. However, we already know from a number of mishaps in computing that it is anything but simple in practice. We all have inbuilt, and often unconscious, biases that affect what we see in data, and these may mean that we don’t notice an obvious problem such as a gender or racial imbalance. More diverse teams can help to overcome some of this, but it is not easy without asking the right questions.
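
One simple way to make an imbalance harder to overlook is to count how each group is represented before doing any analysis. The following is only a minimal sketch: the function names, the sample records and the 30% threshold are all made up for illustration, and a sensible threshold depends entirely on the population the data are meant to represent.

```python
from collections import Counter


def group_shares(records, field):
    """Return each value's share of the records for the given field."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def underrepresented(shares, min_share=0.3):
    """List groups whose share falls below an illustrative threshold."""
    return sorted(group for group, share in shares.items() if share < min_share)


# Made-up sample, purely for illustration.
sample = [
    {"id": 1, "gender": "female"},
    {"id": 2, "gender": "male"},
    {"id": 3, "gender": "male"},
    {"id": 4, "gender": "male"},
    {"id": 5, "gender": "male"},
]

shares = group_shares(sample, "gender")
flagged = underrepresented(shares)
print(shares)
print(flagged)
```

A check like this does not tell us whether the imbalance matters; it only puts the question in front of us so that we can ask it.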

What are the constraints on using these data?

It is not always possible to use all data in every way. Constraints such as the General Data Protection Regulation, or other privacy regulations elsewhere in the world, may mean that we simply cannot use some data for some purposes. It is important to be aware of this before you start.

How can we enrich these data by drawing on their context?

This is perhaps, for me, the most important question. Above all, we need to set data into their context, and consider what else that tells us, or might tell us, about what we need to know. For example, we might consider a dataset on agricultural yields. We might have information about fertilizer input, seed type and irrigation. However, what about weather? If we can add weather data, that will enrich our information because weather has a huge impact on crop yields. In fact, without weather data, irrigation information might actually mislead us. Fields may be irrigated most heavily in the driest years, so heavy irrigation can appear to go hand in hand with poor yields unless rainfall is taken into account.
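
As a rough sketch of what enriching a dataset can look like in practice, the example below attaches rainfall to yield records by matching on region and year. Every field name and number here is invented for illustration; real yield and weather data would need their own matching keys and units.

```python
def enrich_with_weather(yields, weather):
    """Attach rainfall to each yield record by matching region and year."""
    rainfall = {(w["region"], w["year"]): w["rainfall_mm"] for w in weather}
    enriched = []
    for row in yields:
        key = (row["region"], row["year"])
        # None marks records with no matching weather observation.
        enriched.append({**row, "rainfall_mm": rainfall.get(key)})
    return enriched


# Hypothetical records, not real measurements.
yields = [
    {"region": "north", "year": 2019, "irrigation_mm": 300, "yield_t_ha": 5.1},
    {"region": "north", "year": 2020, "irrigation_mm": 120, "yield_t_ha": 6.4},
    {"region": "south", "year": 2020, "irrigation_mm": 200, "yield_t_ha": 5.8},
]
weather = [
    {"region": "north", "year": 2019, "rainfall_mm": 210},
    {"region": "north", "year": 2020, "rainfall_mm": 540},
]

enriched = enrich_with_weather(yields, weather)
for row in enriched:
    print(row)
```

In this invented sample, the northern field received more irrigation in 2019 yet produced less than in 2020, which looks odd until the rainfall column shows that 2019 was the drier year. Keeping unmatched records with a missing value, rather than dropping them, also makes gaps in the context visible.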

Making links and gaining insight

One of my favorite examples of using this kind of context is a map that hangs on my wall. It looks a bit like the schematic maps of the London Underground, but maps the big trends and sub-trends in the world. The idea is to show how they might relate, across sectors, industries and countries. For example, one link is between health and self-tracking, because many of us wear something like an Apple Watch or a Fitbit, which helps us to track our own health. However, there are many other trends linked to self-tracking, such as security or safety for older people or people with dementia. This might help us to ask how security and health might be connected.

Seeing trends mapped in this way makes it easier to visualize where ideas might run in parallel or intersect. This, in turn, makes it easier to ask the right questions, and understand the story in the data.

https://www.zukunftsinstitut.de/documents/downloads/MegatrendMapZukunftsinstitut_120918.pdf

Ulrich Reincke

Principal Data Scientist, Customer Advisory, SAS Institute

3 years ago

Great post David! We always forget to remember these rules in our daily work under time pressure

Kelly Lu Murray

Advanced Analytics and Artificial Intelligence Advisor

3 years ago

Great read. Thanks David. An holistic approach to data is so much more valuable.

Yen Nguyen

Solutions Engineering Manager at Forethought

3 years ago

Thanks for your insightful thoughts David Weik. Loved the reference to Deep Thought. :)

Ina Conrado

AI Innovation Lead & Senior Solution Engineer, Salesforce | Board Member at BI & Analytics, Dataforeningen

3 years ago

Love the reference to Hitchhikers Guide to the Galaxy and the focus on bias in our data. Interpretability is an important factor here too. Once we make use of the data we need to be able to interpret and explain what we have found.

Spiros Potamitis

Analytics at AWS

3 years ago

Loved the idea that each dataset comes with a unique story! Nice read David - thanks for sharing.
