The data analyst as a source of variability

I love this quote about science from Nassim Nicholas Taleb's book Fooled by Randomness:

"Science is great, but individual scientists are dangerous."

Taleb goes on to explain that while science is objective, scientists are human and carry their own biases. The quote sums up perfectly how you might feel after reading a news story about a researcher falsifying data or engaging in irresponsible research conduct.

Although less publicized, I have heard plenty of stories from companies where employees have "fudged the numbers" to get a project done quickly.

But these are extreme examples of humans engaging in poor data practices. Most analysts do not falsify data; they take pride in their work and the various products that result from it.

I've been thinking about how we often don't consider the data analyst as a source of variability in our results. We often think that the testing of a hypothesis can only show one result. Yet when multiple people analyze the same problem with the same dataset, differences are apparent. Each analyst brings their own set of experiences, skills, and expertise, which shape how they interpret the data and its results. Collectively, this is known as the “many-analysts” problem.

I ran into the many-analysts problem myself recently. A colleague and client asked why I was using 120 sample plots after filtering a dataset, while a previous analysis used 119 sample plots. It’s the same dataset after all! It was difficult to give a specific answer, other than that the two analyses were done by two different people, each with their own scripts. (And there was a lot of filtering and querying to obtain the dataset of interest.)
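
As an illustration only, here is one hypothetical way a one-plot gap like this can arise: two scripts apply the "same" filter but treat the boundary value differently. This is not the actual analysis. The field name `basal_area`, the threshold, and the plot records are all invented for the sketch.

```python
# Hypothetical plot records; plot IDs and basal_area values are invented.
plots = [{"plot_id": i, "basal_area": 10.0 + 0.5 * i} for i in range(1, 131)]

THRESHOLD = 15.5  # hypothetical minimum basal area for a plot to qualify


def keep_at_or_above(records, threshold):
    """Analyst A: a plot exactly at the threshold qualifies."""
    return [r for r in records if r["basal_area"] >= threshold]


def keep_strictly_above(records, threshold):
    """Analyst B: a plot exactly at the threshold does not qualify."""
    return [r for r in records if r["basal_area"] > threshold]


kept_a = keep_at_or_above(plots, THRESHOLD)
kept_b = keep_strictly_above(plots, THRESHOLD)

# Compare plot IDs to find exactly which records the two scripts disagree on.
ids_a = {r["plot_id"] for r in kept_a}
ids_b = {r["plot_id"] for r in kept_b}
only_in_a = sorted(ids_a - ids_b)

print(len(kept_a), len(kept_b), only_in_a)  # prints: 120 119 [11]
```

Comparing the sets of retained IDs, rather than just the counts, is what turns "it was two different people" into a specific answer about which record differs and why.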

We recognize that data is a source of variability, but we often overlook the people analyzing it. We can all agree that there is a science to working with data, but there’s an art to it, too.

To begin with, it is a good idea to have multiple approaches to address a problem, as analysts bring diverse experiences and solutions to tackling a data problem. There are a number of “itty-bitty” choices made throughout an analytical workflow, many of which occur when tidying the data. A common choice is how to treat outliers in a dataset, i.e., delete, keep, or modify them. Another choice is which analytical method to use and the assumptions behind it. For example, a traditional hypothesis test comparing two datasets may give a different answer than a machine learning model applied to the same comparison (which is likely nonparametric in design).
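
To make the outlier choice concrete, here is a minimal sketch of how the three treatments (keep, delete, modify) move a simple summary statistic. The measurements and the cutoff `CAP` are hypothetical, and capping at a fixed value is only one of many ways to "modify" an outlier.

```python
from statistics import mean

# Hypothetical measurements with one suspiciously large value.
values = [12.0, 14.0, 15.0, 13.0, 16.0, 14.0, 90.0]
CAP = 30.0  # hypothetical cutoff above which a value is treated as an outlier

treatments = {
    "keep": values,                              # leave the data untouched
    "delete": [v for v in values if v <= CAP],   # drop values above the cutoff
    "modify": [min(v, CAP) for v in values],     # cap values at the cutoff
}

# Same data, three defensible choices, three different means.
summary = {name: round(mean(data), 2) for name, data in treatments.items()}
print(summary)  # prints: {'keep': 24.86, 'delete': 14.0, 'modify': 16.29}
```

None of the three is wrong on its face, which is the point: the choice belongs in the documentation alongside the result.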

An excellent paper published last year by Kummerfeld and Jones describes some of the pitfalls of the many-analysts problem and how to mitigate them. First, they recommend having a clear overarching question. For analysts, it’s best to write this question down so that you, the decision makers, and other stakeholders are on the same page.

Second, the authors propose having a clear understanding of how the overarching question will be analyzed. This is where the data and statistical methods come in.

Finally, they find that a lack of expertise on a team can lead to mistakes that produce varying results. A well-structured team with input from the individuals who collect, analyze, and make decisions on the data and results can help to broaden the expertise on a project.

I encourage you to think about how the data analyst is a source of variability in your own analyses and workflows. Too often we separate the people from the data, but understanding how the two interact can yield more insights for you and your organization.

Sarah J Satre

Freelance | Projects that make a difference

10 months ago

Valid point!

Jon Lunsford

Forest Manager | GIS Analyst

10 months ago

Great info Matt Russell. Data analysts should document their work thoroughly so others can verify and/or replicate if needed. Also beneficial several months later when you have forgotten what you did and how. Jupyter Notebooks are a great way to document and show your work.

Taylor Wilson, ACF, CF

Owner | Forester | Data and Financial Analyst

10 months ago

Love the statement you wrote about first having a clear overarching question you want answered before trying to dive into the dataset. Without this it is too easy to get sucked into the data and miss your original purpose for starting the project/analysis, etc.

Micky Allen

Forest Biometrician | Podcaster | Dad

10 months ago

This is a great thought, Matt! I’ve recently had to explain quite a lot that given the same set of assumptions and data, no two analysts would arrive at the exact same place when using FVS. But that could be expanded to almost any set of data/model.
