Pitfalls to avoid during the Data Science Journey
Pitfalls to avoid fig 1

I have worked in data science for more than a decade and would like to share with aspiring data scientists what I have learned about pitfalls, that is, what not to do. I will publish a series of five posts on conceptual and technical pitfalls every data scientist should avoid. This is the first post, and before you read it, please note that all findings mentioned here are based purely on my own experience.

This first article focuses on understanding the problem statement.

First pitfall: As soon as the business or end user shares the problem statement, the data scientist jumps straight to understanding the data required to run the machine learning model

We have spent so much time abstracting the algorithm from the data that we have come to believe the first step should be analyzing the data to confirm the model's feasibility. I believe this also gives data scientists some confidence about the data and the domain. However, I believe data scientists should first invest more time in understanding the latent need of the business.

Here is an example:

I remember that early in my career, an end user asked me to predict the product satisfaction score and its drivers based on survey data. Instead of starting with data exploration, I asked him some more questions to understand the need behind the data, for example:

1) How do you use the survey data today?

2) What insights would you like to get out of the survey data?

3) Do you have any hidden interest in understanding customers?

4) What are the basic assumptions of the business about the customers based on survey data?

After a series of meetings and deeper conversations, I understood that the end user was looking at the survey data to find out why one of the newly launched products was not selling as expected, and how to resolve that so investment could go to the right areas to increase product sales.

This understanding completely changed the way we looked at the data. It led us to look for the challenges with the newly launched product in the survey data, rather than predicting the product satisfaction score and its drivers.

After understanding the problem statement, the data scientists felt confident looking for the right data sources and success metrics.

Summary: Data scientists should spend more time understanding the latent need of the business before exploring the data.

Second pitfall: Not involving the domain expert

Most data scientists believe they can extract and understand the data without a domain expert, and produce initial insights to solve the problem. However, they forget that business or domain experts have been working with the data for years and know the meaning of each attribute in it.

Here is an example

When I was working in the consulting industry, one client asked us to help them reduce returns, as each return incurred a high cost and had a huge effect on margin and brand image. One of my data scientists got excited. The data scientist extracted all the sales data and created basic correlation metrics, a data summary, and a deep analysis of the return types across years, then invested another three days in creating a discovery deck. Only then did the data scientist connect with one of the domain experts on the client's side and share the deck with him. The domain expert looked at the first two slides and realized that the return types used in the analysis were not the real ones, but proxy values that end users enter when they do not know the real reason for a return. The entire analysis was rejected after those first two slides.

Summary: Doing even basic exploration of data without involving domain experts will only result in understanding the basic stats of the data, but not the meaning of the data.
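
A quick sanity check of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the placeholder codes ("OTHER", "UNKNOWN") and the sample values are hypothetical, not taken from the client's data. In practice the list of placeholder codes would come from the domain expert.

```python
from collections import Counter

# Hypothetical placeholder codes, as a domain expert might flag them
# (assumption for illustration; not from the original project).
PROXY_RETURN_CODES = {"OTHER", "UNKNOWN"}


def proxy_share(return_codes, proxy_codes=PROXY_RETURN_CODES):
    """Return the fraction of records whose return reason is a placeholder code."""
    # Normalize case and whitespace so "other" and " OTHER " count together.
    counts = Counter(code.strip().upper() for code in return_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    proxy_total = sum(n for code, n in counts.items() if code in proxy_codes)
    return proxy_total / total


# Made-up sample of return reasons.
sample = ["DAMAGED", "other", "OTHER", "WRONG_SIZE", "Unknown"]
print(f"{proxy_share(sample):.0%} of returns carry a placeholder reason")
```

If a large share of records carries a placeholder reason, that is a signal to talk to the domain expert before building any analysis on that attribute.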

Third pitfall: Not looking at the existing solution and baseline

One of the most exciting parts of a data science project is solving the problem and seeing its impact on the business or business process. However, data scientists often forget that the business team already has an existing workaround and that someone in the business knows how much improvement is possible.

A data scientist must spend a few days looking at the existing solution, research articles, and the current workaround from the business to understand the baseline and jump-start the insight discovery journey.

Here is an example

I had just been onboarded to a new domain, optics manufacturing, and my responsibility was to work with the sales rep, regional head, and data scientist to build a solution that could provide a realistic prediction for the upcoming quarter. As usual, as soon as we saw the data, the data scientist jumped to building a time-series-based forecast. However, we soon realized we could not predict any of the quarters in the test data set, which forced us to re-examine the problem. We then found that there was an existing solution that looked at different dimensions of orders and customers: customer type, order cycles, order size, product types, seasonality, and customer recency. It helped us understand the features and the baseline well.

Sometimes the domain experts and data experts are the same people, but most of the time I have seen that the domain experts are the ones who understand the context of the data and its use, while the data experts are the ones who know the data collection process and the real meaning of the attributes.

Summary: Data scientists should invest time in looking at the existing solution or research articles to understand the baseline, different approaches, and minimum features related to the problem.
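
To make the idea of a baseline concrete, here is a minimal sketch of scoring two naive forecasts with a walk-forward backtest. It is not the solution from the project described above: the quarterly order numbers are made up, and the function names (seasonal_naive, last_value, backtest) are hypothetical. Any new model would need to beat numbers like these, or the business's existing solution, before it is worth pursuing.

```python
def seasonal_naive(history, season_length=4):
    """Forecast the next value as the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    return history[-season_length]


def last_value(history):
    """Forecast the next value as the most recent observed value."""
    return history[-1]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def backtest(series, forecaster, start):
    """Walk forward from index `start`, forecasting each point from the data before it."""
    actual = series[start:]
    predicted = [forecaster(series[:i]) for i in range(start, len(series))]
    return mean_absolute_error(actual, predicted)


# Made-up quarterly order volumes covering two years (illustrative only).
orders = [100, 120, 90, 150, 110, 130, 95, 160]

seasonal_mae = backtest(orders, seasonal_naive, start=4)
last_value_mae = backtest(orders, last_value, start=4)
print(f"seasonal naive MAE: {seasonal_mae}")
print(f"last value MAE:     {last_value_mae}")
```

On this made-up series the seasonal baseline has the lower error (8.75 versus 40.0), which hints that seasonality matters, one of the dimensions the existing solution already considered.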

There are many other pitfalls, but the ones above are the major ones when understanding the problem statement. Some of the others are:

1) Talking about scaling in the first meeting

2) Not defining the right cost metrics

3) Assuming the existing framework will

Let me know what you think. Please add your experience or any points you believe will help aspiring data scientists succeed while creating solutions.

Thanks to Serg Masis and Ernesto Martinez for collaborating with me to publish this article.

Raza Sheikh

Data & Digital Architect | Consultant

1 year ago

Thank you for sharing, Harish!

Nitin Dharmadhikari

Engineering Manager, John Deere | Data Analytics | Product Development | IT Portfolio Manager | EX-GE | IIMC | IITM

2 years ago

Very well captured Harish. This is quite useful for analytics practitioners like me. Thanks for sharing.

Alissa Kriss, PhD

Digital Biological Assessment Lead at Syngenta

2 years ago

Wonderful content Harish! This is valuable not only for aspiring Data Scientists, but also for Leaders looking to form new Data Science teams, as a key message I see is that those teams need a close connection to ‘the business’ and to domain and data-in-domain experts.

Monika Trehan

Business Development, Corporate Growth Strategy, Innovation & Alliances, Digital Transformation

2 years ago

Insightful!

Amar Singh

Thermal Expert (20+ Years) with passion for EVs | Electric Powertrain | Battery & Emachine Thermal Management | Computational Fluid Dynamics (CFD)

2 years ago

Thank you Harish for sharing these valuable insights.
