Exploring the Data Science Life Cycle: Understanding the Data Science Process

A project lifecycle can be a useful tool for structuring the process that a team follows. (A lifecycle is a repeating series of steps taken to develop a product, solve a problem, or engage in continuous improvement.) It functions as a high-level map to keep teams moving in the right direction. Although data science teams are less goal-oriented than more traditional teams, they too can benefit from the direction provided by a project lifecycle. However, traditional project lifecycles are not conducive to the work of data science teams.

In this newsletter, I discuss two more traditional project lifecycles and explain why they are a poor fit for data science "projects." I then present a data science life cycle that is more conducive to the exploratory nature of data science.

The Software Development Life Cycle (SDLC)

The software development lifecycle (SDLC) has six phases, as shown below; under each phase is an example of an activity that occurs during that phase. SDLC is typically called the waterfall model because each phase must be complete before the next can begin.

SDLC works well for software development because these projects have a clearly defined scope (requirements), a relatively linear process, and a tangible deliverable — the software. However, this same lifecycle is poorly suited for data science, which has a very broad scope, a creative and often chaotic process, and a relatively intangible deliverable — knowledge and insight.

The Cross Industry Standard Process for Data Mining (CRISP-DM)


The Cross Industry Standard Process for Data Mining (CRISP-DM) lifecycle, which is used for data instead of software, is considerably more flexible than the waterfall model. It also has six phases, as shown below. The various phases aren't necessarily sequential, and the process continues after deployment, because learning sparks more questions that require further analysis.

CRISP-DM works much better for data science than SDLC does, but, like SDLC, it is still designed for big-bang delivery: deployment. With either model, the data science team is expected to spend considerable time in the early stages, whether planning and analyzing (for software development) or business understanding (for data mining). The goal is to gather as much information as possible at the start. The team is then expected to deliver the goods at the end.

For a data science team to be flexible and exploratory, it can't be forced to adopt a standard lifecycle. A more lightweight approach is needed, one that provides structure while still allowing the team to shift direction when appropriate.

The Data Science Life Cycle (DSLC)


The fact that traditional project lifecycles are not a good match for data science doesn't mean that data science teams should have complete operational freedom. A lifecycle is still valuable for structuring the team's activities. With a general sense of the path forward, the team at least has a starting point and some procedures to follow. A good lifecycle is like a handrail; it's there to provide support, but it's not something you need to cling to.

The approach that seems to work best for data science teams is the data science life cycle (DSLC), as shown below. This process framework, based loosely on the scientific method, is lightweight and less rigid than SDLC and CRISP-DM.

Like the two project life cycles presented earlier in this post, DSLC consists of six stages:

  1. Identify: the roles or key players, such as customers, suppliers, or vendors.
  2. Question: the data. In other words, ask questions about the key players; for example: Which influencers are most responsible for persuading others to purchase our products? or What customer behaviors predict a probable purchase?
  3. Research: the data to find answers to the questions or to challenge any assumptions about the players and their characteristics, circumstances, or behaviors. Research may, for example, focus on correlation or cause and effect.
  4. Results: Create your initial reports to communicate and discuss early findings with the team. These quick-and-dirty reports, shared only among team members and perhaps a few others involved in the project, may trigger additional questions and research or even convince the team to change direction.
  5. Insight: After several rounds of questioning the data, researching, and reporting, the team steps back to identify any insights it gained from the process.
  6. Learn: Bundle the team's insights to create a body of organizational knowledge. It is at this point that the team develops a story to tell and uses data visualizations to support it. This new knowledge is what really adds value to the rest of the organization. If you tell a compelling story, it may change the organization's overall strategy or the way it conducts business.

Looping through Questions


DSLC isn't always or even usually a linear, step-by-step process. The data science team should cycle through the questions, research, and results, as shown below, whenever necessary to gain clarity.

Some organizations that have strong data science teams already follow this approach. For example, the data science team at the video subscription service Netflix looked at what customers were watching, how they rated shows, which plots viewers liked, and which actors were popular.

The Netflix team used data science to develop ideas for new shows. They created a predictive model based on an analysis of viewer demand, cycled through questions, research, and results, and then crafted stories about what their customers would want to see.

This cycle of question, research, and results drives insights and knowledge. The data science team loops through these areas as part of the larger DSLC. Remember not to think of this lifecycle as a waterfall process. Instead, think of it as a few steps to start and then a cycle in the middle that churns out great stories at the end.

Frequently Asked Questions

What is the data science life cycle?

The data science life cycle is a structured approach to solving data-related problems by following a series of steps. This includes stages like data collection, data preparation, data exploration, data modeling, and data validation. Each phase in the data science lifecycle is crucial for building a successful data science product.

What are the primary steps of a data science project?

The primary steps of a data science project typically include problem definition, data collection, data cleaning and preprocessing, exploratory data analysis, building a data model, model evaluation, and deployment. These steps of the data science project help ensure that relevant data is captured and analyzed properly.

How does a data scientist collect data for a project?

A data scientist collects data from various sources such as databases, APIs, web scraping, and more. The collected data then undergoes data cleaning to remove any noise, inconsistencies, and to prepare the data for further analysis.
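As a minimal sketch of this collection-and-cleaning step, assuming pandas is available, the hypothetical inline CSV below stands in for data pulled from an API, a database, or a scraped page:

```python
import io

import pandas as pd

# Hypothetical raw data; in practice this text would come from an API
# response, a database query, or a scraped page.
raw_csv = """customer_id,age,last_purchase
101,34,2024-01-15
102,,2024-02-03
103,51,not_available
"""

df = pd.read_csv(io.StringIO(raw_csv))

# Basic cleaning: coerce unparseable dates to missing values (NaT)
# and count how many ages are missing.
df["last_purchase"] = pd.to_datetime(df["last_purchase"], errors="coerce")
missing_ages = int(df["age"].isna().sum())
```

The `errors="coerce"` option turns noisy values like `not_available` into missing markers rather than raising an error, which keeps the pipeline moving during early exploration.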

What is the role of exploratory data analysis in the data science process?

Exploratory data analysis (EDA) is a phase in the data science process where data scientists analyze datasets to summarize their main characteristics. It helps in understanding data distributions, identifying patterns, spotting anomalies, and testing hypotheses using statistical and graphical techniques.
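A minimal EDA sketch with pandas, using made-up per-customer numbers, might summarize the dataset, check a correlation, and flag potential anomalies with the interquartile range:

```python
import pandas as pd

# Made-up data: per-customer spend and site visits.
df = pd.DataFrame({
    "spend": [120, 80, 95, 300, 110, 90],
    "visits": [4, 2, 3, 9, 4, 3],
})

summary = df.describe()                # central tendency and spread
corr = df["spend"].corr(df["visits"])  # pairwise correlation

# Flag values far above the interquartile range as potential anomalies.
q1, q3 = df["spend"].quantile([0.25, 0.75])
outliers = df[df["spend"] > q3 + 1.5 * (q3 - q1)]
```

Here the 300 spend stands out as an outlier, and the strong spend-visits correlation is exactly the kind of pattern EDA is meant to surface for further questioning.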

How important is data preparation in the data science life cycle?

Data preparation is one of the most critical steps in the data science life cycle. It involves data cleaning and transforming raw data into a format suitable for analysis. Without data preparation, the quality of the data model and subsequent results can be severely impacted.
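As a small illustration of data preparation, again assuming pandas and purely hypothetical values, a numeric column stored as text can be converted and its gaps filled with simple statistics:

```python
import pandas as pd

# Hypothetical raw records: a numeric column stored as text,
# with missing values in both columns.
raw = pd.DataFrame({
    "price": ["19.99", "5.50", None, "7.25"],
    "category": ["a", "b", "b", None],
})

# Convert text to numbers, then fill the missing price with the median.
raw["price"] = pd.to_numeric(raw["price"])
raw["price"] = raw["price"].fillna(raw["price"].median())

# Fill the missing categorical value with the most frequent label.
raw["category"] = raw["category"].fillna(raw["category"].mode()[0])
```

Median and mode imputation are the simplest of many possible strategies; the right choice depends on the data and the model that consumes it.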

What are common challenges faced during the data investigation phase?

Common challenges in the data investigation phase include dealing with incomplete or missing data, handling data drift, and ensuring data quality. A data scientist must employ various techniques to address these issues to obtain reliable results.

What tools are commonly used for data exploration and preparation?

Popular tools for data exploration and preparation include Python libraries like Pandas, NumPy, and Scikit-learn. Data science with Python is particularly popular due to its extensive range of libraries that support data manipulation, visualization, and machine learning.

What is the importance of building a data model in data science?

Building a data model is essential in data science as it helps predict outcomes and generate insights from data. A well-built data model can transform raw data into actionable information, aiding decision-making and strategy formulation. Data models are crucial for turning historical data and new data into practical solutions.
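A minimal modeling sketch with scikit-learn, using invented spend-versus-sales numbers, shows the fit-then-predict pattern that turns historical data into a prediction about new data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented historical data: monthly marketing spend vs. resulting sales.
spend = np.array([[1.0], [2.0], [3.0], [4.0]])
sales = np.array([10.0, 20.0, 30.0, 40.0])

model = LinearRegression().fit(spend, sales)

# Predict sales at a new, unseen spend level.
predicted = model.predict(np.array([[5.0]]))
```

Real projects involve many features and more sophisticated models, but the pattern is the same: fit on historical data, then predict on new data.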

How does a data science professional validate a data model?

A data science professional validates a data model by splitting the data into training and testing sets, employing cross-validation techniques, and using performance metrics such as accuracy, precision, recall, and F1-score. This ensures that the data model performs well on unseen data and is robust.
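A sketch of that validation workflow with scikit-learn, using a synthetic dataset in place of real project data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic dataset standing in for real project data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)  # overall correctness
f1 = f1_score(y_test, y_pred)         # balance of precision and recall

# Cross-validation gives a more stable estimate than a single split.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
```

Scoring on held-out data, rather than the training set, is what guards against mistaking memorization for genuine predictive power.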


This is my weekly newsletter that I call The Deep End because I want to go deeper than results you'll see from searches or AI, incorporating insights from the history of data and data science. Each week I'll go deep to explain a topic that's relevant to people who work with technology. I'll be posting about artificial intelligence, data science, and data ethics.

This newsletter is 100% human written (aside from a quick run through grammar and spell check).

