Exploring the Data Science Life Cycle: Understanding the Data Science Process

A project lifecycle can be a useful tool for structuring the process that a team follows. (A lifecycle is a repeating series of steps taken to develop a product, solve a problem, or engage in continuous improvement.) It functions as a high-level map to keep teams moving in the right direction. Although data science teams are less goal-oriented than more traditional teams, they too can benefit from the direction provided by a project lifecycle. However, traditional project lifecycles are not conducive to the work of data science teams.

In this newsletter, I discuss two more traditional project lifecycles and explain why they are a poor fit for data science "projects." I then present a data science life cycle that is more conducive to the exploratory nature of data science.

The Software Development Life Cycle (SDLC)

The software development lifecycle (SDLC) has six phases, as shown below; under each phase is an example of an activity that occurs during that phase. SDLC is typically called the waterfall model because each phase must be complete before the next can begin.

SDLC works well for software development because these projects have a clearly defined scope (requirements), a relatively linear process, and a tangible deliverable — the software. However, this same lifecycle is poorly suited for data science, which has a very broad scope, a creative and often chaotic process, and a relatively intangible deliverable — knowledge and insight.

The Cross Industry Standard Process for Data Mining (CRISP-DM)


The Cross Industry Standard Process for Data Mining (CRISP-DM) lifecycle, which is used for data instead of software, is considerably more flexible than the waterfall model. It also has six phases, as shown below. The various phases aren't necessarily sequential, and the process continues after deployment, because learning sparks more questions that require further analysis.

CRISP-DM works much better for data science than SDLC does, but, like SDLC, it is still designed for big-bang delivery: deployment. With either model, the data science team is expected to spend considerable time in the early stages, whether planning and analyzing (for software development) or business understanding (for data mining). The goal is to gather as much information as possible at the start. The team is then expected to deliver the goods at the end.

For a data science team to be flexible and exploratory, it can't be forced to adopt a standard lifecycle. A more lightweight approach is needed, one that provides structure while still allowing the team to shift direction when appropriate.

The Data Science Life Cycle (DSLC)


The fact that traditional project lifecycles are not a good match for data science doesn't mean that data science teams should have complete operational freedom. A lifecycle is still valuable for structuring the team's activities. With a general sense of the path forward, the team at least has a starting point and some procedures to follow. A good lifecycle is like a handrail; it's there to provide support, but it's not something you need to cling to.

The approach that seems to work best for data science teams is the data science life cycle (DSLC), as shown below. This process framework, based loosely on the scientific method, is lightweight and less rigid than SDLC and CRISP-DM.

Like the two project life cycles presented earlier in this post, DSLC consists of six stages:

  1. Identify: the roles or key players, such as customers, suppliers, or vendors.
  2. Question: the data. In other words, ask questions about the key players; for example: Which influencers are most responsible for persuading others to purchase our products? or What customer behaviors predict a probable purchase?
  3. Research: the data to find answers to the questions or to challenge any assumptions about the players and their characteristics, circumstances, or behaviors. Research may, for example, focus on correlation or cause and effect.
  4. Results: Create your initial reports to communicate and discuss early findings with the team. These quick-and-dirty reports, shared only among team members and perhaps a few others involved in the project, may trigger additional questions and research or even convince the team to change direction.
  5. Insight: After several rounds of questioning the data, researching, and reporting, the team steps back to identify any insights it gained from the process.
  6. Learn: Bundle the team's insights to create a body of organizational knowledge. It is at this point that the team develops a story to tell and uses data visualizations to support it. This new knowledge is what really adds value to the rest of the organization. If you tell a compelling story, it may change the organization's overall strategy or the way it conducts business.

Looping through Questions


DSLC isn't always or even usually a linear, step-by-step process. The data science team should cycle through the questions, research, and results, as shown below, whenever necessary to gain clarity.

Some organizations that have strong data science teams already follow this approach. For example, the data science team at the video subscription service Netflix looked at what customers were watching, how they rated shows, which plots viewers liked, and which actors were popular.

The Netflix team used data science to develop ideas for new shows. They created a predictive model based on an analysis of viewer demand, cycled through questions, research, and results, and then crafted stories about what their customers would want to see.

This cycle of question, research, and results drives insights and knowledge. The data science team loops through these areas as part of the larger DSLC. Remember not to think of this lifecycle as a waterfall process. Instead, think of it as a few steps to start and then a cycle in the middle that churns out great stories at the end.

Frequently Asked Questions

What is the data science life cycle?

The data science life cycle is a structured approach to solving data-related problems by following a series of steps. This includes stages like data collection, data preparation, data exploration, data modeling, and data validation. Each phase in the data science lifecycle is crucial for building a successful data science product.

What are the primary steps of a data science project?

The primary steps of a data science project typically include problem definition, data collection, data cleaning and preprocessing, exploratory data analysis, building a data model, model evaluation, and deployment. These steps of the data science project help ensure that relevant data is captured and analyzed properly.

How does a data scientist collect data for a project?

A data scientist collects data from various sources such as databases, APIs, web scraping, and more. The collected data then undergoes data cleaning to remove any noise, inconsistencies, and to prepare the data for further analysis.
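As a minimal sketch of this collection-and-cleaning step, assuming pandas is available, the hypothetical inline CSV below stands in for data pulled from an API, a database, or a scraped page:

```python
import io

import pandas as pd

# Hypothetical raw data; in practice this text would come from an API
# response, a database query, or a scraped page.
raw_csv = """customer_id,age,last_purchase
101,34,2024-01-15
102,,2024-02-03
103,51,not_available
"""

df = pd.read_csv(io.StringIO(raw_csv))

# Basic cleaning: coerce unparseable dates to missing values (NaT)
# and count how many ages are missing.
df["last_purchase"] = pd.to_datetime(df["last_purchase"], errors="coerce")
missing_ages = int(df["age"].isna().sum())
```

The `errors="coerce"` option turns noisy values like `not_available` into missing markers rather than raising an error, which keeps the pipeline moving during early exploration.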

What is the role of exploratory data analysis in the data science process?

Exploratory data analysis (EDA) is a phase in the data science process where data scientists analyze datasets to summarize their main characteristics. It helps in understanding data distributions, identifying patterns, spotting anomalies, and testing hypotheses using statistical and graphical techniques.
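A minimal EDA sketch with pandas, using made-up per-customer numbers, might summarize the dataset, check a correlation, and flag potential anomalies with the interquartile range:

```python
import pandas as pd

# Made-up data: per-customer spend and site visits.
df = pd.DataFrame({
    "spend": [120, 80, 95, 300, 110, 90],
    "visits": [4, 2, 3, 9, 4, 3],
})

summary = df.describe()                # central tendency and spread
corr = df["spend"].corr(df["visits"])  # pairwise correlation

# Flag values far above the interquartile range as potential anomalies.
q1, q3 = df["spend"].quantile([0.25, 0.75])
outliers = df[df["spend"] > q3 + 1.5 * (q3 - q1)]
```

Here the 300 spend stands out as an outlier, and the strong spend-visits correlation is exactly the kind of pattern EDA is meant to surface for further questioning.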

How important is data preparation in the data science life cycle?

Data preparation is one of the most critical steps in the data science life cycle. It involves data cleaning and transforming raw data into a format suitable for analysis. Without data preparation, the quality of the data model and subsequent results can be severely impacted.
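As a small illustration of data preparation, again assuming pandas and purely hypothetical values, a numeric column stored as text can be converted and its gaps filled with simple statistics:

```python
import pandas as pd

# Hypothetical raw records: a numeric column stored as text,
# with missing values in both columns.
raw = pd.DataFrame({
    "price": ["19.99", "5.50", None, "7.25"],
    "category": ["a", "b", "b", None],
})

# Convert text to numbers, then fill the missing price with the median.
raw["price"] = pd.to_numeric(raw["price"])
raw["price"] = raw["price"].fillna(raw["price"].median())

# Fill the missing categorical value with the most frequent label.
raw["category"] = raw["category"].fillna(raw["category"].mode()[0])
```

Median and mode imputation are the simplest of many possible strategies; the right choice depends on the data and the model that consumes it.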

What are common challenges faced during the data investigation phase?

Common challenges in the data investigation phase include dealing with incomplete or missing data, handling data drift, and ensuring data quality. A data scientist must employ various techniques to address these issues to obtain reliable results.

What tools are commonly used for data exploration and preparation?

Popular tools for data exploration and preparation include Python libraries like Pandas, NumPy, and Scikit-learn. Data science with Python is particularly popular due to its extensive range of libraries that support data manipulation, visualization, and machine learning.

What is the importance of building a data model in data science?

Building a data model is essential in data science as it helps predict outcomes and generate insights from data. A well-built data model can transform raw data into actionable information, aiding decision-making and strategy formulation. Data models are crucial for turning historical data and new data into practical solutions.
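A minimal modeling sketch with scikit-learn, using invented spend-versus-sales numbers, shows the fit-then-predict pattern that turns historical data into a prediction about new data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented historical data: monthly marketing spend vs. resulting sales.
spend = np.array([[1.0], [2.0], [3.0], [4.0]])
sales = np.array([10.0, 20.0, 30.0, 40.0])

model = LinearRegression().fit(spend, sales)

# Predict sales at a new, unseen spend level.
predicted = model.predict(np.array([[5.0]]))
```

Real projects involve many features and more sophisticated models, but the pattern is the same: fit on historical data, then predict on new data.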

How does a data science professional validate a data model?

A data science professional validates a data model by splitting the data into training and testing sets, employing cross-validation techniques, and using performance metrics such as accuracy, precision, recall, and F1-score. This ensures that the data model performs well on unseen data and is robust.
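A sketch of that validation workflow with scikit-learn, using a synthetic dataset in place of real project data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic dataset standing in for real project data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)  # overall correctness
f1 = f1_score(y_test, y_pred)         # balance of precision and recall

# Cross-validation gives a more stable estimate than a single split.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
```

Scoring on held-out data, rather than the training set, is what guards against mistaking memorization for genuine predictive power.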


This is my weekly newsletter that I call The Deep End because I want to go deeper than results you'll see from searches or AI, incorporating insights from the history of data and data science. Each week I'll go deep to explain a topic that's relevant to people who work with technology. I'll be posting about artificial intelligence, data science, and data ethics.

This newsletter is 100% human written (aside from a quick run through grammar and spell check).

