Why Agile Doesn’t Work for Data Science and How DSLP Fills the Gap
Diogo Ribeiro
Lead Data Scientist and Researcher - Mathematician - Invited Professor - Open to collaboration with academics
In software engineering, Agile methodologies have been widely adopted to help teams adapt to evolving client needs. In data science, however, the nature of the work is fundamentally different. Data science projects are research-based and often lack a clear end product at the start, so methodologies like Agile, designed for software delivery, struggle to keep up with their fluid nature. The Data Science Lifecycle Process (DSLP) emerges as a more fitting project management framework for teams focused on data science.
You’ve Probably Tried Agile…
Many data science teams, including my own, have experimented with Agile methodologies. Standups, Kanban boards, sprints — we’ve all been there. At first, these tools may seem to work, but over time, the effectiveness diminishes. Standups become repetitive, boards are left unmaintained, and sprints lose their meaning. This creates a sense of frustration, where the process no longer serves its purpose, leaving teams unsure of how to improve it.
Why Agile Doesn’t Work for Data Science
The core problem is that Agile frameworks were built for software engineering, where the objective is to deliver a product. Agile is designed to align developers with the changing requirements of end users, maintaining a feedback loop that helps the product adapt incrementally. Daily standups and short sprints are crucial for keeping everyone aligned and for surfacing blockers.
However, data science is inherently a research and development (R&D) process. At the outset of a project, there is often no clear definition of the end product. Much of the work is experimentation and discovery: determining what data is required, how it should be preprocessed, and which modeling techniques are suitable. The product only becomes apparent after significant exploration and experimentation. Until a project reaches the stage of productionizing a model, Agile's feedback loops and sprint cycles feel redundant, because the real "end product" only emerges late in the process.
Agile is better suited for the final steps of a data science project — productionizing the model — but falls short in managing the research and discovery phases. This is where DSLP comes into play.
The Data Science Lifecycle Process (DSLP)
DSLP is a structured approach to managing the full lifecycle of a data science project. Unlike Agile, which is product-focused from the outset, DSLP recognizes that data science is an iterative, research-driven process. It breaks the project into clear stages, each reflecting the unique nature of data science workflows.
Using DSLP, my team saw immediate improvements in our ability to plan, track, and document our work.
The structure of DSLP is flexible but focused, allowing teams to iteratively improve while maintaining clarity and direction. Below, I’ll break down the five key stages of DSLP and how each one contributes to a more organized data science workflow.
The Five Steps of DSLP
DSLP consists of five core stages: Ask, Data, Explore, Experiment, and Model. Each of these stages has a corresponding GitHub Issue type, providing a structured way to document and track the project’s progress.
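To make the issue-per-stage idea concrete, here is a minimal sketch of how a team might open a labelled issue for a stage through GitHub's REST API. The repository name, token handling, and label names are illustrative assumptions on my part, not something DSLP prescribes.

```python
# Minimal sketch: opening a GitHub Issue labelled with its DSLP stage via the REST API.
# The repository, token variable, label names, and issue title are hypothetical.
import os
import requests

REPO = "my-org/churn-project"        # hypothetical repository
TOKEN = os.environ["GITHUB_TOKEN"]   # assumes a personal access token is available
STAGES = ["ask", "data", "explore", "experiment", "model"]  # the five DSLP stage labels

def open_stage_issue(stage: str, title: str, body: str = "") -> int:
    """Create an issue labelled with its DSLP stage and return its issue number."""
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": f"[{stage.upper()}] {title}", "body": body, "labels": [stage]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["number"]

if __name__ == "__main__":
    issue_no = open_stage_issue("ask", "Reduce customer churn in Q3")
    print(f"Opened Ask issue #{issue_no}")
```

Teams can just as well create these issues by hand; the point is simply that every stage maps onto a clearly labelled, trackable issue.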
1. Ask
The Ask stage captures the initial problem that the team aims to solve. This is where the team defines and scopes the value-based problem, and it serves as an anchor for the rest of the project. The issue raised at this stage becomes a live document, updated throughout the project, containing all work that has been done or is in progress. It’s also the primary reference point for anyone new to the project.
2. Data
The Data stage is where the team focuses on gathering and preparing the necessary datasets. Data Issues are created to collaborate on data collection, preprocessing, and curation. This stage ensures that all team members are aligned on the data being used, and it provides a single place to document any transformations or sourcing decisions made along the way.
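As an illustration, a Data Issue for the churn example discussed later in this article might track a small preparation script like the sketch below. The file paths, column names, and imputation choices are hypothetical assumptions, not prescribed by DSLP.

```python
# Minimal sketch of a Data-stage script: load raw data, apply the agreed
# transformations, and write an analysis-ready dataset.
# File paths and column names are hypothetical.
import pandas as pd

RAW_PATH = "data/raw/customers.csv"
CLEAN_PATH = "data/processed/customers_clean.parquet"

def build_dataset(raw_path: str = RAW_PATH, clean_path: str = CLEAN_PATH) -> pd.DataFrame:
    df = pd.read_csv(raw_path)

    # Decisions like these belong in the Data Issue so the whole team can review them.
    df = df.drop_duplicates(subset="customer_id")
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())

    df.to_parquet(clean_path, index=False)
    return df

if __name__ == "__main__":
    build_dataset()
```

Linking the script (or the pull request that adds it) from the Data Issue keeps sourcing and transformation decisions reviewable in one place.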
3. Explore
In the Explore stage, the team conducts exploratory data analysis (EDA) to better understand the data, uncovering patterns, relationships, and potential insights. The Explore Issues created here document what was analyzed and facilitate collaboration, ensuring that everyone has access to the findings.
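A minimal sketch of the kind of analysis an Explore Issue might capture, assuming the cleaned dataset and a binary churned column from the earlier sketch:

```python
# Minimal EDA sketch for an Explore Issue: summary statistics, missingness,
# and correlation with a (hypothetical) binary churn label.
import pandas as pd

df = pd.read_parquet("data/processed/customers_clean.parquet")  # hypothetical path

print(df.describe(include="all"))         # basic distributions per column
print(df.isna().mean().sort_values())     # share of missing values per column

# How do numeric features relate to the target? (assumes a 0/1 'churned' column)
numeric = df.select_dtypes("number").drop(columns="churned", errors="ignore")
print(numeric.corrwith(df["churned"]).sort_values(ascending=False))
```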
4. Experiment
The Experiment stage is where the team begins testing different models, approaches, and techniques. Experiment Issues track various methodologies and capture the results of each experiment. This stage allows the team to try different models, adjust hyperparameters, and frame the problem in different ways, all while keeping a detailed log of what works and what doesn’t.
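As a sketch, an Experiment Issue for the churn example might record a comparison like the one below; the candidate models, feature names, and metric are assumptions chosen for illustration.

```python
# Minimal Experiment-stage sketch: compare candidate models with cross-validation
# and record the scores so they can be pasted into the Experiment Issue.
# Feature and target names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_parquet("data/processed/customers_clean.parquet")
X = df[["tenure_months", "monthly_charges", "support_tickets"]]  # assumed features
y = df["churned"]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
}

results = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
print(pd.Series(results).sort_values(ascending=False))
```

Pasting the resulting scores, along with the exact configuration that produced them, into the Experiment Issue is what keeps the log of what works and what doesn't.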
5. Model
The final stage, Model, focuses on productionizing the successful experiments. Model Issues document the steps taken to prepare the model for deployment, including testing, pipeline creation, and monitoring. This stage ensures that all production-level tasks, such as writing tests or setting up monitoring, are tracked and completed systematically.
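A minimal sketch of the Model-stage work for the churn example, again with hypothetical paths and features: wrap the winning approach in a pipeline, persist it, and add the kind of sanity test a Model Issue would track.

```python
# Minimal Model-stage sketch: wrap the chosen model in a pipeline, persist it,
# and add a basic production sanity check. Paths and features are hypothetical.
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["tenure_months", "monthly_charges", "support_tickets"]

def train_and_save(df: pd.DataFrame, model_path: str = "models/churn.joblib") -> Pipeline:
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", GradientBoostingClassifier()),
    ])
    pipeline.fit(df[FEATURES], df["churned"])
    joblib.dump(pipeline, model_path)
    return pipeline

def test_model_outputs_probabilities(df: pd.DataFrame, pipeline: Pipeline) -> None:
    """The kind of production check a Model Issue would track before deployment."""
    probs = pipeline.predict_proba(df[FEATURES])[:, 1]
    assert ((probs >= 0.0) & (probs <= 1.0)).all()
```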
Example Project: Predicting Customer Churn
To demonstrate how DSLP functions in a real-world scenario, let's consider a project aimed at predicting customer churn. Here's how DSLP would guide the team through the different phases:
Ask: define and scope the problem, for example "Which customers are likely to cancel in the next quarter, and how much churn can we prevent?"
Data: gather and clean the relevant customer, billing, and support records into an analysis-ready dataset, documenting each decision in a Data Issue.
Explore: profile the data, examine how churn varies across customer segments, and surface candidate features in Explore Issues.
Experiment: compare models, features, and hyperparameters, logging each run's setup and results in Experiment Issues.
Model: productionize the best-performing approach, with the tests, pipeline, and monitoring tracked in a Model Issue.
A Kanban Board That Makes Sense for Data Science
One of the most practical aspects of DSLP is its seamless integration with Kanban boards. Each stage in the lifecycle can be represented by a corresponding column on a Kanban board, from the initial Ask to the final Model. This structure allows teams to track their progress clearly and ensures that all tasks — from exploratory analysis to production — are accounted for.
This approach to managing data science projects not only ensures a smooth workflow but also reduces the friction often caused by traditional Agile frameworks.
Final Thoughts
In contrast to Agile, which falls short in managing data science projects, DSLP provides a structured yet flexible framework tailored to the research-driven nature of data science. By focusing on the unique needs of each stage in the data science process — from asking the right questions to experimenting and productionizing models — DSLP allows teams to stay organized and productive, all while fostering collaboration and clarity.
If your data science projects have been struggling to fit into Agile methodologies, consider adopting DSLP. It’s a methodology that understands the complexities of data science and adapts to them, rather than forcing data science to adapt to software development processes.