Why Agile Doesn’t Work for Data Science and How DSLP Fills the Gap
Diogo Ribeiro
Lead Data Scientist and Researcher - Mathematician - Invited Professor - Open to collaboration with academics
In software engineering, Agile methodologies have been widely adopted to help teams adapt to evolving client needs. In data science, however, the nature of the work is fundamentally different. Data science projects are research-based and often lack a clear end product at the start, so methodologies like Agile, designed for software delivery, struggle to keep up with their fluid nature. The Data Science Lifecycle Process (DSLP) emerges as a more fitting project management framework for teams focused on data science.
You’ve Probably Tried Agile…
Many data science teams, including my own, have experimented with Agile methodologies. Standups, Kanban boards, sprints — we’ve all been there. At first, these tools may seem to work, but over time, the effectiveness diminishes. Standups become repetitive, boards are left unmaintained, and sprints lose their meaning. This creates a sense of frustration, where the process no longer serves its purpose, leaving teams unsure of how to improve it.
Why Agile Doesn’t Work for Data Science
The core problem is that Agile frameworks were built for software engineering, where the objective is to deliver a product. Agile is designed to align developers with the changing requirements of end users, maintaining a feedback loop that helps the product adapt incrementally. Daily standups and short sprints are crucial for keeping everyone aligned and for surfacing blockers.
However, data science is inherently a research and development (R&D) process. At the outset of a project, there is often no clear definition of the end product. Much of the work is experimentation and discovery: determining what data is required, how it should be preprocessed, and which modeling techniques are suitable. The product only becomes apparent after significant exploration and experimentation. Until a project reaches the stage of productionizing a model, Agile's feedback loops and sprint cycles feel redundant, because the real "end product" only emerges late in the process.
Agile is better suited for the final steps of a data science project — productionizing the model — but falls short in managing the research and discovery phases. This is where DSLP comes into play.
The Data Science Lifecycle Process (DSLP)
DSLP is a structured approach to managing the full lifecycle of a data science project. Unlike Agile, which is product-focused from the outset, DSLP recognizes that data science is an iterative, research-driven process. It breaks the project into clear stages, each reflecting the unique nature of data science workflows.
Using DSLP, my team saw immediate improvements in our ability to plan, track, and document our work.
The structure of DSLP is flexible but focused, allowing teams to iteratively improve while maintaining clarity and direction. Below, I’ll break down the five key stages of DSLP and how each one contributes to a more organized data science workflow.
The Five Steps of DSLP
DSLP consists of five core stages: Ask, Data, Explore, Experiment, and Model. Each of these stages has a corresponding GitHub Issue type, providing a structured way to document and track the project’s progress.
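To make the issue-per-stage idea concrete, here is a minimal sketch of how a team might open a labelled issue for a stage through GitHub's REST API. The repository name, token handling, and label names are illustrative assumptions on my part, not something DSLP prescribes.

```python
# Minimal sketch: opening a GitHub Issue labelled with its DSLP stage via the REST API.
# The repository, token variable, label names, and issue title are hypothetical.
import os
import requests

REPO = "my-org/churn-project"        # hypothetical repository
TOKEN = os.environ["GITHUB_TOKEN"]   # assumes a personal access token is available
STAGES = ["ask", "data", "explore", "experiment", "model"]  # the five DSLP stage labels

def open_stage_issue(stage: str, title: str, body: str = "") -> int:
    """Create an issue labelled with its DSLP stage and return its issue number."""
    resp = requests.post(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": f"[{stage.upper()}] {title}", "body": body, "labels": [stage]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["number"]

if __name__ == "__main__":
    issue_no = open_stage_issue("ask", "Reduce customer churn in Q3")
    print(f"Opened Ask issue #{issue_no}")
```

Teams can just as well create these issues by hand; the point is simply that every stage maps onto a clearly labelled, trackable issue.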
1. Ask
The Ask stage captures the initial problem that the team aims to solve. This is where the team defines and scopes the value-based problem, and it serves as an anchor for the rest of the project. The issue raised at this stage becomes a live document, updated throughout the project, containing all work that has been done or is in progress. It’s also the primary reference point for anyone new to the project.
2. Data
The Data stage is where the team focuses on gathering and preparing the necessary datasets. Data Issues are created to collaborate on data collection, preprocessing, and curation. This stage ensures that all team members are aligned on the data being used, and it provides a single place to document any transformations or sourcing decisions made along the way.
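As an illustration, a Data Issue for the churn example discussed later in this article might track a small preparation script like the sketch below. The file paths, column names, and imputation choices are hypothetical assumptions, not prescribed by DSLP.

```python
# Minimal sketch of a Data-stage script: load raw data, apply the agreed
# transformations, and write an analysis-ready dataset.
# File paths and column names are hypothetical.
import pandas as pd

RAW_PATH = "data/raw/customers.csv"
CLEAN_PATH = "data/processed/customers_clean.parquet"

def build_dataset(raw_path: str = RAW_PATH, clean_path: str = CLEAN_PATH) -> pd.DataFrame:
    df = pd.read_csv(raw_path)

    # Decisions like these belong in the Data Issue so the whole team can review them.
    df = df.drop_duplicates(subset="customer_id")
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())

    df.to_parquet(clean_path, index=False)
    return df

if __name__ == "__main__":
    build_dataset()
```

Linking the script (or the pull request that adds it) from the Data Issue keeps sourcing and transformation decisions reviewable in one place.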
3. Explore
In the Explore stage, the team conducts exploratory data analysis (EDA) to better understand the data, uncovering patterns, relationships, and potential insights. The Explore Issues created here document what was analyzed and facilitate collaboration, ensuring that everyone has access to the findings.
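A minimal sketch of the kind of analysis an Explore Issue might capture, assuming the cleaned dataset and a binary churned column from the earlier sketch:

```python
# Minimal EDA sketch for an Explore Issue: summary statistics, missingness,
# and correlation with a (hypothetical) binary churn label.
import pandas as pd

df = pd.read_parquet("data/processed/customers_clean.parquet")  # hypothetical path

print(df.describe(include="all"))         # basic distributions per column
print(df.isna().mean().sort_values())     # share of missing values per column

# How do numeric features relate to the target? (assumes a 0/1 'churned' column)
numeric = df.select_dtypes("number").drop(columns="churned", errors="ignore")
print(numeric.corrwith(df["churned"]).sort_values(ascending=False))
```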
4. Experiment
The Experiment stage is where the team begins testing different models, approaches, and techniques. Experiment Issues track various methodologies and capture the results of each experiment. This stage allows the team to try different models, adjust hyperparameters, and frame the problem in different ways, all while keeping a detailed log of what works and what doesn’t.
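As a sketch, an Experiment Issue for the churn example might record a comparison like the one below; the candidate models, feature names, and metric are assumptions chosen for illustration.

```python
# Minimal Experiment-stage sketch: compare candidate models with cross-validation
# and record the scores so they can be pasted into the Experiment Issue.
# Feature and target names are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_parquet("data/processed/customers_clean.parquet")
X = df[["tenure_months", "monthly_charges", "support_tickets"]]  # assumed features
y = df["churned"]

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
}

results = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
print(pd.Series(results).sort_values(ascending=False))
```

Pasting the resulting scores, along with the exact configuration that produced them, into the Experiment Issue is what keeps the log of what works and what doesn't.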
5. Model
The final stage, Model, focuses on productionizing the successful experiments. Model Issues document the steps taken to prepare the model for deployment, including testing, pipeline creation, and monitoring. This stage ensures that all production-level tasks, such as writing tests or setting up monitoring, are tracked and completed systematically.
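A minimal sketch of the Model-stage work for the churn example, again with hypothetical paths and features: wrap the winning approach in a pipeline, persist it, and add the kind of sanity test a Model Issue would track.

```python
# Minimal Model-stage sketch: wrap the chosen model in a pipeline, persist it,
# and add a basic production sanity check. Paths and features are hypothetical.
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["tenure_months", "monthly_charges", "support_tickets"]

def train_and_save(df: pd.DataFrame, model_path: str = "models/churn.joblib") -> Pipeline:
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", GradientBoostingClassifier()),
    ])
    pipeline.fit(df[FEATURES], df["churned"])
    joblib.dump(pipeline, model_path)
    return pipeline

def test_model_outputs_probabilities(df: pd.DataFrame, pipeline: Pipeline) -> None:
    """The kind of production check a Model Issue would track before deployment."""
    probs = pipeline.predict_proba(df[FEATURES])[:, 1]
    assert ((probs >= 0.0) & (probs <= 1.0)).all()
```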
Example Project: Predicting Customer Churn
To demonstrate how DSLP functions in a real-world scenario, let's consider a project aimed at predicting customer churn. Here's how DSLP would guide the team through the different phases:
Ask: define and scope the problem, for example "Which customers are likely to cancel in the next quarter, and how much churn can we prevent?"
Data: gather and clean the relevant customer, billing, and support records into an analysis-ready dataset, documenting each decision in a Data Issue.
Explore: profile the data, examine how churn varies across customer segments, and surface candidate features in Explore Issues.
Experiment: compare models, features, and hyperparameters, logging each run's setup and results in Experiment Issues.
Model: productionize the best-performing approach, with the tests, pipeline, and monitoring tracked in a Model Issue.
A Kanban Board That Makes Sense for Data Science
One of the most practical aspects of DSLP is its seamless integration with Kanban boards. Each stage in the lifecycle can be represented by a corresponding column on a Kanban board, from the initial Ask to the final Model. This structure allows teams to track their progress clearly and ensures that all tasks — from exploratory analysis to production — are accounted for.
This approach to managing data science projects not only ensures a smooth workflow but also reduces the friction often caused by traditional Agile frameworks.
Final Thoughts
In contrast to Agile, which falls short in managing data science projects, DSLP provides a structured yet flexible framework tailored to the research-driven nature of data science. By focusing on the unique needs of each stage in the data science process — from asking the right questions to experimenting and productionizing models — DSLP allows teams to stay organized and productive, all while fostering collaboration and clarity.
If your data science projects have been struggling to fit into Agile methodologies, consider adopting DSLP. It’s a methodology that understands the complexities of data science and adapts to them, rather than forcing data science to adapt to software development processes.