
18 steps to apply Agile in Data Science

Gianmario Spacagna

Principal AI Engineer @ Moonsong Labs | Web3 and Decentralized AI | Consultant | Agents Development

Published: November 12, 2015

Introduction

It is very common in software development to start a new project in a highly uncertain and chaotic scenario, surrounded by plenty of ideas about what features we might want to implement. In Data Science the problem is amplified by its nondeterministic nature. At the start of a Data Science project, not only do we not know what we are trying to implement, we also do not know how to implement it or under which circumstances it would be possible and correct.

This initial lack of structure often manifests as an early spike of unnecessary development and, later in the project, as technical debt and unexplained inconsistencies. You might spend a lot of resources before finding out that the delivered solution simply does not fit the business nature of the problem.

In Agile Data Science the goal should not be producing charts and reports or hacky scripts calling some machine learning library. In Agile Data Science we want to iteratively build production-quality applications that solve the true business needs by extracting hidden knowledge from the data.

This is the final summarising post of the Agile Data Science Iteration 0 series:

* The problem definition and the evaluation strategy
* The initial investigation
* The simple solution
* The ETL
* The Hypothesis-Driven Analysis (HDA)
* The complete checklist (this post)

The Complete Checklist

1. Rigorously define the business problem we are attempting to solve and why it is important
2. Define your objective acceptance criteria
3. Develop the validation framework (i.e. the acceptance test)
4. Stop thinking, start googling!
5. Gather the initial dataset into a proper infrastructure
6. Run an initial Exploratory Data Analysis (EDA) targeted at understanding the underlying dataset
7. Define and quickly develop the simplest solution to the problem
8. Release/demo the first basic solution
9. Research the background for ways to improve the basic solution
10. Gather additional data into a proper infrastructure (if required)
11. Run an ad-hoc Exploratory Data Analysis (EDA)
12. Propose a better solution minimising potential risks and marginal gain
13. Develop the data sanity check
14. Define the data types of your application domain
15. Develop the ETL and output the normalised data into a proper infrastructure
16. Clearly state all of the assumptions/hypotheses and document whether they have been verified and how they can be verified
17. Develop the automated Hypothesis-Driven Analysis (HDA), consisting of hypothesis validation plus a statistics summary, on top of the normalised data
18. Analyse the output of the automated HDA to adjust/revise the proposed solution
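
To make the acceptance-criteria, validation-framework and simplest-solution items of the checklist more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `AcceptanceCriterion` class, the choice of accuracy as the metric, the 0.6 threshold and the toy labels are all hypothetical and do not come from the series. The "simplest solution" is assumed to be a majority-class predictor.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class AcceptanceCriterion:
    """Objective pass/fail rule agreed up front (hypothetical example)."""
    metric_name: str
    threshold: float

    def is_met(self, score: float) -> bool:
        return score >= self.threshold


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class_baseline(train_labels: Sequence[int]) -> Callable[[Sequence], List[int]]:
    """Simplest possible solution: always predict the most frequent label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda rows: [most_common for _ in rows]


def acceptance_test(predict, features, labels,
                    criterion: AcceptanceCriterion) -> Tuple[float, bool]:
    """Validation framework: every candidate solution is scored the same way."""
    score = accuracy(labels, predict(features))
    return score, criterion.is_met(score)


# Toy data, invented for illustration only.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_features = [[1.0], [2.0], [3.0], [4.0], [5.0]]
test_labels = [0, 0, 1, 0, 0]

criterion = AcceptanceCriterion(metric_name="accuracy", threshold=0.6)
baseline = majority_class_baseline(train_labels)
score, accepted = acceptance_test(baseline, test_features, test_labels, criterion)
print(f"{criterion.metric_name}={score:.2f} accepted={accepted}")
```

Because `acceptance_test` takes any `predict` callable, a later and more sophisticated model can be scored against exactly the same criterion as the baseline, which is what makes the comparison objective.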

At the end of Iteration 0 you have a very solid starting point for your project, and you can now follow the typical Agile development cycle, whether you prefer Scrum, Kanban, a mix of the two or your own ad-hoc methodology.

Whether you want a strict or a flexible workflow, keep in mind that the main difference from Agile iterations in software development is that a ticket is typically broad and open-ended. You should not be surprised if the majority of your tickets get split into multiple sub-tickets after the initial investigation of the problem. You should allow subtasks to be created even after sprint planning. In some cases you may prefer to mark them as blockers and re-scope them into the next sprint; in other cases you may allow them to affect the current sprint.
What is important is that you start implementing production-quality code only when the requirements and the acceptance test are well defined. In Data Science this is not very likely to happen all the time. Every time you are presented with an open problem to investigate and solve, you should try to break it into research/analysis and development subtasks.

What not to do?

Do not start any development without having done prior detailed research/investigation
Do not just deliver analysis code in notebooks; after your investigation, bring the code up to production-quality standards
Do not blindly trust external libraries or APIs if you don't know exactly what they do and return; run some tests if needed
Do not generate manual reports of your findings until the experiments are reproducible and automated
Do not deploy any model unless all of the assumptions have been stated and verified
Do not be too lazy to learn better technologies and methodologies!
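
The rule about stated and verified assumptions can be enforced mechanically. The sketch below shows one possible shape for it and is not taken from the series: the `Hypothesis` class, the `run_hda`, `summary_statistics` and `ready_to_deploy` helpers and the toy records are all hypothetical. Each assumption is written in words next to a check that runs on the normalised data, so the resulting report is reproducible rather than manual.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass(frozen=True)
class Hypothesis:
    """An assumption stated in words plus an automated way to verify it."""
    statement: str
    check: Callable[[List[Record]], bool]


def run_hda(hypotheses: List[Hypothesis], data: List[Record]) -> Dict[str, bool]:
    """Evaluate every hypothesis on the data: statement -> verified or not."""
    return {h.statement: h.check(data) for h in hypotheses}


def summary_statistics(data: List[Record], field: str) -> Dict[str, float]:
    """Small statistics summary reported alongside the hypothesis results."""
    values = [row[field] for row in data]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def ready_to_deploy(results: Dict[str, bool]) -> bool:
    """Deployment gate: assumptions must be stated (non-empty) and all verified."""
    return bool(results) and all(results.values())


# Toy normalised records, invented for illustration only.
data = [
    {"amount": 12.5, "age": 34},
    {"amount": 80.0, "age": 51},
    {"amount": 3.2, "age": 27},
]

hypotheses = [
    Hypothesis("amounts are strictly positive",
               lambda rows: all(r["amount"] > 0 for r in rows)),
    Hypothesis("all customers are adults",
               lambda rows: all(r["age"] >= 18 for r in rows)),
    Hypothesis("no amount exceeds 50",
               lambda rows: all(r["amount"] <= 50 for r in rows)),
]

results = run_hda(hypotheses, data)
for statement, verified in results.items():
    print(f"{'OK  ' if verified else 'FAIL'} {statement}")
print("amount summary:", summary_statistics(data, "amount"))
print("ready to deploy:", ready_to_deploy(results))
```

In this toy run the third hypothesis fails, so the gate stays closed until the assumption is revised or the data issue is explained. An empty register also keeps the gate closed, since nothing has been stated yet.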

To conclude, in this series of posts I wanted to share some of my experience of starting new Data Science projects and the common problems that I have seen addressed in a confused and chaotic way. I hope that by following these guidelines you can reduce the technical debt of the project and the risk of working for several months without ever delivering a correct and working solution.

More details of the Agile cycle for Data Science applications, and in particular how to time-box open-ended questions, will be covered in another post. Stay tuned and get ready to run!

***

The Hypothesis-Driven Analysis << prev

