"The ETL" - Agile Data Science Iteration 0
Gianmario Spacagna
Principal AI Engineer @ Moonsong Labs | Web3 and Decentralized AI | Consultant | Agents Development
This is the fourth post of the Agile Data Science Iteration 0 series:
- The problem definition and the evaluation strategy
- The initial investigation
- The simple solution
- * The ETL <=
- The Hypothesis-Driven Analysis (HDA)
- The final planning
Previously
What we have achieved so far (see previous posts above):
- Rigorous definition of the business problem we are attempting to solve and why it is important
- Define your objective acceptance criteria
- Develop the validation framework (ergo, the acceptance test)
- Stop thinking, start googling!
- Gather initial dataset into proper infrastructure
- Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
- Define and quickly develop the simplest solution to the problem
- Release/demo first basic solution
- Research of background for ways to improve the basic solution
- Gather additional data into proper infrastructure (if required)
- Ad-hoc Exploratory Data Analysis (EDA)
- Propose better solution minimising potential risks and marginal gain
At this stage you already have a benchmark reference from the simple solution. You have done your research on how to improve it and meet the business requirements. You have a good overview of the initial dataset used for solving this problem. You can now start the engineering stage and produce the right dataset for your application domain and the required data quality.
THE ETL
13. Develop the Data Sanity check
There is no dataset on Earth that does not require a sanity check. Filter out all the malformed, invalid, and irrelevant records. Sometimes a cleansing step is also worthwhile: instead of throwing everything away, you may try to sanitise the bad records.
Make sure to repeat this process every time you run your model on a different dataset. Make sure this process is:
- automated
- logging error messages
- stopping the execution of your job, especially if you are handing your application over to someone else who might run it against the wrong dataset
As a data scientist you don't want to be blamed for having implemented a non-working model simply because someone else used it in the wrong way.
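A minimal sketch of what such a check might look like in Python with pandas; the column names and validation rules here are purely hypothetical, stand-ins for your own schema and business constraints:

```python
import logging
import sys

import pandas as pd

logger = logging.getLogger("sanity_check")

# Hypothetical required schema: replace with your own domain constraints.
REQUIRED_COLUMNS = ["customer_id", "postcode", "timestamp", "amount"]


def sanity_check(df: pd.DataFrame) -> pd.DataFrame:
    """Filter out malformed records, log what was dropped, and fail fast
    if the dataset does not look like the one the model expects."""
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        logger.error("Wrong dataset: missing columns %s", sorted(missing))
        sys.exit(1)  # stop the job instead of producing misleading results

    before = len(df)
    clean = df.dropna(subset=REQUIRED_COLUMNS)   # malformed records
    clean = clean[clean["amount"] >= 0]          # invalid records
    clean = clean.drop_duplicates()              # duplicated records

    dropped = before - len(clean)
    if dropped:
        logger.warning("Dropped %d of %d records during sanity check", dropped, before)
    return clean
```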
14. Define the Data Types of your application domain
Your data types are the first-class citizens of your application. Define them carefully, accounting for how you would like to model your data in your domain rather than how the data currently looks. It might be worth considering optional fields, structured fields (for example, a postcode might be represented as a string or as a triple of district code, sector code and unit code), identifiers that may require a long instead of an integer, categorical values hard coded as enumerations, timestamps stored as epoch time, and so on. Avoid duplicating information in your data types; use primary keys to join your data collections later on.
Premature optimisation is discouraged, but as a rule of thumb try to keep your types light. That is, do not use strings to represent numbers or any other expensive data structure. If you need to combine multiple fields into a single identifier, use tuples instead of concatenating them into a single expensive object. It might make no difference now, but refactoring the code to accommodate a different data type is one of the most expensive and painful tasks. Moreover, expensive types will cause scalability issues pretty soon. There will always be time to fix it later, but if you have to make a choice now and it requires the same effort, why not do it well?
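A small sketch of how such domain types could look in Python, using dataclasses and enums; the transaction domain and its fields are invented purely to illustrate the points above (optional fields, a structured postcode, enumerated categories, epoch timestamps, tuple keys):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Channel(Enum):            # categorical values as an enumeration
    ONLINE = 1
    IN_STORE = 2


@dataclass(frozen=True)
class Postcode:                 # structured field instead of a free-form string
    district: str
    sector: str
    unit: str


@dataclass(frozen=True)
class Transaction:
    transaction_id: int         # identifier wide enough for a long
    customer_id: int            # primary key used to join with the customer collection
    timestamp: int              # epoch time in seconds
    amount: float               # a number, not a string
    channel: Channel
    postcode: Optional[Postcode] = None   # optional field


# Combine multiple fields into a composite key with a tuple,
# not by concatenating them into a single expensive string.
CompositeKey = Tuple[int, int]  # (customer_id, transaction_id)
```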
15. Develop the ETL and output the normalised data into a proper infrastructure
The goal of your ETL is now to produce the desired output according to the previously defined data types, so that you don't need any additional pre-processing in your application and all of the requirements on data format and quality are verified.
If the raw data don't match the desired output format, this is where you want to do all of your transformations.
Any ETL job should always be finalised by persisting its output to some data storage. Doing the ETL as an on-the-fly pre-processing step of your application is discouraged: you want to quickly repeat all of your analyses on top of the normalised data rather than re-run the ETL every single time.
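A compact sketch of such an ETL job in Python with pandas, reusing the hypothetical sanity_check from step 13 and persisting the normalised output as Parquet (which assumes pyarrow or fastparquet is installed); the paths and column transformations are illustrative only:

```python
import pandas as pd


def run_etl(raw_path: str, output_path: str) -> None:
    """Extract the raw data, normalise it into the domain format,
    and persist the result so analyses can run on top of it directly."""
    raw = pd.read_csv(raw_path)                                # extract

    clean = sanity_check(raw)                                  # data sanity check (step 13)

    normalised = clean.assign(                                 # transform to the agreed types
        postcode=clean["postcode"].str.upper().str.strip(),    # normalise postcodes
        amount=clean["amount"].astype(float),                  # numbers as numbers, not strings
    )

    normalised.to_parquet(output_path, index=False)            # load: persist, never on-the-fly


if __name__ == "__main__":
    run_etl("raw_transactions.csv", "normalised_transactions.parquet")
```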
You can now forget about the original raw data and move your focus onto the high-quality dataset that meets your application requirements. Time to develop the model? Not yet. How many assumptions have you made so far, and how many are you going to make in your model? Some data assumptions can be verified during the data check, but what about your formulated hypotheses? The goal of delivering a data product is to solve the business problem in its real context; unverified assumptions can easily invalidate your solution.
***
Details of how to perform the Hypothesis-Driven Analysis will follow in the next post of the "Agile Data Science Iteration 0" series, stay tuned.
Meanwhile, why not share or comment below?
The Simple Solution << prev | next >> WIP
In the business of innovation, it is riskier to be risk averse.