"The ETL" - Agile Data Science Iteration 0
Gianmario Spacagna
Principal AI Engineer @ Moonsong Labs | Web3 and Decentralized AI | Consultant | Agents Development
This is the fourth post of the Agile Data Science Iteration 0 series:
- The problem definition and the evaluation strategy
- The initial investigation
- The simple solution
- * The ETL <=
- The Hypothesis-Driven Analysis (HDA)
- The final planning
Previously
What we have achieved so far (see previous posts above):
- Rigorous definition of the business problem we are attempting to solve and why it is important
- Define your objective acceptance criteria
- Develop the validation framework (ergo, the acceptance test)
- Stop thinking, start googling!
- Gather initial dataset into proper infrastructure
- Initial Exploratory Data Analysis (EDA) targeted to understanding the underlying dataset
- Define and quickly develop the simplest solution to the problem
- Release/demo first basic solution
- Research of background for ways to improve the basic solution
- Gather additional data into proper infrastructure (if required)
- Ad-hoc Exploratory Data Analysis (EDA)
- Propose better solution minimising potential risks and marginal gain
At this stage you already have a benchmark reference from the simple solution. You have done your research on how to improve it and meet the business requirements. You have a good overview of the initial dataset used for solving this problem. You can now start the engineering stage and produce the right dataset for your application domain and the required data quality.
THE ETL
13. Develop the Data Sanity check
There is no dataset on Earth that does not require a sanity check. Filter out all the malformed, invalid, and irrelevant records. Sometimes a cleansing step is also worthwhile: instead of throwing everything away, you may try to sanitise the bad records.
Make sure to repeat this process every time you run your model on a different dataset. Make sure this process is:
- automated
- logging error messages
- stopping the execution of your job, especially if you are handing your application over to someone else who might run it against the wrong dataset
As a data scientist you don't want to be blamed for having implemented a non-working model simply because someone else used it in the wrong way.
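A minimal sketch of what such a check might look like in Python with pandas; the column names and validation rules here are purely hypothetical, stand-ins for your own schema and business constraints:

```python
import logging
import sys

import pandas as pd

logger = logging.getLogger("sanity_check")

# Hypothetical required schema: replace with your own domain constraints.
REQUIRED_COLUMNS = ["customer_id", "postcode", "timestamp", "amount"]


def sanity_check(df: pd.DataFrame) -> pd.DataFrame:
    """Filter out malformed records, log what was dropped, and fail fast
    if the dataset does not look like the one the model expects."""
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        logger.error("Wrong dataset: missing columns %s", sorted(missing))
        sys.exit(1)  # stop the job instead of producing misleading results

    before = len(df)
    clean = df.dropna(subset=REQUIRED_COLUMNS)   # malformed records
    clean = clean[clean["amount"] >= 0]          # invalid records
    clean = clean.drop_duplicates()              # duplicated records

    dropped = before - len(clean)
    if dropped:
        logger.warning("Dropped %d of %d records during sanity check", dropped, before)
    return clean
```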
14. Define the Data Types of your application domain
Your data types are the first-class citizens of your application. Define them carefully, accounting for how you would like to model your data in your domain rather than how the data currently looks. It might be worth considering optional fields, structured fields (for example, a postcode might be represented as a string or as a triple of district code, sector code and unit code), identifiers that may require a long instead of an integer, categorical values hard coded as enumerations, timestamps stored as epoch time, and so on. Avoid duplicating information in your data types; use primary keys to join your data collections later on.
Premature optimisation is discouraged, but as a rule of thumb try to keep your types light. That is, do not use strings to represent numbers or any other expensive data structure. If you need to combine multiple fields into a single identifier, use tuples instead of concatenating them into a single expensive object. It might make no difference now, but refactoring the code to accommodate a different data type is one of the most expensive and painful tasks. Moreover, expensive types will cause scalability issues pretty soon. There will always be time to fix it later, but if you have to make a choice now and it requires the same effort, why not do it well?
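A small sketch of how such domain types could look in Python, using dataclasses and enums; the transaction domain and its fields are invented purely to illustrate the points above (optional fields, a structured postcode, enumerated categories, epoch timestamps, tuple keys):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Channel(Enum):            # categorical values as an enumeration
    ONLINE = 1
    IN_STORE = 2


@dataclass(frozen=True)
class Postcode:                 # structured field instead of a free-form string
    district: str
    sector: str
    unit: str


@dataclass(frozen=True)
class Transaction:
    transaction_id: int         # identifier wide enough for a long
    customer_id: int            # primary key used to join with the customer collection
    timestamp: int              # epoch time in seconds
    amount: float               # a number, not a string
    channel: Channel
    postcode: Optional[Postcode] = None   # optional field


# Combine multiple fields into a composite key with a tuple,
# not by concatenating them into a single expensive string.
CompositeKey = Tuple[int, int]  # (customer_id, transaction_id)
```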
15. Develop the ETL and output the normalised data into a proper infrastructure
The goal of your ETL is now to produce the desired output according to the previously defined data types, so that you don't need any additional pre-processing in your application and all of the requirements on data format and quality are verified.
If the raw data don't match the desired output format, this is where you want to do all of your transformations.
Any ETL job should always be finalised by persisting its output to some data storage. Doing the ETL as an on-the-fly pre-processing step of your application is discouraged: you want to quickly repeat all of your analyses on top of the normalised data rather than re-run the ETL every single time.
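A compact sketch of such an ETL job in Python with pandas, reusing the hypothetical sanity_check from step 13 and persisting the normalised output as Parquet (which assumes pyarrow or fastparquet is installed); the paths and column transformations are illustrative only:

```python
import pandas as pd


def run_etl(raw_path: str, output_path: str) -> None:
    """Extract the raw data, normalise it into the domain format,
    and persist the result so analyses can run on top of it directly."""
    raw = pd.read_csv(raw_path)                                # extract

    clean = sanity_check(raw)                                  # data sanity check (step 13)

    normalised = clean.assign(                                 # transform to the agreed types
        postcode=clean["postcode"].str.upper().str.strip(),    # normalise postcodes
        amount=clean["amount"].astype(float),                  # numbers as numbers, not strings
    )

    normalised.to_parquet(output_path, index=False)            # load: persist, never on-the-fly


if __name__ == "__main__":
    run_etl("raw_transactions.csv", "normalised_transactions.parquet")
```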
You can now forget about the original raw data and move your focus onto the high-quality dataset that meets your application requirements. Time to develop the model? Not yet. How many assumptions have you made so far, and how many are you going to make in your model? Some data assumptions can be verified during the data check, but what about your formulated hypotheses? The goal of delivering a data product is to solve the business problem in its real context; unverified assumptions can easily invalidate your solution.
***
Details of how to perform the Hypothesis-Driven Analysis will follow in the next post of the "Agile Data Science Iteration 0" series, stay tuned.
Meanwhile, why not share or comment below?
The Simple Solution << prev | next >> WIP
In the business of innovation, it is riskier to be risk averse.