Automated ETL vs Manual Coding: Which Is the Better Choice in Data Management?

Many technical practitioners have written in this space. Having been consistently involved in redesigning legacy platforms as modern, next-gen cloud platforms for many customers, I thought I would touch on this topic once again. Since the emergence of GUI ETL tools, the data management community has debated the merits of using such tools as opposed to manual coding in SQL, Python, and other programming or scripting languages. I will try to present a case for automated ETL, and provide some examples from my own experience to demonstrate how data analysts, support engineers and developers benefit from using ETL tools – or better yet, BI tools with built-in ETL functionality.

We took on our first cloud migration project three years ago, in which we had to move an entire legacy platform running on MS SQL Server to Snowflake. The legacy platform was built purely on SQL, views, materialized views and a third-party interface to manage scheduling. The customer was struggling with data drift, the addition of new sources and plenty of overhead managing data problems in the live environment on which the OLAP cube depended. An incremental cube refresh was taking around 18-24 hours. We had a very lean team and were reverse engineering the processes to understand them; as usual, there was hardly any documentation around the design, the thought process or even descriptions of the complex routines. We were involved right from assessment, technology platform selection and tool evaluation through to creating prototypes prior to procurement.

Before we dive into it, let's take a moment to explain the concept of Extract-Transform-Load, or by its more common abbreviation, ETL. It is a crucial stage in the data preparation process, as it ensures we are working with clean and accurate data and that there is a single source of truth for every piece of information that appears in our analytical database. It consists of three stages (a minimal code sketch follows the list):

  • Extracting the data stored in various systems (flat files, cloud repositories, relational databases, etc.) into a staging area.
  • Transforming said data according to a unified schema, which could include filtering out certain fields/columns or modifying data from different tables to make sure all the data sources speak the same language.
  • Loading the newly-transformed data into a centralized data repository or data warehouse.
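To make the three stages concrete, here is a minimal, hypothetical sketch in Python using pandas and SQLAlchemy. The table name, columns and SQLite target are illustrative assumptions only, not the design from any of the projects described here.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: in a real pipeline these frames would come from flat files, APIs or
# source databases (pd.read_csv, pd.read_sql, ...); inline data keeps the sketch runnable.
orders_eu = pd.DataFrame({"order_id": [1, 2],
                          "order_dt": ["2024-01-05", "2024-01-06"],
                          "amount": ["250.0", "75.5"]})
orders_us = pd.DataFrame({"order_id": [3],
                          "order_date": ["2024-01-07"],
                          "amount": ["120.0"]})

# Transform: align both sources to one schema and fix types so they "speak the same language".
orders_eu = orders_eu.rename(columns={"order_dt": "order_date"})
combined = pd.concat([orders_eu, orders_us], ignore_index=True)
combined["amount"] = combined["amount"].astype(float)

# Load: write the unified data to a warehouse table. SQLite keeps the example
# self-contained; a real project would point the engine at Snowflake or another warehouse.
engine = create_engine("sqlite:///warehouse.db")
combined.to_sql("fact_orders", engine, if_exists="append", index=False)
```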

The problem with general-purpose programming languages is that they are typically modelled around object-oriented designs. They manipulate data element by element rather than by rows or columns, which is inefficient for this kind of work. You cannot handle heterogeneous sources without building on top of existing libraries, which can take considerable time depending on the source. Functionality is spread across different libraries and files, so a lot of integration code is needed before you can start on anything. It is also tough to manage run and recovery paths and handshakes, because everything sits at the same level of code: adding two numbers from the same row, aggregating rows, combining data sets from different sources, and then auto-run, auto-recovery, handshakes, and so on.

The biggest pain point is when the calculations change frequently. Manipulating individual data elements means loops; when the logic changes, you may have to add or remove nested loops, and joins across different sources quickly become a loop within a loop within a loop, and so on. It is also very hard to do performance tuning that makes a big impact in a short amount of time.
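A small, made-up example of the point above: the same join written element by element versus as a single set-based operation. The data and column names are invented purely for illustration.

```python
import pandas as pd

# Two tiny, invented data sets standing in for feeds from different sources.
orders = [{"order_id": 1, "customer_id": 10, "amount": 250.0},
          {"order_id": 2, "customer_id": 11, "amount": 75.5}]
customers = [{"customer_id": 10, "region": "EU"},
             {"customer_id": 11, "region": "US"}]

# Element-by-element approach: a join becomes a loop within a loop, and every
# logic change means reworking the loop structure.
joined = []
for o in orders:
    for c in customers:
        if o["customer_id"] == c["customer_id"]:
            joined.append({**o, "region": c["region"]})

# Set-based approach: the same join expressed declaratively in one line, which is
# closer to how SQL and ETL tools model the work and is far easier to change.
joined_df = pd.merge(pd.DataFrame(orders), pd.DataFrame(customers), on="customer_id")
print(joined_df)
```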

Another challenge with a programming stack arises when taking knowledge transition from an incumbent vendor or during a change of platform. You need to hire or deploy people with exactly the right skill set, whereas ETL tool skills transfer more easily: an Informatica resource can pick up Talend or Matillion with minimal hand-holding.

We had a scenario where we needed to decide on the right middleware tool – Informatica (IICS), SnapLogic, Matillion or Python/PySpark – as a replacement for the existing MS SQL mechanism handling transformation activities on the cloud platform.

 The first question is "How complex is your environment?" Complexity can be a measure of how many source systems you have feeding your data warehouse, how many "sources of truth" for the same basic piece(s) of information reside in your various source systems, how difficult it is to transform data from the format that exists in your source system to the format needed in the (data warehouse) target system, the need to have metadata in your warehouse to give the data an appropriate pedigree, and the need to develop feedback loops to clean data in your source systems as problems are identified in the load process. Generally speaking, the more complex your environment is, the more likely you are to benefit from a tool as opposed to writing your own routines.

Secondly, are your source systems constantly changing (or likely to)? If so, you may be better off using an automated tool instead of creating a maintenance nightmare for yourself. What about your target system? If a lack of experience in creating data warehouses is likely to cause you to revisit the design of your warehouse schema, automation could make life easier there as well.

Third is available resources and time. If your project is small or simple, you may be better off utilizing existing skill sets rather than taking the time to climb the learning curve of a new ETL tool. But if developer resources are scarce, you are in a time crunch, or your project is complex based on the factors mentioned above, you may be better off using an ETL tool.

Other advantages of ETL tools in data warehousing scenarios include the following; a sketch of what hand-coding just the first few involves appears after the list:

  • A set of comprehensive Scheduling Mechanisms
  • Logging, Audit and Metadata support
  • Easier maintenance, especially when multiple developers are involved
  • Robust execution control and error handling
  • Parallel processing
  • High Availability
  • Partitioning / Push-down Optimization capabilities
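To illustrate why those first few bullets matter, here is a hypothetical sketch of the minimum retry-and-audit plumbing a hand-written pipeline needs before any business logic is added. The function and step names are invented; ETL tools ship this kind of capability (plus scheduling, lineage and alerting) out of the box.

```python
import logging
import time
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def run_step(name, func, retries=3, backoff_seconds=30):
    """Run one pipeline step with retries and a simple audit trail."""
    for attempt in range(1, retries + 1):
        started = datetime.now(timezone.utc)
        try:
            rows = func()
            logging.info("step=%s attempt=%d rows=%d started=%s status=success",
                         name, attempt, rows, started.isoformat())
            return rows
        except Exception as exc:
            logging.error("step=%s attempt=%d error=%s", name, attempt, exc)
            if attempt == retries:
                raise
            time.sleep(backoff_seconds)

# Example usage with a stand-in step that "loads" 100 rows.
run_step("load_fact_orders", lambda: 100)
```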

3 Days vs 3 Minutes

One of our customers in the security sector had been using SQL for manual data mash-ups. The company covered global regions, and each region generated a massive amount of data. Joining all these sources took three people devoting almost three days every week, and all this work was done before any analysis could even be performed – just to prepare the data. The company realized it was time to implement a BI solution, and soon discovered that a single automated ETL tool could complete this union within minutes, with no additional work hours invested once the system and data model were set up. A configurable, metadata-driven design built over Snowflake, with Matillion as the ETL tool, helped cut the preparation time down to 3 minutes. It also allowed us to spin up a larger virtual warehouse to incrementally refresh the OLAP cube using the CData connector for the SSAS framework over Snowflake, reducing the refresh time from 24 hours to 3.5 hours.
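For readers unfamiliar with the term, here is a rough, hypothetical illustration of what "metadata driven" means in this context. It is not the actual customer configuration, just the general idea that source-to-target mappings live in metadata and one generic routine drives them all.

```python
# Hypothetical metadata describing source-to-target mappings. In a real project this
# would typically live in a control table or configuration files, not in code.
PIPELINES = [
    {"source": "orders_eu.csv", "target": "stg_orders",    "load_type": "incremental", "key": "order_id"},
    {"source": "orders_us.csv", "target": "stg_orders",    "load_type": "incremental", "key": "order_id"},
    {"source": "customers.csv", "target": "stg_customers", "load_type": "full",        "key": "customer_id"},
]

def load(entry):
    # A single generic loader interprets each metadata entry; adding a new source
    # becomes a one-line metadata change instead of a new hand-written routine.
    print(f"Loading {entry['source']} into {entry['target']} "
          f"({entry['load_type']} load, key={entry['key']})")

for entry in PIPELINES:
    load(entry)
```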

In another project engagement, where we had to extend the data compass configuration over Redshift, we reduced production support overheads by almost 30%.

 …And the Counter-Argument

Of course, all this is not to say that automated ETL is the way to go 100% of the time, for every possible use case and organization. Companies with large data science teams might find it easier to continue using their tried-and-tested manual processes instead of implementing a new tool, which will obviously have a learning curve and require some getting used to. Developers who are used to thinking in code might be frustrated by visual interfaces. And since every automated ETL tool has its own workflow and logic, it can seem restrictive compared to the endless possibilities one could create manually. Still, the case for visual tools holds: they are built from structured, reusable components, whereas with programming languages everything depends on the variety of flows, libraries, syntaxes, parsing mechanisms, hardware, RAM, CPU and ongoing maintenance involved.

 The Bottom Line: Time to Automate

Code “purists” will always have reservations about any tool that simplifies manual coding – but in many cases the advantages still outweigh the drawbacks. For example, C++ is definitely more powerful than JavaScript, but most would agree that in many cases the latter is preferable. I believe that in most cases where extensive data preparation is required, automated ETL will provide a cleaner, more transparent and more straightforward experience than manual coding – particularly when working with big, rapidly changing data.

As with any other problem in this field, there are multiple ways to arrive at a solution. The key is understanding the trade-offs you have to make and then determining what solution works best for your environment. The answer to that question is always based on technical issues, personnel issues, and the culture of the company involved.
