Automated ETL vs Manual Coding: Which Is the Better Choice in Data Management?

Many technical practitioners have written in this space. Having been consistently involved in redesigning legacy platforms as modern, next-gen cloud platforms for many customers, I thought I would touch on this topic once again. Since the emergence of GUI ETL tools, the data management community has debated the merits of using such tools as opposed to manual coding in SQL, Python, and other programming or scripting languages. I will try to present a case for automated ETL, and provide some examples from my own experience to demonstrate how data analysts, support engineers and developers benefit from using ETL tools – or better yet, BI tools with built-in ETL functionality.

We took on our first cloud migration project three years ago, in which we had to move an entire legacy platform running on MS SQL Server to Snowflake. The legacy platform was built purely on SQL, views, materialized views and a third-party interface to manage scheduling. The customer was struggling with data drift, the addition of new sources and plenty of overhead managing data problems in the live environment on which the OLAP cube depended. An incremental cube refresh was taking around 18-24 hours. We had a very lean team and were reverse engineering the processes to understand them; as usual, there was hardly any documentation around the design, the thought process or even descriptions of the complex routines. We were involved right from assessment, technology platform selection and tool evaluation through to creating prototypes prior to procurement.

Before we dive into it, let's take a moment to explain the concept of Extract-Transform-Load, or by its more common abbreviation, ETL. It is a crucial stage in the data preparation process, as it ensures we are working with clean and accurate data and that there is a single source of truth for every piece of information that appears in our analytical database. It consists of three stages (a minimal code sketch follows the list):

  • Extracting the data stored in various systems (flat files, cloud repositories, relational databases, etc.) into a staging area.
  • Transforming said data according to a unified schema, which could include filtering out certain fields/columns or modifying data from different tables to make sure all the data sources speak the same language.
  • Loading the newly-transformed data into a centralized data repository or data warehouse.
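To make the three stages concrete, here is a minimal, hypothetical sketch in Python using pandas and SQLAlchemy. The table name, columns and SQLite target are illustrative assumptions only, not the design from any of the projects described here.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: in a real pipeline these frames would come from flat files, APIs or
# source databases (pd.read_csv, pd.read_sql, ...); inline data keeps the sketch runnable.
orders_eu = pd.DataFrame({"order_id": [1, 2],
                          "order_dt": ["2024-01-05", "2024-01-06"],
                          "amount": ["250.0", "75.5"]})
orders_us = pd.DataFrame({"order_id": [3],
                          "order_date": ["2024-01-07"],
                          "amount": ["120.0"]})

# Transform: align both sources to one schema and fix types so they "speak the same language".
orders_eu = orders_eu.rename(columns={"order_dt": "order_date"})
combined = pd.concat([orders_eu, orders_us], ignore_index=True)
combined["amount"] = combined["amount"].astype(float)

# Load: write the unified data to a warehouse table. SQLite keeps the example
# self-contained; a real project would point the engine at Snowflake or another warehouse.
engine = create_engine("sqlite:///warehouse.db")
combined.to_sql("fact_orders", engine, if_exists="append", index=False)
```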

The problem with general-purpose programming languages is that they are typically modelled around object-oriented designs. They manipulate data element by element rather than by rows or columns, which is inefficient for this kind of work. You cannot handle heterogeneous sources without building on top of existing libraries, which can take considerable time depending on the source. Functionality is spread across different libraries and files, so a lot of integration code is needed before you can start on anything. It is also tough to manage run and recovery paths and handshakes, because everything sits at the same level of code: adding two numbers from the same row, aggregating rows, combining data sets from different sources, and then auto-run, auto-recovery, handshakes, and so on.

The biggest pain point is when the calculations change frequently. Manipulating individual data elements means loops; when the logic changes, you may have to add or remove nested loops, and joins across different sources quickly become a loop within a loop within a loop, and so on. It is also very hard to do performance tuning that makes a big impact in a short amount of time.
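A small, made-up example of the point above: the same join written element by element versus as a single set-based operation. The data and column names are invented purely for illustration.

```python
import pandas as pd

# Two tiny, invented data sets standing in for feeds from different sources.
orders = [{"order_id": 1, "customer_id": 10, "amount": 250.0},
          {"order_id": 2, "customer_id": 11, "amount": 75.5}]
customers = [{"customer_id": 10, "region": "EU"},
             {"customer_id": 11, "region": "US"}]

# Element-by-element approach: a join becomes a loop within a loop, and every
# logic change means reworking the loop structure.
joined = []
for o in orders:
    for c in customers:
        if o["customer_id"] == c["customer_id"]:
            joined.append({**o, "region": c["region"]})

# Set-based approach: the same join expressed declaratively in one line, which is
# closer to how SQL and ETL tools model the work and is far easier to change.
joined_df = pd.merge(pd.DataFrame(orders), pd.DataFrame(customers), on="customer_id")
print(joined_df)
```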

Another challenge with a programming stack arises when taking knowledge transition from an incumbent vendor or during a change of platform. You need to hire or deploy people with exactly the right skill set, whereas ETL tool skills transfer more easily: an Informatica resource can pick up Talend or Matillion with minimal hand-holding.

We had a scenario where we needed to decide on the right middleware tool – Informatica (IICS), SnapLogic, Matillion or Python/PySpark – as a replacement for the existing MS SQL mechanism handling transformation activities on the cloud platform.

 The first question is "How complex is your environment?" Complexity can be a measure of how many source systems you have feeding your data warehouse, how many "sources of truth" for the same basic piece(s) of information reside in your various source systems, how difficult it is to transform data from the format that exists in your source system to the format needed in the (data warehouse) target system, the need to have metadata in your warehouse to give the data an appropriate pedigree, and the need to develop feedback loops to clean data in your source systems as problems are identified in the load process. Generally speaking, the more complex your environment is, the more likely you are to benefit from a tool as opposed to writing your own routines.

Secondly, are your source systems constantly changing (or likely to)? If so, you may be better off using an automated tool instead of creating a maintenance nightmare for yourself. What about your target system? If a lack of experience in creating data warehouses is likely to cause you to revisit the design of your warehouse schema, automation could make life easier there as well.

Third is available resources and time. If your project is small or simple, you may be better off utilizing existing skill sets rather than taking the time to climb the learning curve of a new ETL tool. But if developer resources are scarce, you are in a time crunch, or your project is complex based on the factors mentioned above, you may be better off using an ETL tool.

Other advantages of ETL tools in data warehousing scenarios include the following; a sketch of what hand-coding just the first few involves appears after the list:

  • A set of comprehensive Scheduling Mechanisms
  • Logging, Audit and Metadata support
  • Easier maintenance, especially when multiple developers are involved
  • Robust execution control and error handling
  • Parallel processing
  • High Availability
  • Partitioning / Push-down Optimization capabilities
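To illustrate why those first few bullets matter, here is a hypothetical sketch of the minimum retry-and-audit plumbing a hand-written pipeline needs before any business logic is added. The function and step names are invented; ETL tools ship this kind of capability (plus scheduling, lineage and alerting) out of the box.

```python
import logging
import time
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def run_step(name, func, retries=3, backoff_seconds=30):
    """Run one pipeline step with retries and a simple audit trail."""
    for attempt in range(1, retries + 1):
        started = datetime.now(timezone.utc)
        try:
            rows = func()
            logging.info("step=%s attempt=%d rows=%d started=%s status=success",
                         name, attempt, rows, started.isoformat())
            return rows
        except Exception as exc:
            logging.error("step=%s attempt=%d error=%s", name, attempt, exc)
            if attempt == retries:
                raise
            time.sleep(backoff_seconds)

# Example usage with a stand-in step that "loads" 100 rows.
run_step("load_fact_orders", lambda: 100)
```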

3 Days vs 3 Minutes

One of our customers in the security sector had been using SQL for manual data mash-ups. The company covered global regions, and each region generated a massive amount of data. Joining all these sources took three people devoting almost three days every week, and all this work was done before any analysis could even be performed – just to prepare the data. The company realized it was time to implement a BI solution, and soon discovered that a single automated ETL tool could complete this union within minutes, with no additional work hours invested once the system and data model were set up. A configurable, metadata-driven design built over Snowflake, with Matillion as the ETL tool, helped cut the preparation time down to 3 minutes. It also allowed us to spin up a larger virtual warehouse to incrementally refresh the OLAP cube using the CData connector for the SSAS framework over Snowflake, reducing the refresh time from 24 hours to 3.5 hours.
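For readers unfamiliar with the term, here is a rough, hypothetical illustration of what "metadata driven" means in this context. It is not the actual customer configuration, just the general idea that source-to-target mappings live in metadata and one generic routine drives them all.

```python
# Hypothetical metadata describing source-to-target mappings. In a real project this
# would typically live in a control table or configuration files, not in code.
PIPELINES = [
    {"source": "orders_eu.csv", "target": "stg_orders",    "load_type": "incremental", "key": "order_id"},
    {"source": "orders_us.csv", "target": "stg_orders",    "load_type": "incremental", "key": "order_id"},
    {"source": "customers.csv", "target": "stg_customers", "load_type": "full",        "key": "customer_id"},
]

def load(entry):
    # A single generic loader interprets each metadata entry; adding a new source
    # becomes a one-line metadata change instead of a new hand-written routine.
    print(f"Loading {entry['source']} into {entry['target']} "
          f"({entry['load_type']} load, key={entry['key']})")

for entry in PIPELINES:
    load(entry)
```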

In another project engagement, where we had to extend the data compass configuration over Redshift, we reduced production support overheads by almost 30%.

 …And the Counter-Argument

Of course, all this is not to say that automated ETL is the way to go 100% of the time, for every possible use case and organization. Companies with large data science teams might find it easier to continue using their tried-and-tested manual processes instead of implementing a new tool, which will obviously have a learning curve and require some getting used to. Developers who are used to thinking in code might be frustrated by visual interfaces. And since every automated ETL tool has its own workflow and logic, it can seem restrictive compared to the endless possibilities one could create manually. Still, the case for visual tools holds: they are built from structured, reusable components, whereas with programming languages everything depends on the variety of flows, libraries, syntaxes, parsing mechanisms, hardware, RAM, CPU and ongoing maintenance involved.

 The Bottom Line: Time to Automate

Code “purists” will always have reservations about any tool that simplifies manual coding – but in many cases the advantages still outweigh the drawbacks. For example, C++ is definitely more powerful than JavaScript, but most would agree that in many cases the latter is preferable. I believe that in most cases where extensive data preparation is required, automated ETL will provide a cleaner, more transparent and more straightforward experience than manual coding – particularly when working with big, rapidly changing data.

As with any other problem in this field, there are multiple ways to arrive at a solution. The key is understanding the trade-offs you have to make and then determining what solution works best for your environment. The answer to that question is always based on technical issues, personnel issues, and the culture of the company involved.
