Who will win the ETL tool race?

I was catching up on Substack this morning and ran across an article, "A Zero ETL Future," that I found provocative and interesting.

ETL, or "Extract, Transform, Load" is the data engineering practice that makes data usable for analysis and reporting. The work product is a data pipeline, an automated set of processes that move and reshape data from formats that are suitable for fast business operations, to formats that enable reasonably fast business analysis.?

The problem has been that constructing data pipelines can be mind-numbingly complex, lengthy, and costly. The resulting ETL code is often inflexible in the face of business change, error-prone, and easily broken.

"Zero ETL" is a wistful concept discussed by data engineers that would allow data to flow from sources to targets with no, or minimal transformation. A few companies in the data ecosphere, including AWS, Snowflake and Databricks have been making noises around limiting the amount of ETL necessary for data analysis, generally by making it easier to query original sources of data, and by making analytical query engines more powerful. Shorter, simpler data pipelines would mean less development time, less maintenance, less cost, and faster access to data.?

Until recently, I worked at Incorta, a data management and analytics platform that has approached the data pipeline concept differently. Incorta touts its ability to analyze business application data with minimal ETL. It does this with a unique capability that pre-computes some of the most common data transformations, so data analysis is essentially done via views on the original data. The benefits of this approach are numerous, including low pipeline development and maintenance costs, the ability to drill from high-level business metrics to transaction-level detail, and improved analytical flexibility.

This experience has led me to believe that analytics against transaction-level, normalized data (the format used by most business applications) will become the norm within a year or two. New analytics engines in the works, such as Databricks' Photon or the open-source StarRocks (https://www.starrocks.io/) database, can replicate some of Incorta's attributes. These engines show that a shallow view layer on top of the original data can deliver good performance in a form flexible enough to serve different analytical uses. If this goal is achieved, the business benefits will be numerous (I've counted nine) and compelling, with business agility at the top of the heap.
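To give a flavor of that view-layer idea, here is a small sketch using DuckDB as a stand-in analytical engine. The engine choice and the schema are my own illustration, not a description of Incorta, Photon, or StarRocks internals; the point is that the analysis-friendly shape is defined as a view over the normalized tables rather than materialized by a pipeline.

```python
# Sketch: analytics as a view over normalized, transaction-level data.
# DuckDB and this schema are illustrative stand-ins only.
import duckdb

con = duckdb.connect()  # in-memory database

# Normalized "operational" tables, as a business application might store them.
con.execute("CREATE TABLE customers (customer_id INT, region VARCHAR)")
con.execute(
    "CREATE TABLE orders (order_id INT, customer_id INT, amount DECIMAL(10,2), order_date DATE)"
)
con.execute("INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC')")
con.execute(
    "INSERT INTO orders VALUES (10, 1, 120.00, DATE '2024-01-15'), (11, 2, 80.00, DATE '2024-01-20')"
)

# A shallow view layer instead of a copied-and-transformed dataset.
con.execute("""
    CREATE VIEW sales_by_region AS
    SELECT c.region,
           date_trunc('month', o.order_date) AS month,
           SUM(o.amount)                     AS total_sales
    FROM orders o
    JOIN customers c USING (customer_id)
    GROUP BY c.region, date_trunc('month', o.order_date)
""")

# Analysts query the view; joins and aggregation happen at read time, so the
# underlying transaction rows stay available for drill-down.
print(con.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
```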

However, some amount of data pipelining will always be necessary. As the author of that article points out, truly eliminating big-'T' transformations would require application vendors to make their data available in a more consumable format. The author also points out the need to maintain historical data, as business applications are generally weak at capturing trends over time.

But one other necessary element that the author does not mention is semantics: metadata, or data about the data. Data in its raw form is inaccessible to non-specialists. Even Oracle E-Business Suite, with its relatively modern relational structures, is breathtakingly complex and vast, requiring expert knowledge to know where to tap in and how to interpret what you find. And Oracle EBS is one of the best; sit an unmedicated data engineer in front of an SAP R/3 database for the first time and they will very likely run screaming from the building within a few hours.

To make data usable by even the data literati (analysts, data scientists), they need answers to questions like: "What data is available that is relevant to me?", "What is the nature of this data?", and "What other data should I be paying attention to?" They also need to understand the source, timing, veracity, and reliability (quality) of the data.
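As a rough sketch of what answering those questions takes, here is a hypothetical metadata record for a single dataset. The fields are my own illustration of the kind of "data about the data" involved; they do not come from any particular catalog product.

```python
# Hypothetical shape of a catalog entry answering "what is this data, and can I trust it?"
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    name: str                                  # e.g. "oracle_ebs.ap_invoices_all"
    description: str                           # plain-language meaning of the dataset
    source_system: str                         # where the data originates
    refreshed_at: datetime                     # timing: how current is it?
    owner: str                                 # who to ask about it
    quality_score: float                       # veracity/reliability, 0.0 to 1.0
    related_datasets: list[str] = field(default_factory=list)  # what else to look at

invoices = DatasetMetadata(
    name="oracle_ebs.ap_invoices_all",
    description="Accounts payable invoice headers",
    source_system="Oracle E-Business Suite",
    refreshed_at=datetime(2024, 1, 31, 6, 0),
    owner="finance-data@example.com",
    quality_score=0.92,
    related_datasets=["oracle_ebs.ap_invoice_lines_all"],
)
```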

So, in addition to easier access to data sources, better tools for (visually) constructing and operating data pipelines, and better analytics engines capable of analyzing raw data, we need better data about the data, integrated directly into those tools.

There are data cataloging products that can provide a view of selected data assets and give users information about the structure and usefulness of the data. But these tools generally: a) do not cover all the data used, from raw sources to a business metric on a dashboard, and b) are not available inside the tools, where those who construct or consume data pipelines need that information most.

We are missing a widely adopted standard for metadata that would allow data catalogs and data pipeline tools to interoperate. I think this is needed because true "No ETL" data pipelines (or at least, the leaner pipelines we will still need to build) will only come about when we can start to reuse and build up a library of semantic data. The terrifying prospect of decoding an SAP database should only need to be experienced once.

An industry-adopted standard for metadata exchange could get us a long way towards a "Zero ETL" future, with pre-built pipelines eventually coalescing into sharable marketplaces of domain-specific use cases. A number of ETL tool vendors are creating visual low-code or no-code tools that simplify data pipeline building, and this would be an ideal integration point for creating and consuming rich metadata, or "data about the data."
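To make that integration point tangible, here is a hypothetical payload that a low-code pipeline tool might publish and a data catalog might consume. The structure and field names are invented for this sketch and do not describe any existing standard or product.

```python
# Hypothetical metadata-exchange payload; the format is illustrative only.
import json

pipeline_metadata = {
    "pipeline": "ebs_ap_to_warehouse",
    "inputs": [
        {"dataset": "oracle_ebs.ap_invoices_all", "semantics": "AP invoice headers"},
    ],
    "outputs": [
        {"dataset": "warehouse.fct_ap_invoices", "grain": "one row per invoice"},
    ],
    "transformations": [
        {"step": "join_suppliers", "description": "Resolve supplier names and sites"},
    ],
    "quality_checks": [
        {"check": "row_count_match", "status": "passed"},
    ],
}

# A serialized form that catalogs, pipeline designers, and BI tools could all read.
print(json.dumps(pipeline_metadata, indent=2))
```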

I think the ETL tool vendor that integrates visual design, low-code, metadata, and observability will be a winner. What do you think?
