Dawn of HTAP databases will spell the end for ETL and Data Warehouses
HTAP. What is HTAP? The term stands for Hybrid Transactional/Analytical Processing. The key is the "H" for hybrid: we have always had OLTP databases (Online Transactional Processing) and OLAP databases (Online Analytical Processing).
ETL. The split between OLTP and OLAP created an entire industry around transporting data from one to the other: data usually originates in OLTP systems and is then shipped to OLAP systems for queries and analytics. OLTP and OLAP each had their dedicated role, and ETL (Extract, Transform, Load) bridged the gap. But it was never that simple. ETL is one of the worst ideas in data processing: it is inflexible, highly complex, and the field is awash with vendors pushing confusing (mostly marketing) terminology, ETL, ELT and every other rearrangement of the letters, and no matter how you rearrange them it is still one of the worst processes. First it hammers the source databases to "extract" the data, then it batch-loads that data into DSS databases (Redshift, Snowflake, etc.), and no matter what you do the reports are delayed and batched. The situation gets a lot worse if you have frequent schema changes. You get the picture: entire careers were dedicated to ETL. Not only was ETL one of the worst implementations, I would argue that its wide proliferation and monetization has, to a large extent, stifled the development of HTAP.
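To make the "extract, then batch-load, then wait" cycle concrete, here is a minimal sketch of the classic nightly batch job. The hostnames, credentials, and the `orders` table are hypothetical placeholders, and the warehouse load is only hinted at; this is an illustration of the pattern, not anyone's production pipeline.

```python
# A minimal sketch of the classic nightly batch ETL described above.
# Hostnames, credentials, and table names are hypothetical placeholders.
import csv
import mysql.connector  # pip install mysql-connector-python


def extract_orders(batch_date):
    """Hammer the OLTP source with a scan of yesterday's rows (the "E")."""
    conn = mysql.connector.connect(
        host="oltp.example.internal", user="etl", password="secret", database="shop"
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT id, customer_id, total, created_at "
        "FROM orders WHERE DATE(created_at) = %s",
        (batch_date,),
    )
    rows = cur.fetchall()
    conn.close()
    return rows


def stage_to_file(rows, path):
    """Dump the extract to a flat file for a later bulk load into the warehouse (the "L")."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    rows = extract_orders("2019-06-01")
    stage_to_file(rows, "/tmp/orders_2019-06-01.csv")
    # A separate batch job then loads the file into Redshift/Snowflake,
    # so analysts only ever see yesterday's data.
```

Notice how the delay is built in by design: the report can never be fresher than the last completed batch.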
Streaming data is still ETL. The invention of data pipelines and data streaming, along with a new breed of messaging services (Kafka, Pulsar), substantially improved the dreaded ETL, but there is still an "E": even reading data from binlogs (CDC) is still extraction, and there is still an "L" for loading; the "T", transformation, simply moved into the data warehouse itself. Better than ETL, but not a lot better, and even more complex: to build a data pipeline you need CDC experience, Kafka or Pulsar experience, containers, connectors/adapters, and of course Kubernetes (if you are planning ahead) to manage the entire thing. You will need a fairly large team of experts with varying skills to power it all up. Not very trivial; at least old ETL stayed in the realm of data professionals. Data streaming is better ETL, if you can locate and then afford to hire the experts. The limitation is still the same old "L": it takes time to load your precious data, and you have to wait for it before you can view it, so it is not real time.
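Here is a minimal sketch of that streaming hop to show where the "E" and the "L" still hide. The topic name, broker address, Debezium-style event envelope, and the sink function are all hypothetical; the point is only that the change events are still extracted from the source and still loaded somewhere else.

```python
# A minimal sketch of a streaming "ETL" hop: even with CDC the data is still
# extracted (from the binlog, e.g. by a Debezium-style connector) and still
# loaded into a separate analytical store.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "shop.orders.cdc",                          # hypothetical CDC topic
    bootstrap_servers=["kafka.example.internal:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)


def load_into_warehouse(change_event):
    # The "L" step still exists: buffer and write the changed row into the warehouse.
    # The exact event envelope ("after", "payload", ...) depends on the connector config.
    print("loading", change_event.get("after"))


for message in consumer:
    load_into_warehouse(message.value)          # data arrives later than it was written
```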
HTAP. "H" for Hybrid. HTAP will do it all - its a transactional and analytics database, it will store the data in high velocity and you can query that same data for analytics right away no ETL!
How does it work? Is it a miracle? No, not a miracle, just smart, forward-thinking design plus the development and proliferation of distributed computing. I would say Hadoop laid the foundation for distributed computing with HDFS and MapReduce. Once distributed computing started gaining acceptance, everyone joined in. The architecture was designed for nearly linear scalability: need more power to crunch data, add more nodes. Another awesome aspect of this distributed architecture is the detachment of storage from compute services, so each can be scaled individually. This is one of the reasons Snowflake will demolish Redshift, since separating storage from compute delivers nearly unlimited scalability. It is the same reason Apache Pulsar will dethrone Kafka: Pulsar operates on a distributed storage platform (BookKeeper), Kafka does not, so it will hit the scalability wall sooner or later, and there is little point in improving it since Pulsar is already here.
HTAP in practice. Enter TiDB https://github.com/pingcap/tidb. There are a few HTAP databases in the works, but I selected TiDB for a reason: it is built from the beginning to win, and it is elegantly simple. Just as in athletics there is a talent for doing things, the same applies to technology, and looking at TiDB it already has all the winning qualities:
- Apache 2.0 license, while others are busy building complex crippleware with limitations that will stifle the product's adoption and, most importantly, turn off the army of community volunteers, myself included. Mixing open source and "Enterprise" is akin to dissolving motor oil in water: the process is messy and what you get in the end is completely useless, neither open source nor "Enterprise".
- TiDB is simply built and architected for wide adoption; there are no "Enterprise" features. I have already learned what "open source plus Enterprise features" means: you get to use a watered-down version that is missing critical components, yet they still want your free labor. No thank you. CockroachDB comes to mind, but sadly it chose the "open source" plus "Enterprise" path and is already loaded to lose.
- Architecture. Simply superior: distributed storage on the RocksDB engine. Most importantly, the TiDB architecture is designed for limitless scalability. The awesome and really smart part is that TiDB uses existing components in the right places. Why write an entire storage engine from scratch? Just use RocksDB and TiKV.
- MySQL compatibility. MySQL is the most widely used database today. If you are even remotely considering bringing a new database to market, it absolutely must do the following (a short sketch after this list shows what that compatibility looks like in practice):
- Integrate with MySQL: replicate data to and from MySQL. TiDB has it.
- Import data to and from MySQL. TiDB already has it.
- MySQL-compatible syntax. TiDB already has it.
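Here is a minimal sketch of what that compatibility buys you, under the assumption of two hypothetical endpoints and a hypothetical `shop.orders` table: the same driver and the same SQL run against either database, and only the host and port change.

```python
# A minimal sketch of MySQL compatibility in practice: the same driver and the
# same SQL run against MySQL and TiDB; only the connection details change.
# Hostnames, credentials, and the table are hypothetical placeholders.
import mysql.connector  # pip install mysql-connector-python


def row_count(host, port):
    conn = mysql.connector.connect(
        host=host, port=port, user="app", password="secret", database="shop"
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM orders")      # plain MySQL syntax
    (count,) = cur.fetchone()
    conn.close()
    return count


print("mysql:", row_count("mysql.example.internal", 3306))
print("tidb: ", row_count("tidb.example.internal", 4000))   # TiDB's default port
```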
Last but not least, the TiDB SQL layer is stateless and works well in Kubernetes. Any distributed database that does not play well with container orchestration is not very useful: who is going to keep track of all the nodes, monitor them, and deliver HA? It must work at least with Kubernetes.
HTAP. There is more, and this is where HTAP comes in: Spark sits on top of TiDB's storage layer, TiKV, and connects to it via TiSpark. Simply genius. Why build separate databases for OLTP and OLAP? We already have distributed storage; just add the Spark connector and feel free to drill down into the data. NO ETL!
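A minimal PySpark sketch of the idea follows. The PD address, database, and table are hypothetical, and the exact TiSpark configuration keys and catalog layout should be checked against the TiSpark documentation for your version; this only illustrates the shape of an analytical query running directly over the operational storage.

```python
# A minimal sketch of the TiSpark idea: Spark reads TiKV directly, so the
# analytical query runs over the same storage that serves OLTP traffic.
# PD address, database, and table are hypothetical; verify the TiSpark
# configuration keys against the docs for your TiSpark version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tispark-sketch")
    .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
    .config("spark.tispark.pd.addresses", "pd.example.internal:2379")
    .getOrCreate()
)

# Drill straight into the operational data -- no extract, no load.
top_customers = spark.sql("""
    SELECT customer_id, SUM(total) AS revenue
    FROM shop.orders
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""")
top_customers.show()
```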
MySQL has been the absolute ruler of data storage for the Internet of Things, but its reign is coming to an end in the near future: B-trees, not distributed, very hard to change schemas, and arcane scalability (separate writers and readers).
MySQL badly needed a successor. I believe TiDB is ready to pick up the torch: it can do everything MySQL does today, only better, and it is HTAP.
Join our incubating LA and San Francisco TiDB meetups and let's build a better future together!
https://www.meetup.com/Los-Angeles-TiDB-Meetup
https://www.meetup.com/San-Francisco-TiDB-Meetup
Thank you for reading!