Dawn of HTAP databases will spell the end for ETL and Data Warehouses
HTAP. What is HTAP? The term stands for Hybrid Transactional/Analytical Processing. The key is the "H" for hybrid: we have always had OLTP databases (Online Transactional Processing) and OLAP databases (Online Analytical Processing).
ETL. The split between OLTP and OLAP created an entire industry around transporting data from one to the other: data usually originates in OLTP systems and is then shipped to OLAP systems for queries and analytics. OLTP and OLAP each had their dedicated role, and ETL (Extract, Transform, Load) bridged the gap. But it was never that simple. ETL is one of the worst ideas in data processing: it is inflexible, highly complex, and the field is awash with vendors pushing confusing (mostly marketing) terminology, ETL, ELT and every other rearrangement of the letters, and no matter how you rearrange them it is still one of the worst processes. First it hammers the source databases to "extract" the data, then it batch-loads that data into DSS databases (Redshift, Snowflake, etc.), and no matter what you do the reports are delayed and batched. The situation gets a lot worse if you have frequent schema changes. You get the picture: entire careers were dedicated to ETL. Not only was ETL one of the worst implementations, I would argue that its wide proliferation and monetization has, to a large extent, stifled the development of HTAP.
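To make the "extract, then batch-load, then wait" cycle concrete, here is a minimal sketch of the classic nightly batch job. The hostnames, credentials, and the `orders` table are hypothetical placeholders, and the warehouse load is only hinted at; this is an illustration of the pattern, not anyone's production pipeline.

```python
# A minimal sketch of the classic nightly batch ETL described above.
# Hostnames, credentials, and table names are hypothetical placeholders.
import csv
import mysql.connector  # pip install mysql-connector-python


def extract_orders(batch_date):
    """Hammer the OLTP source with a scan of yesterday's rows (the "E")."""
    conn = mysql.connector.connect(
        host="oltp.example.internal", user="etl", password="secret", database="shop"
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT id, customer_id, total, created_at "
        "FROM orders WHERE DATE(created_at) = %s",
        (batch_date,),
    )
    rows = cur.fetchall()
    conn.close()
    return rows


def stage_to_file(rows, path):
    """Dump the extract to a flat file for a later bulk load into the warehouse (the "L")."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    rows = extract_orders("2019-06-01")
    stage_to_file(rows, "/tmp/orders_2019-06-01.csv")
    # A separate batch job then loads the file into Redshift/Snowflake,
    # so analysts only ever see yesterday's data.
```

Notice how the delay is built in by design: the report can never be fresher than the last completed batch.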
Streaming data is still ETL. The invention of data pipelines and data streaming, along with a new breed of messaging services (Kafka, Pulsar), substantially improved the dreaded ETL, but there is still an "E": even reading data from binlogs (CDC) is still extraction, and there is still an "L" for loading; the "T", transformation, simply moved into the data warehouse itself. Better than ETL, but not a lot better, and even more complex: to build a data pipeline you need CDC experience, Kafka or Pulsar experience, containers, connectors/adapters, and of course Kubernetes (if you are planning ahead) to manage the entire thing. You will need a fairly large team of experts with varying skills to power it all up. Not very trivial; at least old ETL stayed in the realm of data professionals. Data streaming is better ETL, if you can locate and then afford to hire the experts. The limitation is still the same old "L": it takes time to load your precious data, and you have to wait for it before you can view it, so it is not real time.
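Here is a minimal sketch of that streaming hop to show where the "E" and the "L" still hide. The topic name, broker address, Debezium-style event envelope, and the sink function are all hypothetical; the point is only that the change events are still extracted from the source and still loaded somewhere else.

```python
# A minimal sketch of a streaming "ETL" hop: even with CDC the data is still
# extracted (from the binlog, e.g. by a Debezium-style connector) and still
# loaded into a separate analytical store.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "shop.orders.cdc",                          # hypothetical CDC topic
    bootstrap_servers=["kafka.example.internal:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)


def load_into_warehouse(change_event):
    # The "L" step still exists: buffer and write the changed row into the warehouse.
    # The exact event envelope ("after", "payload", ...) depends on the connector config.
    print("loading", change_event.get("after"))


for message in consumer:
    load_into_warehouse(message.value)          # data arrives later than it was written
```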
HTAP. "H" for Hybrid. HTAP will do it all - its a transactional and analytics database, it will store the data in high velocity and you can query that same data for analytics right away no ETL!
How does it work? Is it a miracle? No, not a miracle, just smart, forward-thinking design plus the development and proliferation of distributed computing. I would say Hadoop laid the foundation for distributed computing with HDFS and MapReduce. Once distributed computing started gaining acceptance, everyone joined in. The architecture was designed for nearly linear scalability: need more power to crunch data, add more nodes. Another awesome aspect of this distributed architecture is the detachment of storage from compute services, so each can be scaled individually. This is one of the reasons Snowflake will demolish Redshift, since separating storage from compute delivers nearly unlimited scalability. It is the same reason Apache Pulsar will dethrone Kafka: Pulsar operates on a distributed storage platform (BookKeeper), Kafka does not, so it will hit the scalability wall sooner or later, and there is little point in improving it since Pulsar is already here.
HTAP in practice. Enter TiDB https://github.com/pingcap/tidb. There are a few HTAP databases in the works, but I selected TiDB for a reason: it is built from the beginning to win, and it is elegantly simple. Just as in athletics there is a talent for doing things, the same applies to technology, and looking at TiDB it already has all the winning qualities:
- Apache 2.0 license, while others are busy building complex crippleware with limitations that will stifle the product's adoption and, most importantly, turn off the army of community volunteers, myself included. Mixing open source and "Enterprise" is akin to dissolving motor oil in water: the process is messy and what you get in the end is completely useless, neither open source nor "Enterprise".
- TiDB is simply built and architected for wide adoption; there are no "Enterprise" features. I have already learned what "open source plus Enterprise features" means: you get to use a watered-down version that is missing critical components, yet they still want your free labor. No thank you. CockroachDB comes to mind, but sadly it chose the "open source" plus "Enterprise" path and is already loaded to lose.
- Architecture. Simply superior: distributed storage on the RocksDB engine. Most importantly, the TiDB architecture is designed for limitless scalability. The awesome and really smart part is that TiDB uses existing components in the right places. Why write an entire storage engine from scratch? Just use RocksDB and TiKV.
- MySQL compatibility. MySQL is the most widely used database today. If you are even remotely considering bringing a new database to market, it absolutely must do the following (a short sketch after this list shows what that compatibility looks like in practice):
- Integrate with MySQL: replicate data to and from MySQL. TiDB has it.
- Import data to and from MySQL. TiDB already has it.
- MySQL-compatible syntax. TiDB already has it.
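Here is a minimal sketch of what that compatibility buys you, under the assumption of two hypothetical endpoints and a hypothetical `shop.orders` table: the same driver and the same SQL run against either database, and only the host and port change.

```python
# A minimal sketch of MySQL compatibility in practice: the same driver and the
# same SQL run against MySQL and TiDB; only the connection details change.
# Hostnames, credentials, and the table are hypothetical placeholders.
import mysql.connector  # pip install mysql-connector-python


def row_count(host, port):
    conn = mysql.connector.connect(
        host=host, port=port, user="app", password="secret", database="shop"
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM orders")      # plain MySQL syntax
    (count,) = cur.fetchone()
    conn.close()
    return count


print("mysql:", row_count("mysql.example.internal", 3306))
print("tidb: ", row_count("tidb.example.internal", 4000))   # TiDB's default port
```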
Last but not least, the TiDB SQL layer is stateless and works well in Kubernetes. Any distributed database that does not play well with container orchestration is not very useful: who is going to keep track of all the nodes, monitor them, and deliver HA? It must work at least with Kubernetes.
HTAP. There is more, and this is where HTAP comes in: Spark sits on top of TiDB's storage layer, TiKV, and connects to it via TiSpark. Simply genius. Why build separate databases for OLTP and OLAP? We already have distributed storage; just add the Spark connector and feel free to drill down into the data. NO ETL!
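A minimal PySpark sketch of the idea follows. The PD address, database, and table are hypothetical, and the exact TiSpark configuration keys and catalog layout should be checked against the TiSpark documentation for your version; this only illustrates the shape of an analytical query running directly over the operational storage.

```python
# A minimal sketch of the TiSpark idea: Spark reads TiKV directly, so the
# analytical query runs over the same storage that serves OLTP traffic.
# PD address, database, and table are hypothetical; verify the TiSpark
# configuration keys against the docs for your TiSpark version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tispark-sketch")
    .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
    .config("spark.tispark.pd.addresses", "pd.example.internal:2379")
    .getOrCreate()
)

# Drill straight into the operational data -- no extract, no load.
top_customers = spark.sql("""
    SELECT customer_id, SUM(total) AS revenue
    FROM shop.orders
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""")
top_customers.show()
```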
MySQL has been the absolute ruler of data storage for the Internet of Things, but its reign is coming to an end in the near future: B-trees, not distributed, very hard to change schemas, and arcane scalability (separate writers and readers).
MySQL badly needed a successor. I believe TiDB is ready to pick up the torch: it can do everything MySQL does today, only better, and it is HTAP.
Join our incubating LA and San Francisco TiDB meetups and let's build a better future together!
https://www.meetup.com/Los-Angeles-TiDB-Meetup
https://www.meetup.com/San-Francisco-TiDB-Meetup
Thank you for reading!