Timeplus as a great embodiment of "Turning the database inside out"

I rewatched Martin Kleppmann's "Turning the database inside out with Apache Samza" talk, and my thesis is that Timeplus is a great implementation of that paradigm, making it simpler and more accessible than the other approaches tried over the past decade. Thank you, Martin, for teaching us this way.

It's been 10 years since that talk, and people have tried to do this with all sorts of combinations of distributed streaming platforms and stream processors in conjunction with traditional databases or data warehouse systems. It's not trivial to learn how to do this, deploy all of it, and still reason about it in a simple way. I still trip over a lot of it.

But let's start with how Timeplus does it and see whether you agree with me or not:

Streams Everywhere

In Timeplus, the fundamental abstraction used everywhere is a Stream. You create a stream (natively) and start inserting data into it. As in Kafka, it's a log of ordered or partially ordered facts. There are different types of streams (AppendOnly, Changelog KV, VersionedKV, and an upcoming "Mutable KV Stream"). Inside every stream is a Write-Ahead Log (WAL) with configurable retention (sort of like a mini-Kafka) and a historical store that is kept up to date asynchronously. More on why later.
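To make the "log plus historical store" idea concrete, here is a minimal Python sketch. It is a toy model of the concept only, not Timeplus code: the class name `ToyStream`, the `wal_retention` parameter, and the count-based retention rule are all assumptions made for illustration, and the historical store is updated synchronously here to keep the example short.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Tuple


@dataclass
class ToyStream:
    """Toy model of a stream: a bounded recent log plus a full historical store."""

    wal_retention: int = 3  # hypothetical: keep only the newest N events in the log
    wal: Deque[Tuple[int, Any]] = field(default_factory=deque)
    historical: List[Tuple[int, Any]] = field(default_factory=list)
    next_offset: int = 0

    def append(self, event: Any) -> int:
        """Append an event and return the offset it was assigned."""
        offset = self.next_offset
        self.next_offset += 1
        self.wal.append((offset, event))
        # A real system would update the historical store asynchronously;
        # this sketch does it inline for simplicity.
        self.historical.append((offset, event))
        # Enforce retention on the log only; history keeps everything.
        while len(self.wal) > self.wal_retention:
            self.wal.popleft()
        return offset
```

The point of the split is that recent events stay cheap to tail from the log, while older events remain queryable from the historical store after the log has dropped them.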

Fully precomputed caches

Timeplus adds "Views" and "Materialized Views" on top. These are not tables as in a database; they are continuously derived streams built off the back of a "normal stream", if you will. They probably should have been called "Derived Streams" and "Materialized Derived Streams". Naming is hard.

You can subscribe to and consume these streams in a streaming way, just the same, using the provided APIs (HTTP, WebSocket, native SDKs). While consuming these streams, you can also run an ad-hoc SQL query and get the stream of changes that meet its criteria.
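As a rough illustration of a materialized view behaving like a derived stream, the Python sketch below keeps a running total per key and pushes every change to subscribers. Nothing here is a Timeplus API: `ToyMaterializedView`, `subscribe`, and `on_event` are hypothetical names, and the running sum stands in for whatever query defines the view.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Change = Tuple[str, int]  # (key, new running total)


class ToyMaterializedView:
    """Toy derived stream: maintains a running sum per key and emits each change."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = defaultdict(int)
        self.subscribers: List[Callable[[Change], None]] = []

    def subscribe(self, callback: Callable[[Change], None]) -> None:
        """Register a consumer that receives every change to the view."""
        self.subscribers.append(callback)

    def on_event(self, key: str, amount: int) -> None:
        """Apply one source event, update state, and notify subscribers."""
        self.totals[key] += amount
        change = (key, self.totals[key])
        for callback in self.subscribers:
            callback(change)
```

A consumer that calls `subscribe` sees a stream of changes rather than a static table, which is the sense in which the view is itself a stream.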

Clients Subscribe to MV Changes

We really do mean Streams Everywhere! You can stream directly from any of the above streams to your UI via client SDKs for Java, Go, and Python, or over HTTP/WebSocket.

Caches/Tables

So where are the caches and tables that you can query in a request-response style to get a single set of results?

Timeplus does not make you create another table off the back of a stream. It gives you a table(stream_name) function, and if you run the same ad-hoc SQL against it, you get the results as of that point in time. We call this a "historical query". It may be for ad-hoc exploration by an analyst, or for an application that just wants to retrieve some cached data from a materialized view. It serves both humans and software! Remember the historical store paired with every WAL? When you run a table() query, it runs against that store and returns results at lightning speed.
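The difference between the two query styles can be sketched in a few lines of Python. This is a conceptual analogy only, not how Timeplus executes queries: `historical_query` and `streaming_query` are hypothetical helper names, a plain list stands in for the historical store, and an iterator stands in for the live stream.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Event = Dict[str, Any]


def historical_query(store: List[Event], predicate: Callable[[Event], bool]) -> List[Event]:
    """Request-response style: scan what is stored now and return one result set."""
    return [event for event in store if predicate(event)]


def streaming_query(source: Iterable[Event], predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Streaming style: yield each matching event as it arrives from the source."""
    for event in source:
        if predicate(event):
            yield event
```

The same predicate works in both modes; only the shape of the answer changes, from a finished list to an open-ended sequence of matches.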

Better Data

I'm paraphrasing Martin here but you get all the following benefits:

Doing it this way decouples writing from reading. That's good for analytics, because you can run certain kinds of queries on a materialized view optimized for that use case while leaving the originally written stream in place. You can Write once, Read from Many different Views. Views are just computations, so they don't take up space. Materialized views actually realize the computation, so they take up space but can be more performant. You can also do historical point-in-time queries (including some really advanced analytics like time travel with AS OF joins).
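For readers who have not met an as-of join before, here is a minimal Python sketch of the general technique: each left-hand row is matched with the most recent right-hand row at or before its timestamp. The function name `asof_join` and the tuple layout are assumptions for illustration, and this says nothing about how Timeplus implements the join internally.

```python
from bisect import bisect_right
from typing import Any, List, Optional, Tuple

Row = Tuple[int, Any]  # (timestamp, value)


def asof_join(left: List[Row], right: List[Row]) -> List[Tuple[int, Any, Optional[Any]]]:
    """Attach to each left row the latest right value with right_ts <= left_ts.

    `right` must already be sorted by timestamp. Rows with no earlier
    match get None.
    """
    right_timestamps = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        # Index of the first right row strictly later than ts.
        i = bisect_right(right_timestamps, ts)
        match = right[i - 1][1] if i > 0 else None
        joined.append((ts, value, match))
    return joined
```

A typical use is pairing each trade with the quote that was current when the trade happened, which is a point-in-time question rather than an exact-key match.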

What about all those things Martin talked about that databases do that we'd like?

Replication

Timeplus can run as a single binary with all the above functionality built in (try it: the file is under 300 MB). Timeplus Enterprise is a distributed system with the usual sharding and replication, using Multi-Raft as the underlying consensus mechanism. Because everything is a stream, it's easy to replicate streams and their derivatives.

Secondary Indexes

This is sort of already there if you create materialized views with a different primary key. You can also query by ANY field, though that might require some scanning. It is generally still very fast.

But we went even further and will soon release a new kind of KV Stream with column families and more advanced secondary indices. You heard it here first. :)
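The "materialized view with a different primary key" idea amounts to keeping a second copy of the data organized by another field. The Python sketch below shows that re-keying step in isolation; `rekey` is a hypothetical helper, and a dictionary stands in for the view's storage.

```python
from typing import Any, Dict, Hashable, Iterable, List

Event = Dict[str, Any]


def rekey(events: Iterable[Event], key_field: str) -> Dict[Hashable, List[Event]]:
    """Group events by `key_field` so lookups on that field avoid a full scan."""
    view: Dict[Hashable, List[Event]] = {}
    for event in events:
        view.setdefault(event[key_field], []).append(event)
    return view
```

Looking up `rekey(orders, "customer")["x"]` is then a direct key access, whereas filtering the original list by customer would touch every row.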

Caching

Caching as implemented by the continuously updated materialized views above is a great advantage over application-run caches, which have to be kept up to date and invalidated consistently. If you cold-start by creating a new materialized view, you can tell it to read from the historical stream and it will be fully up to date. Or not, your choice. Either way, Timeplus takes care of keeping everything up to date so you can get on with your life.
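The cold-start choice can be pictured as "replay history first, then keep applying live events". The Python sketch below models that with a simple per-key counter; the class name `RunningCount` and its `history` argument are hypothetical, and this is a conceptual model only, not the Timeplus mechanism.

```python
from typing import Dict, Iterable


class RunningCount:
    """Toy cache: counts events per key, optionally backfilled from history."""

    def __init__(self, history: Iterable[str] = ()) -> None:
        self.counts: Dict[str, int] = {}
        # Cold start: replay past events so the cache begins fully up to date.
        for key in history:
            self.apply(key)

    def apply(self, key: str) -> None:
        """Apply one event; used for both backfill and live updates."""
        self.counts[key] = self.counts.get(key, 0) + 1
```

Because backfill and live updates share the same `apply` path, there is no separate invalidation logic to get wrong; skipping the `history` argument gives the "or not" option of starting from now.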

Bonus thing here:

Would you like to run some arbitrary business logic on the data in your views and materialized views? Timeplus has SQL functions and an embedded V8 engine to run User Defined Functions (UDFs) and User Defined Aggregation Functions (UDAFs) written in JavaScript. Native Python support is coming soon too, for all of you who want to reuse your Python libraries.

Conclusion

So that was a lot to take in, but if you've been trying to solve these problems in a disaggregated way with #Kafka, #ksqlDB, #Flink, #Spark, and whatever other components, you'll realize how quickly the architecture gets "big". If you happen to have Kafka, we can treat Kafka topics as "External Streams": leave your data there, but still query it and build views and materialized views just the same. For materialized views outside Timeplus, we support #ClickHouse and Kafka as sinks, so you can ship your data out and into your existing pipelines if that is what you choose.

For everyone else, Timeplus moves a lot more to the small and lets you build out to the bigger integrations from there. Start up a single binary and spin up Streams, Views, and Materialized Views with just SQL. I probably should have said that at the top of the article. Hook up your web apps or your BI tools just like that.

Simple is beautiful (at least for users). If you don't believe me, check it out. I'm here to answer questions.
