NewSQL: is the code back in the DB?
Picture taken from Bryn Llewellyn's presentation at PostgresWorld


TL;DR: Distributed databases can scale with the same elasticity as application servers, and this allows us to colocate code and data again. Especially when the NewSQL database has the full power of SQL and PL/pgSQL, like YugabyteDB.

The idea of NewSQL databases is to scale out a relational database with the same elasticity as NoSQL databases (distributed across multiple active nodes), but without the burden of coding transaction management and access paths in the application. NewSQL databases are ACID and allow joins, foreign keys, strongly consistent secondary indexes... This started with Google Cloud Spanner, which added a few SQL features on top of a distributed storage layer. And because the need reached many applications and enterprises, CockroachDB came with more SQL features, protocol compatibility with PostgreSQL, and the ability to run on any platform. This clearly exposed the need for relational features and SQL in enterprise applications. Of course, this is not new, but it had been forgotten with NoSQL because, at that time, scaling out was not compatible with strong consistency.

Now that the future is SQL on distributed databases, Yugabyte has built an open-source database where, on top of the ACID distributed document store (DocDB), the SQL layer has all the power of PostgreSQL. This goes beyond simple SQL DDL and DML. All modern SQL is there: triggers, procedures, user-defined functions, ready-to-use XML and JSON transformations...
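To make "all modern SQL" concrete, here is a minimal sketch (the table, column, and function names are invented for illustration): the same trigger and JSON functions you would write for vanilla PostgreSQL run unchanged on this query layer.

```sql
-- Hypothetical table, for illustration only
CREATE TABLE orders (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id bigint NOT NULL,
  amount      numeric NOT NULL,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

-- A PL/pgSQL trigger function keeping updated_at current on every change
CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_touch
  BEFORE UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION touch_updated_at();

-- A ready-to-use JSON transformation, with no application code involved
SELECT jsonb_build_object('id', id, 'amount', amount) FROM orders;
```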

But... in the past years, the trend was to put less processing in the database, use it only as a bit bucket to store data, and put all the processing in the application. So you may wonder whether all those SQL features and stored procedures will actually be used.

I'm writing this article after reading Gunnar Morling's question on Twitter (https://twitter.com/gunnarmorling/status/1416499675683856387?s=20): should SELECT query results be transformed to JSON on the database side, or returned as relational result sets to be transformed by the application? As Lukas Eder mentioned in one answer, the overhead of doing it on the application side is additional round-trips (multiple queries) and more data transfer (repeated column values). On the other side, the CPU usage of database servers is always critical, and people prefer to offload most of the processing to the application servers, which can run on multiple compute nodes. But... when you can scale out the database like the application servers... things change, right?
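As a sketch of the database-side option (the customers/orders schema is made up for illustration), one query and one round-trip produce the complete JSON document, instead of one query per parent row plus one per child list:

```sql
-- One round-trip: the database joins and aggregates a customer with
-- its orders into a single JSON document. Doing this in the application
-- would mean extra queries (N+1 round-trips) and repeated column values
-- transferred over the network.
SELECT jsonb_build_object(
         'customer', c.name,
         'orders',   jsonb_agg(
                       jsonb_build_object('id', o.id, 'amount', o.amount)
                     )
       ) AS document
FROM customers c
JOIN orders    o ON o.customer_id = c.id
WHERE c.id = 42
GROUP BY c.id, c.name;
```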

I've heard a lot of "reasons" not to use stored procedures and to limit database calls to simple DML without too much processing, especially with Oracle. Let's look at the major motivations.

  • People think that PL/SQL is dead since Oracle supports Java stored procedures. This is completely wrong when you look at the new features in recent versions, like native compilation, continuous deployment, fine-grained dependency tracking... Just rumors.
  • On the other side, there are fewer and fewer PL/SQL developers on the market, and juniors don't really want to go there. Here is an understandable reason I've heard: "if we do it in PL/SQL, only 3 people in the team can maintain it, but with Java or Python we have armies of developers available". This is valid with Oracle, but not with PostgreSQL where, thanks to extensions, you can code your procedures in any language.
  • People think that using "advanced" SQL, like JSON functions, will lock them in with one database vendor, so that they cannot move to another. First, very few applications are ever moved to another database. And if they are, rewriting for a different syntax is not so difficult; there are tools for that, as long as the target database is not too limited in SQL features. Anyway, PostgreSQL is very popular and follows the SQL standard closely. You have a wide choice, from commercial (like AWS Aurora) to fully open-source (like YugabyteDB) databases, offering a different storage engine but the same API, where you don't have to change your code.
  • The cost of compute. Of course, this makes sense with Oracle, where you pay software licenses per processor: you don't want to run anything there that could run on another node. However, people often forget that offloading to the application often requires more network round-trips between the application and the database, many context switches, and, finally, more CPU consumption on both sides. With open-source databases you don't have this per-processor license constraint, but the compute instances for a database may still be more expensive than the ones for the application. Still, what you do in one database call will always outperform offloading that requires multiple round-trips.
  • Still on CPU, and RAM: many applications have to scale out when the business grows. Of course, you can run a lot on one server, and I've always recommended some tuning before running into hardware scaling. And calling stored procedures is a way to optimize (see this example with pgbench). But when you reach the limit, you can't redesign the whole application. You just want to add a node, maybe automatically for short peaks of load. This was a reason to design the application to run most of the processing on application servers. And this is still valid for monolithic databases, where you can write on one node only. Offloading some queries to reader nodes, or manual table sharding, may help, but this again increases the complexity of the application design.
  • Stored procedures were also a problem for application deployment. Development teams tried to minimize the impact of application releases on the database, as this usually involved another team (DBA), and it was hard to keep a continuous integration environment synchronized with the application. However, this has changed in the past years: DevOps, Liquibase/Flyway, Terraform, persistent volumes for databases, managed cloud services... Today the database is just another component of the application, fully automated with Infrastructure as Code. The code can be deployed to the database in the same way as to the application.
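On that last point, here is a sketch of what such a deployment can look like (the file name, procedure, and table are invented for illustration): the database code ships as a plain versioned migration, applied by Flyway or Liquibase in the same CI/CD pipeline as the rest of the application.

```sql
-- V42__add_transfer_procedure.sql (a Flyway-style versioned migration,
-- deployed by the pipeline like any other application artifact)
CREATE OR REPLACE PROCEDURE transfer(src bigint, dst bigint, amt numeric)
LANGUAGE plpgsql AS $$
BEGIN
  -- both updates commit or roll back together with the calling transaction
  UPDATE accounts SET balance = balance - amt WHERE id = src;
  UPDATE accounts SET balance = balance + amt WHERE id = dst;
END;
$$;
```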

For some of those reasons, and a few more, the use of stored procedures and complex SQL functions was abandoned. The cost of operation was higher, because of the additional CPU and network round-trips, but this was accepted in order to run most of the code on scalable components. However, all this changes again with NewSQL databases, because distributed databases can scale out in exactly the same way as application servers. Just add a node (or have auto-scaling do it for you) and the data and the load are automatically rebalanced. Without any downtime, you can run on more nodes, in multiple AZs, regions, or cloud providers, for better availability and lower cost. You can even move to cheaper instances when available (like ARM processors). The database nodes are operated in the same way as the application ones, the stateful synchronization and persistent storage being managed automatically by the system.

With PostgreSQL you have all the features to encapsulate services into stored procedures, in the language of your choice, and powerful SQL functions can do the heavy lifting and return all the data in one JSON document. When this runs on a distributed database, the processing can scale without the risk of hitting a single node's limits. Putting all those ideas together, you can understand why Yugabyte decided not only to be wire-compatible with PostgreSQL, but to use its code for the query layer. Each node runs a stateless postgres, with stored procedures and powerful SQL functions, and your application can connect to any node (for load balancing or geo-proximity) to scale out this processing. All of those nodes are also writer nodes, persisting data and replicating to the other nodes, thanks to the DocDB distributed engine. This is the "cloud-native" architecture of YugabyteDB: a stateless query layer with all PostgreSQL features, and a distributed document store for strong consistency and persistence.
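A minimal sketch of this pattern (function, table, and column names are assumptions for illustration): a whole service endpoint is encapsulated in one SQL function returning JSON, and because every node runs the same stateless query layer, the application can send the call to any of them.

```sql
-- A service endpoint as a SQL function: one call, one JSON result.
CREATE FUNCTION customer_summary(p_id bigint) RETURNS jsonb
LANGUAGE sql STABLE AS $$
  SELECT jsonb_build_object(
           'customer', c.name,
           'total',    sum(o.amount)
         )
  FROM customers c
  JOIN orders    o ON o.customer_id = c.id
  WHERE c.id = p_id
  GROUP BY c.name;
$$;

-- The application can issue this against any node of the cluster:
-- SELECT customer_summary(42);
```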

The header picture of this article is taken from Bryn Llewellyn's slides about the YugabyteDB design. More information about the PostgreSQL compatibility in the doc: https://docs.yugabyte.com/latest/explore/ysql-language-features/

Aivars Kalvans

FinTech Developer | Software Architect | Consultant

3y

Instead of putting the code back in the DB, I like the approach of ok.ru: embedding a distributed database into the application code https://youtu.be/Fhu-qh5eGXc?t=1273
