Accelerate Performance by Choosing the Right Table Type in Azure Database for PostgreSQL - Hyperscale (Citus)
Hyperscale (Citus) Introduction:
If you’re looking to scale out your multi-tenant app on Azure - or build a real-time operational analytics app that gives your customers sub-second performance even with billions of rows - you can now scale out PostgreSQL horizontally with Azure Database for PostgreSQL - Hyperscale (Citus). It is a built-in deployment option in Azure Database for PostgreSQL. It scales out your data across multiple physical nodes, with the underlying data sharded into much smaller pieces. Unlike traditional sharded systems, your application doesn’t have to learn how to shard the data. This means you can go back to building new features and functionality, without having to deal with a massive database that is a pain to maintain. When you provision an Azure Database for PostgreSQL - Hyperscale (Citus) server group, you get a coordinator node that is responsible for distributed planning, query routing, and aggregation, as well as the number of worker nodes you specify. The cluster comes with the Citus extension already installed and pre-configured, so you can begin adding data right away.
Hyperscale (Citus) Features:
- Scale out horizontally: Break free from the limits of single-node PostgreSQL & scale out across 100s of nodes.
- Blazing performance: Ingest & query your database in real time, with sub-second responses across billions of rows.
- Simplified infrastructure: Save time. Run both transactions and analytics in one database. Also, avoid the painful costs of manual sharding.
- Stay current with PostgreSQL innovations: Developed as an open source extension, leverage all your PostgreSQL expertise and its latest innovations.
Choose between Single Server and Hyperscale (Citus):
The decision depends on the type of workloads you run.
Use Single Server with:
- Transactional and operational analytics workloads
- Apps requiring JSON, geospatial support, or full-text search
- Cloud-native apps built with modern frameworks
Use Hyperscale (Citus) with:
- Scaling PostgreSQL multi-tenant SaaS applications
- Real-time operational analytics
- Building high-throughput transactional applications
Terminology:
- Coordinator: Stores metadata; the node the application connects to.
- Worker / Data nodes: Nodes that store the data in the form of shards.
- Sharding: Process of dividing data among nodes.
- Shards: A PostgreSQL table on the worker nodes containing a subset of rows.
- Distribution column / key: Column used to distribute data among nodes.
Table Types:
The distributed nature of Citus gives you new flexibility when it comes to modeling your data. This is good. But you’ll need to think about how to model your data and what type of database tables to use. The way you query your data ultimately determines how you can best model each table.
There are three types of tables in a Hyperscale (Citus) server group, each used for different purposes:
- Distributed Tables
- Reference Tables
- Local Tables
Type 1: Distributed tables
The first type, and most common, is distributed tables. They appear to be normal tables to SQL statements, but they're horizontally partitioned across worker nodes. What this means is that the rows of the table are stored on different nodes, in fragment tables called shards.
Hyperscale (Citus) runs not only SQL but DDL statements throughout a cluster. Changing the schema of a distributed table cascades to update all the table's shards across workers.
Points to remember:
- Large tables (>10GB)
- Tables can be co-located with each other
- Enables localized and fast joins on workers
- Examples: transactions, events etc
SELECT create_distributed_table('table_name', 'column_name');
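For example, here is a minimal sketch (the events table and its columns are hypothetical, not taken from this article) of creating an ordinary table, distributing it, and then issuing a schema change that propagates to every shard:

-- Hypothetical example: start with an ordinary PostgreSQL table
CREATE TABLE events (
    device_id  bigint NOT NULL,
    event_time timestamptz NOT NULL,
    payload    jsonb
);

-- Shard the table across the workers by device_id
SELECT create_distributed_table('events', 'device_id');

-- DDL on a distributed table cascades to all of its shards
ALTER TABLE events ADD COLUMN event_type text;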
Choose distribution columns in Azure Database for PostgreSQL – Hyperscale (Citus):
Choosing each table's distribution column is one of the most important modeling decisions you'll make. Azure Database for PostgreSQL - Hyperscale (Citus) stores rows in shards based on the value of the rows' distribution column.
The correct choice groups related data together on the same physical nodes, which makes queries fast and adds support for all SQL features. An incorrect choice makes the system run slowly and won't support all SQL features across nodes. This article explains distribution column tips for the most common Hyperscale (Citus) scenarios.
Sharding Patterns - Distribution by Tenant ID:
The multi-tenant architecture uses a form of hierarchical database modeling to distribute queries across nodes in the distributed cluster. The top of the data hierarchy is known as the tenant ID, and it needs to be stored in a column on each table. Citus inspects queries to see which tenant ID they involve and routes each query to a single worker node for processing, specifically the node that holds the data shard associated with that tenant ID. Placing all the relevant data for a tenant on the same node is called table co-location.
A typical multi-tenant schema looks like the following:
CREATE TABLE accounts (
    id bigint,
    name text NOT NULL,
    image_url text NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE ads (
    id bigint,
    account_id bigint,
    campaign_id bigint,
    name text NOT NULL,
    image_url text NOT NULL,
    target_url text NOT NULL,
    impressions_count bigint DEFAULT 0 NOT NULL,
    clicks_count bigint DEFAULT 0 NOT NULL,
    PRIMARY KEY (account_id, id),
    FOREIGN KEY (account_id) REFERENCES accounts
);

CREATE TABLE clicks (
    id bigint,
    account_id bigint,
    ad_id bigint,
    clicked_at timestamp without time zone NOT NULL,
    site_url text NOT NULL,
    cost_per_click_usd numeric(20,10),
    user_ip inet NOT NULL,
    user_data jsonb NOT NULL,
    PRIMARY KEY (account_id, id),
    FOREIGN KEY (account_id) REFERENCES accounts,
    FOREIGN KEY (account_id, ad_id) REFERENCES ads (account_id, id)
);

SELECT create_distributed_table('accounts', 'id');
SELECT create_distributed_table('ads', 'account_id');
SELECT create_distributed_table('clicks', 'account_id');
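With this schema, every query that filters on one account is routed to the single worker that holds that tenant's co-located shards. A minimal sketch (the account ID value is made up for illustration):

-- All rows for account 42 live in co-located shards on one worker,
-- so this join and aggregation run entirely on that node
SELECT a.name, count(c.id) AS clicks
FROM ads a
JOIN clicks c ON c.account_id = a.account_id AND c.ad_id = a.id
WHERE a.account_id = 42
GROUP BY a.name;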
Sharding Patterns - Distribution by Entity ID:
While the multi-tenant architecture introduces a hierarchical structure and uses data co-location to parallelize queries between tenants, real-time architectures depend on specific distribution properties of their data to achieve highly parallel processing. We use “entity id” as a term for distribution columns in the real-time model, as opposed to tenant ids in the multi-tenant model. Typical entities are users, hosts, or devices.
Real-time queries typically ask for numeric aggregates grouped by date or category. Citus sends these queries to each shard for partial results and assembles the final answer on the coordinator node. Queries run fastest when as many nodes contribute as possible, and when no individual node bottlenecks.
The more evenly a choice of entity ID distributes data across shards, the better. At a minimum, the column should have high cardinality. For comparison, a "status" field on an order table is a poor choice of distribution column because it takes on at most a few values. Those few values cannot take advantage of a cluster with many shards; row placement will skew into a small handful of shards.
Of columns having high cardinality, it is good additionally to choose those that are frequently used in group-by clauses or as join keys. Distributing by join keys co-locates the joined tables and greatly improves join speed. Real-time schemas usually have few tables, and are generally centered around a big table of quantitative events.
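To illustrate the entity-ID pattern, here is a hedged sketch (table and column names are hypothetical) of a real-time events table distributed by its entity, followed by an aggregate query that Citus parallelizes across all shards:

CREATE TABLE page_views (
    site_id   bigint NOT NULL,       -- entity ID: high cardinality, used in group-bys and joins
    viewed_at timestamptz NOT NULL,
    path      text NOT NULL
);

SELECT create_distributed_table('page_views', 'site_id');

-- Each shard computes a partial count; the coordinator merges the partial results
SELECT date_trunc('day', viewed_at) AS day, count(*) AS views
FROM page_views
GROUP BY 1
ORDER BY 1;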
Multi-tenant Apps Best Practices:
- Partition distributed tables by a common tenant_id column. For instance, in a SaaS application where tenants are companies, the tenant_id is likely to be the company_id.
- Convert small cross-tenant tables to reference tables. When multiple tenants share a small table of information, distribute it as a reference table.
- Filter all application queries by tenant_id. Each query should request information for one tenant at a time (see the short sketch after this list).
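As a minimal sketch of that last point (the values are made up), writes are kept tenant-scoped as well, so Citus can route every statement to a single worker:

-- Updates and deletes also carry the tenant_id
UPDATE ads
SET image_url = 'https://example.com/new-banner.png'
WHERE account_id = 42 AND id = 7;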
Real-Time Apps Best Practices:
- Choose a column with high cardinality as the distribution column. For comparison, a Status field on an order table with values New, Paid, and Shipped is a poor choice of distribution column. It assumes only those few values, which limits the number of shards that can hold the data, and the number of nodes that can process it. Among columns with high cardinality, it's also good to choose those columns that are frequently used in group-by clauses or as join keys.
- Choose a column with even distribution. If you distribute a table on a column skewed to certain common values, data in the table tends to accumulate in certain shards. The nodes that hold those shards end up doing more work than other nodes.
- Distribute fact and dimension tables on their common columns. Your fact table can have only one distribution key. Tables that join on another key won't be co-located with the fact table. Choose one dimension to co-locate based on how frequently it's joined and the size of the joining rows.
- Change some dimension tables into reference tables. If a dimension table can't be co-located with the fact table, you can improve query performance by distributing copies of the dimension table to all of the nodes in the form of a reference table, as shown in the sketch after this list.
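A hedged sketch of these practices (the sales, stores, and product_catalog tables are hypothetical): the fact table and its most frequently joined dimension share a distribution column, while a dimension that can't share the key becomes a reference table:

-- Fact table and its most-joined dimension are distributed on the same key
CREATE TABLE stores (store_id bigint PRIMARY KEY, region text);
CREATE TABLE sales (
    store_id   bigint NOT NULL,
    product_id bigint NOT NULL,
    sold_at    timestamptz NOT NULL,
    amount     numeric(12,2) NOT NULL
);
SELECT create_distributed_table('stores', 'store_id');
SELECT create_distributed_table('sales', 'store_id');

-- A dimension that joins on a different key is replicated to every node instead
CREATE TABLE product_catalog (product_id bigint PRIMARY KEY, name text NOT NULL);
SELECT create_reference_table('product_catalog');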
Time-Series Data Best Practices:
- Don't choose a timestamp as the distribution column. Choose a different distribution column. In a multi-tenant app, use the tenant ID, or in a real-time app use the entity ID.
- Use PostgreSQL table partitioning for time instead. Use table partitioning to break a large table of time-ordered data into multiple inherited tables, with each table containing a different time range. Distributing a PostgreSQL partitioned table in Hyperscale (Citus) creates shards for the inherited tables, as sketched below.
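A minimal sketch of that combination (the sensor_readings table, its columns, and the date range are illustrative assumptions):

-- Partition by time locally, distribute by the entity ID
CREATE TABLE sensor_readings (
    device_id bigint NOT NULL,
    read_at   timestamptz NOT NULL,
    reading   jsonb
) PARTITION BY RANGE (read_at);

CREATE TABLE sensor_readings_2023_01 PARTITION OF sensor_readings
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

-- Citus creates shards for each partition
SELECT create_distributed_table('sensor_readings', 'device_id');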
Type 2: Reference tables
A reference table is a type of distributed table whose entire contents are concentrated into a single shard. The shard is replicated on every worker. Queries on any worker can access the reference information locally, without the network overhead of requesting rows from another node. Reference tables have no distribution column because there's no need to distinguish separate shards per row.
Reference tables are typically small and are used to store data that's relevant to queries running on any worker node. An example is enumerated values like order statuses or product categories.
Points to remember:
- Replicated to all the nodes
- Small tables (<10GB)
- Efficient joins with distributed tables
- Cannot have sharding dimension
- Examples: countries, categories
SELECT create_reference_table('table_name');
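For instance, a short sketch (the categories and items tables are assumed for illustration): the reference table is replicated to every worker, so joins against a distributed table need no cross-node traffic:

CREATE TABLE categories (
    category_id bigint PRIMARY KEY,
    name        text NOT NULL
);
SELECT create_reference_table('categories');

CREATE TABLE items (
    item_id     bigint NOT NULL,
    category_id bigint NOT NULL,
    name        text NOT NULL
);
SELECT create_distributed_table('items', 'item_id');

-- categories is present on every worker, so this join runs locally on each node
SELECT i.name, c.name AS category
FROM items i
JOIN categories c USING (category_id);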
Type 3: Local tables
When you use Hyperscale (Citus), the coordinator node you connect to is a regular PostgreSQL database. You can create ordinary tables on the coordinator and choose not to shard them.
A good candidate for local tables would be small administrative tables that don't participate in join queries. An example is a users table for application sign-in and authentication.
Points to remember:
- Plain PostgreSQL tables on the coordinator node
- Admin tables that don't interact with the main distributed tables
- Separate micro-service that doesn’t need sharding
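A tiny sketch (the table name is assumed): a local table is just an ordinary PostgreSQL table created on the coordinator, with no Citus function call at all:

-- Lives only on the coordinator; never sharded or replicated to workers
CREATE TABLE app_admins (
    admin_id   bigint PRIMARY KEY,
    email      text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);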
Migrating database tables to Hyperscale (Citus) by Example:
If you're migrating an existing database to the Citus multi-tenant architecture, some of your tables may lack a column for the application-specific tenant ID. You will need to add one and fill it with the correct values, which will de-normalize your tables slightly. Let's walk through a concrete example of backfilling the tenant/entity ID:
1) Review Table modeling for Contoso Sales App
2) Identification of Table types
Important:
Co-location is based on the data type of the distribution column, not on the name of the column.
3) Decide the Table Types and identify where the back-filling is required
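As a hedged sketch of step 3 (the campaigns table is hypothetical and not part of the schema shown earlier): add the missing tenant column, backfill it from a related table that already carries the tenant ID, and only then distribute the table:

-- 1. Add the missing tenant column
ALTER TABLE campaigns ADD COLUMN account_id bigint;

-- 2. Backfill it from a table that already stores the tenant ID
UPDATE campaigns
SET account_id = ads.account_id
FROM ads
WHERE ads.campaign_id = campaigns.id;

-- 3. Distribute by the newly added tenant ID
SELECT create_distributed_table('campaigns', 'account_id');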
For more information:
Distributed data in Azure Database for PostgreSQL – Hyperscale (Citus)
Choose distribution columns in Azure Database for PostgreSQL – Hyperscale (Citus)
Table colocation in Azure Database for PostgreSQL – Hyperscale (Citus)
Database Table Types with Citus and PostgreSQL
Five sharding data models and which is right