Deep Dive on TimesTen Scaleout Distribution Clauses
Overview
TimesTen Scaleout is a distributed database which has a shared nothing architecture suitable for low latency, high throughput OLTP, IoT and analytic workloads. TimesTen Scaleout supports high availability, ACID transactions, SQL & PL/SQL and does not require database de-normalization.
Oracle TimesTen Scaleout supports SQL operations independent of the number of hosts. However, as the database is spread over multiple hosts, network messages are required to complete most SQL operations. A fundamental performance tuning technique is to minimize the number of network messages given common use cases. TimesTen Scaleout uses table distribution clauses to enable these optimizations.
TimesTen Scaleout supports three data distribution methods (Distribute by Hash, Distribute by Reference and Duplicate). The following picture is an example of a master-detail-detail hierarchy using all three distribution methods. This is not the only way to support this master-detail-detail hierarchy.
Figure 1
In figure 1, the 'customers' table rows are evenly spread across all hosts in the database. The 'orders' rows are not guaranteed to have the same number of rows per host as some customers have more orders than others. Given a large number of orders, it is expected that the number of orders per host will average out. The products table is duplicated across all hosts so each host will have exactly the same number of rows per host.
If an application wants to query all orders and products for a given customer, then no network messages are required for the join as all of those rows will be present on the host that contains the customer row.
Distribute by Hash
For large tables that are the 'master' table [eg customers], you want to be able to evenly spread the data across all of the hosts in your database. You also want to be able to add and remove hosts from the database without having to make dramatic changes to the distribution of data for that table. The TimesTen Scaleout distribute by hash mechanism uses a consistent hash algorithm to achieve these goals.
Figure 2.
In figure 2, the distribution key is explicitly set to be the same as the primary key. If you do not specify a distribution clause for a table, TimesTen Scaleout will by default use the PK columns for the hash distribution clause. If there is no PK then TimesTen Scaleout will randomly distribute by hash using a hidden internal column. You can also explicitly distribute by a set of columns which are different than the primary key.
In figure 1, the customer rows are evenly distributed across the four nodes in the database. The consistent hash algorithm used is not a perfect hash algorithm so there will tend to be a small variation of the number of rows per table across. Where there are more than about ten thousand rows in a table with distribute by hash, we expect that there will be no more than 2% variation in the number of rows.
Figure 3
Figure 3 shows that all of the nodes (elements) have about the same number of rows [in replica set 1] for the customers table and that there is no more than 2% variation in the number of rows for the replica set.
Distribute by Reference
When you have tables that you expect to be joined often via PK/FK, you can use the distribute by reference table clause for the detail table. When the distribute by reference syntax is used, you guarantee that the child rows in the detail table will be in the same element as the corresponding parent row. This means that a join of those PK/FK will occur on the same element. Joins on the same node will occur in memory in TimesTen Scaleout, so those joins can occur very quickly. As shown in figure 1, the customers table rows are spread evenly over all nodes, but the orders rows for the customers table PKs are always on the same node.
For distribute by reference to work, the parent table has to have a PK and the child table has to have one or more FKs. When there is more than one FK, you choose the FK that will be the most used for your workload.
Figure 4
In figure 4, the FK_CUSTOMER foreign key was chosen over the FK_ACCT_TYPE and FK_ACCT_STATUS foreign keys. This picture also shows that the table is distributed by a set of columns [ie CUST_ID] which is different than the PK columns for the table [ie ACCOUNT_ID].
Figure 5.
In figure 5, with a million rows in the accounts table, there is a fairly even distribution of rows across all of the nodes with less than 1% variation.
Duplicate Distribution
For tables that are often joined to other tables and are 'read mostly', the duplicate distribution clause can be very effective. The contents of these tables are copied/duplicated across all elements in the database. The advantage of duplicate distribution is that these tables can always enable efficient in-memory joins. The disadvantage of duplicate tables is that inserts, updates and deletes to these tables require two phase commit transactions to all other nodes in the database.
Figure 6.
In figure 6, the account_type table uses the duplicate table distribution clause.
Figure 7.
Figure 7 shows that duplicate tables always have the same number of rows. The account_type table only has six rows, but duplicate tables could have billions of rows as long as they are read mostly.
Explicit table distribution clauses are optional performance optimizations for TimesTen Scaleout. Although indexes and materialized views can also help tune data access, always tune your table distribution clauses first.
Disclaimer: These are my personal thoughts and do not represent Oracle's official viewpoint in any way, shape, or form.