What is Z-Order on Databricks?
What is Z-Order?
Z-Order can be compared to a cluster index in Oracle (I am a fan of SQL and databases, so my comparisons come from databases). Z-Ordering clusters the data by the columns in the Z-Order definition, so that rows with similar values in those columns are colocated in as few files as possible.
In an RDBMS, we use indexes to improve performance. Indexes also create files to store the mapping information, and as those files grow they become another problem to solve.
The Delta format stores data in underlying Parquet files, and Z-Order arranges those files to make range selection on object storage more efficient. Combined with the stats collection process and data skipping, Z-Order is similar to seek vs. scan operations in databases, which indexes solved, but without creating another compute bottleneck.
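To make the interplay of colocation and data skipping concrete, here is a minimal Python sketch. It is a toy model, not how Delta Lake implements Z-Ordering internally: the 8 x 8 grid of (x, y) rows, the four-row "files", and the helper names interleave, write_files and files_to_read are all assumptions made for illustration. It compares a plain sort by x then y with a Z-Order (bit-interleaved) sort and counts how many files a min/max check cannot skip.

```python
from itertools import product

def interleave(x, y, bits=3):
    """Morton (Z-order) key: bits of x and y alternate, x in the even positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def write_files(rows, rows_per_file=4):
    """Split already-sorted rows into 'files' and record min/max stats per column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files

def files_to_read(files, column, value):
    """Data skipping: a file is read only if its [min, max] range can hold value."""
    return sum(1 for f in files if f[column][0] <= value <= f[column][1])

rows = list(product(range(8), repeat=2))  # 64 (x, y) rows, values 0..7

linear = write_files(sorted(rows))  # sorted by x, then y
zorder = write_files(sorted(rows, key=lambda r: interleave(*r)))

for name, layout in (("linear", linear), ("zorder", zorder)):
    print(name, files_to_read(layout, "x", 6), files_to_read(layout, "y", 6))
# linear 2 8
# zorder 4 4
```

Out of 16 files, the plain sort serves a filter on x well (2 files) but a filter on y poorly (8 files), while the Z-Order layout reads 4 files for either column. That balance across the Z-Order columns is what lets the file statistics skip data for filters on any of them.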
The image below is an example of the Z-Ordering table scan approach.
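Z-Order relies on the column statistics that Delta Lake collects for data skipping. Collecting statistics on a long string attribute is a costly operation; to reduce that cost, we can limit the number of columns that statistics are collected on with the delta.dataSkippingNumIndexedCols table property.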
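As a rough sketch of how that property could be set from a Python notebook, the helper below only builds the ALTER TABLE statement as a string. The helper name set_indexed_cols_sql and the value 4 are hypothetical choices for illustration; the employee table name is borrowed from the sample further down. Running the statement assumes a Databricks notebook where a spark session already exists, so that call is left as a comment.

```python
def set_indexed_cols_sql(table: str, num_cols: int) -> str:
    """Build the statement that limits stats collection to the first num_cols columns.

    -1 asks Delta Lake to collect statistics on all columns.
    """
    if num_cols < -1:
        raise ValueError("num_cols must be -1 (all columns) or a non-negative integer")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '{num_cols}')"
    )

stmt = set_indexed_cols_sql("employee", 4)  # 4 is an illustrative value
print(stmt)
# In a Databricks notebook (assumption: `spark` is predefined): spark.sql(stmt)
```

Statistics are collected on the leading columns of the table schema, so the columns used in the Z-Order definition should sit within that leading range; Z-Ordering on a column without statistics gives data skipping nothing to work with.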
Z-Ordering best practices
Limit the number of columns in the Z-Order (one to four attributes).
Choose unique or high-cardinality attributes (most likely the join columns).
Run the Z-Order once the data load is completed; the back-end files are then sorted and divided based on the Z-Order definition.
If the fact and dimension tables are Z-Ordered on the same set of attributes, and those attributes are used in the join, performance improves.
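The advice to keep the Z-Order column list short can be illustrated with a small Python simulation. It is a toy model under stated assumptions, not Delta Lake's actual implementation: a synthetic table of 4096 rows with four columns holding values 0 to 7, 64 "files" of 64 rows each, and the hypothetical helpers z_value and files_touched. It counts how many files a filter on the first column still has to read as more columns are added to the Z-Order key.

```python
from itertools import product

BITS = 3  # each column holds values 0..7

def z_value(values, bits=BITS):
    """Interleave the bits of all values into one Morton (Z-order) key."""
    k = len(values)
    z = 0
    for bit in range(bits):
        for col, v in enumerate(values):
            z |= ((v >> bit) & 1) << (bit * k + col)
    return z

def files_touched(num_zorder_cols, wanted=5, rows_per_file=64):
    """Files whose min/max range on column 0 cannot rule out `wanted`."""
    rows = list(product(range(2 ** BITS), repeat=4))  # 4096 rows, 4 columns
    rows.sort(key=lambda r: z_value(r[:num_zorder_cols]))
    touched = 0
    for start in range(0, len(rows), rows_per_file):
        col0 = [r[0] for r in rows[start:start + rows_per_file]]
        if min(col0) <= wanted <= max(col0):
            touched += 1
    return touched

results = {k: files_touched(k) for k in range(1, 5)}
print(results)
# {1: 8, 2: 8, 3: 16, 4: 32}
```

In this model a filter on the first column reads 8 of 64 files with one or two Z-Order columns, 16 with three, and 32 with four. Each added column takes a share of the sort key, so the locality available to any single column shrinks.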
Sample code:
-- Z-Order each table on its join column
OPTIMIZE employee ZORDER BY (dep_id_fk);
OPTIMIZE dept ZORDER BY (dep_id_pk);
-- Join on the Z-Ordered columns
SELECT * FROM employee a JOIN dept b ON a.dep_id_fk = b.dep_id_pk;
Ref: https://docs.databricks.com/delta/data-skipping.html
Thank you!