With Spark runtime 1.3 in Microsoft Fabric notebooks, we now have access to the CLUSTER BY option, adding to our optimization toolkit. However, this new capability also introduces complexity when deciding which optimization techniques to use and when. This guide explores different optimization strategies based on table sizes and scenarios.
Table Size Categories and Recommendations
Understanding Optimization Techniques
- Beneficial in all scenarios
- Improves read efficiency within partitions and Z-ordered files
- Enhances Parquet storage layout
- Can be used with both CLUSTER BY and Z-Order
- Helps maintain order within clusters, reducing query shuffle
- Post-write optimization process
- Reorganizes data for better storage and query performance
- Compacts small files
- Improves read performance
- Compatible with both CLUSTER BY and Z-Order
- Post-write optimization step
- Best for medium-sized tables (10GB - 10TB)
- Optimizes for range-based queries
- Multi-dimensional clustering
- Works well with frequently queried columns
- Cannot be used with CLUSTER BY
- Reduces file count and optimizes small files
- Applied during write process
- Best for large tables (>10TB)
- Organizes data files during writing
- Optimal for columns with medium to high cardinality
- Improves write speed and data distribution
- Cannot be used with partitioning
- Reduces file count and optimizes small files
Detailed Recommendations by Scenario
Scenario 1: Small Tables (<10GB)
- Node Type: Medium Node
- Recommended Strategy: Use OPTIMIZE and V-Order only Skip partitioning (tables too small to benefit) Avoid Z-Order (resource overhead not justified)
Scenario 2: Medium Tables (10GB - 10TB)
- Node Type: Large Node
- Recommended Strategy: Implement partitioning by Year or YearQuarter Apply OPTIMIZE for file optimization Use V-Order for improved storage layout Implement Z-Order on frequently queried columns (e.g., date, category)
Scenario 3: Large Tables (>10TB)
- Node Type: XLarge Node
- Recommended Strategy: Skip partitioning (use CLUSTER BY instead) Apply OPTIMIZE for post-write optimization Use V-Order within clusters Implement CLUSTER BY on high-cardinality columns (e.g., category, subcategory)
- CLUSTER BY and partitioning are mutually exclusive - choose one based on table size and query patterns
- V-Order is beneficial in all scenarios and compatible with all other optimization techniques
- OPTIMIZE can be used alongside any other optimization strategy
- For medium-sized tables, combining partitioning with Z-Order provides optimal query performance
- For large tables, CLUSTER BY with OPTIMIZE and V-Order offers the best balance of write and read performance
This comprehensive approach to optimization ensures optimal performance while considering the specific characteristics and requirements of different table sizes in Microsoft Fabric environments.