Catalyst Optimizer

At the core of Spark SQL is the Catalyst Optimizer. Let's explore its different phases.


  • Analysis
  • Logical Optimization
  • Physical Planning
  • Code Generation
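
All four phases can be seen in a query's execution plan. The following is a minimal sketch, assuming a local SparkSession and a made-up two-column dataset; `explain(true)` prints the parsed (unresolved) logical plan, the analyzed logical plan, the optimized logical plan, and the selected physical plan.

```scala
import org.apache.spark.sql.SparkSession

// Local session just for the sketch; any existing SparkSession works the same way.
val spark = SparkSession.builder()
  .appName("catalyst-phases")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical tiny dataset.
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// explain(true) prints all four Catalyst stages: Parsed Logical Plan,
// Analyzed Logical Plan, Optimized Logical Plan, and Physical Plan.
people.filter($"age" > 40).select($"name").explain(true)
```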


Analysis

In this phase, the Catalyst Optimizer converts the unresolved logical plan into a resolved logical plan by resolving all references using the catalog.

Resolving references means checking whether each referenced column name is valid and whether the column's type matches the computation being performed. Spark SQL resolves these attributes using Catalyst rules and a catalog object that tracks the tables in all data sources.
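
As a quick illustration of analysis, here is a minimal sketch with hypothetical column names: a reference to a column that does not exist in the schema fails with an AnalysisException before any job runs, while a valid reference resolves cleanly.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Valid reference: "id" is resolved against the schema, so the plan becomes analyzed.
df.select("id").explain(true)

// Invalid reference: "idd" cannot be resolved, so analysis fails immediately.
try {
  df.select("idd")
} catch {
  case e: AnalysisException => println(s"Analysis failed: ${e.getMessage}")
}
```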


Logical Optimization

In this stage, the Catalyst Optimizer converts the resolved logical plan into an optimized logical plan by applying standard rule-based optimizations. These include predicate pushdown, projection pruning, etc.
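
To make this concrete, here is a minimal sketch assuming a hypothetical Parquet path under /tmp and made-up column names. In the output of `explain(true)`, the optimized logical plan shows the two filters combined into a single predicate and the projection pruned to the columns actually needed; the Parquet scan in the physical plan lists the pushed predicates under PushedFilters.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write a tiny Parquet file (hypothetical path) so there is a real source to push into.
Seq((1, "US", 250.0), (2, "DE", 80.0))
  .toDF("order_id", "country", "amount")
  .write.mode("overwrite").parquet("/tmp/orders_demo")

spark.read.parquet("/tmp/orders_demo")
  .filter($"amount" > 100)        // the two filters are combined into one predicate
  .filter($"country" === "US")
  .select("order_id")             // unused columns are pruned from the scan
  .explain(true)
```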

Physical Planning

In this phase, Spark SQL takes the optimized logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine. It then selects the best plan using a cost model.

Spark also performs rule-based physical optimizations here, such as pipelining projections or filters into a single Spark map operation. In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown.
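
A common place to see the cost model at work is join selection. In this minimal sketch (made-up tables), the small dimension side falls under spark.sql.autoBroadcastJoinThreshold, so the planner normally picks a BroadcastHashJoin; disabling the threshold makes it fall back to a SortMergeJoin.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val facts = spark.range(0, 1000000).toDF("id")                     // large side
val dims  = Seq((1L, "gold"), (2L, "silver")).toDF("id", "tier")   // tiny side

// The small side fits under the broadcast threshold, so the physical plan
// typically shows a BroadcastHashJoin.
facts.join(dims, "id").explain()

// Turning broadcast off makes the planner fall back to a SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
facts.join(dims, "id").explain()
```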

Code Generation

In the end, Spark deals only with a lower-level construct, the RDD (Resilient Distributed Dataset). In this final phase, the Catalyst Optimizer generates Java bytecode that runs efficiently on each machine against these in-memory datasets.
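
To peek at this stage, here is a minimal sketch assuming Spark 3.x, where Dataset.explain accepts a mode string: `explain("codegen")` prints the whole-stage-generated Java source that is compiled to bytecode and executed on each executor.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(0, 1000)
  .filter($"id" % 2 === 0)
  .selectExpr("id * 10 AS scaled")

// Prints the fused, generated code for the whole stage (Spark 3.x explain mode).
df.explain("codegen")
```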

