Spark SQL Optimization

Introduction to Apache Spark SQL Optimization

“The term optimization refers to a process in which a system is modified in such a way that it works more efficiently or uses fewer resources.”

Spark SQL is the most technically involved component of Apache Spark. It handles both SQL queries and the DataFrame API. At the heart of Spark SQL lies the Catalyst optimizer, which leverages advanced programming language features to build an extensible query optimizer.

To implement Spark SQL, a new extensible optimizer called Catalyst was developed. This optimizer is based on functional programming constructs in Scala.

Catalyst Optimizer supports both rule-based and cost-based optimization. In rule-based optimization, the optimizer uses a set of rules to determine how to execute the query. Cost-based optimization, by contrast, finds the most suitable way to carry out a SQL statement: multiple plans are generated using rules, their costs are computed, and the cheapest plan is chosen.
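You can watch Catalyst at work by asking Spark to print the plans it produces for a query. Below is a minimal sketch, assuming a local Spark setup; the sample data and the view name people are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-explain")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data registered as a temporary view.
    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed, and optimized logical plans
    // as well as the physical plan chosen by Catalyst. Note how the trivial
    // predicate 1 = 1 disappears from the optimized logical plan.
    spark.sql("SELECT name FROM people WHERE age > 18 AND 1 = 1").explain(true)

    spark.stop()
  }
}
```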

What is the Need for Catalyst Optimizer?

There are two purposes behind Catalyst’s extensible design:

We want to make it easy to add new optimization techniques that tackle various problems in big data, such as handling semi-structured data and supporting advanced analytics.

We want an easy way for external developers to extend the optimizer.

Fundamentals of Catalyst Optimizer

Catalyst optimizer makes use of standard features of the Scala programming language, such as pattern matching. At its core, Catalyst contains a tree data type and a set of rules to manipulate trees, along with libraries specific to relational query processing. Various rule sets handle the different phases of query execution: analysis, query optimization, physical planning, and code generation to compile parts of queries to Java bytecode. Let’s discuss trees and rules in detail.

Trees

A tree is the main data type in Catalyst. A tree contains node objects. Each node has a node type and can have zero or more children. New node types are defined as subclasses of the TreeNode class. These objects are immutable in nature and can be manipulated using functional transformations.

For example, suppose we have three node classes: worth, attribute, and sub (see the sketch after this list), in which:

  • worth(value: Int): a constant value
  • attribute(name: String): an attribute from an input row, e.g. “x”
  • sub(left: TreeNode, right: TreeNode): subtraction of two expressions
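Here is a minimal sketch of these node classes in Scala, assuming a simplified TreeNode base trait (Catalyst’s real TreeNode class carries much more machinery); the lowercase class names follow the text above:

```scala
// Simplified stand-in for Catalyst's TreeNode base class.
sealed trait TreeNode

// The three node classes from the text above.
case class worth(value: Int) extends TreeNode                     // a constant value
case class attribute(name: String) extends TreeNode               // an attribute from an input row
case class sub(left: TreeNode, right: TreeNode) extends TreeNode  // subtraction of two expressions

// The expression x - (2 - 1) represented as an immutable tree:
val expr: TreeNode = sub(attribute("x"), sub(worth(2), worth(1)))
```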

Rules

We can manipulate a tree using rules. A rule is a function from one tree to another tree. With a rule we can run arbitrary code on the input tree, but the most common approach is to use a pattern matching function that finds subtrees with a specific structure and replaces them. With the help of the transform function, we can recursively apply pattern matching to all the nodes of a tree, transforming each node that matches a pattern into a result, as in the sketch below.
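Continuing the earlier sketch, here is a hedged example of a rule. The transform helper below is a simplified stand-in for the method Catalyst defines on TreeNode, which likewise takes a partial function:

```scala
// A simplified recursive transform: apply the rule bottom-up to every node.
// (Catalyst's TreeNode.transform works in a similar spirit.)
def transform(tree: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
  val withNewChildren = tree match {
    case sub(left, right) => sub(transform(left)(rule), transform(right)(rule))
    case leaf             => leaf // worth and attribute have no children
  }
  rule.applyOrElse(withNewChildren, identity[TreeNode])
}

// A constant-folding rule: subtraction of two constants becomes one constant.
val foldConstants: PartialFunction[TreeNode, TreeNode] = {
  case sub(worth(a), worth(b)) => worth(a - b)
}

// x - (2 - 1) is rewritten to x - 1; nodes that match no pattern are left unchanged.
val optimized = transform(sub(attribute("x"), sub(worth(2), worth(1))))(foldConstants)
```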
