Spark SQL Optimization

Introduction to Apache Spark SQL Optimization

“The term optimization refers to a process in which a system is modified in such a way that it works more efficiently or uses fewer resources.”

Spark SQL is the most technically involved component of Apache Spark. It handles both SQL queries and the DataFrame API. At the heart of Spark SQL lies the Catalyst optimizer, which leverages advanced programming language features to build an extensible query optimizer.

To implement Spark SQL, a new extensible optimizer called Catalyst was developed. This optimizer is based on functional programming constructs in Scala.

Catalyst Optimizer supports both rule-based and cost-based optimization. In rule-based optimization, the optimizer uses a set of rules to determine how to execute the query. Cost-based optimization, by contrast, finds the most suitable way to carry out a SQL statement: multiple plans are generated using rules, their costs are computed, and the cheapest plan is chosen.
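You can watch Catalyst at work by asking Spark to print the plans it produces for a query. Below is a minimal sketch, assuming a local Spark setup; the sample data and the view name people are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

object ExplainDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-explain")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data registered as a temporary view.
    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // explain(true) prints the parsed, analyzed, and optimized logical plans
    // as well as the physical plan chosen by Catalyst. Note how the trivial
    // predicate 1 = 1 disappears from the optimized logical plan.
    spark.sql("SELECT name FROM people WHERE age > 18 AND 1 = 1").explain(true)

    spark.stop()
  }
}
```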

What is the Need for Catalyst Optimizer?

There are two purposes behind Catalyst’s extensible design:

We want to make it easy to add new optimization techniques that tackle various problems in big data, such as handling semi-structured data and supporting advanced analytics.

We want an easy way for external developers to extend the optimizer.

Fundamentals of Catalyst Optimizer

Catalyst optimizer makes use of standard features of the Scala programming language, such as pattern matching. At its core, Catalyst contains a tree data type and a set of rules to manipulate trees, along with libraries specific to relational query processing. Various rule sets handle the different phases of query execution: analysis, query optimization, physical planning, and code generation to compile parts of queries to Java bytecode. Let’s discuss trees and rules in detail.

Trees

A tree is the main data type in Catalyst. A tree contains node objects. Each node has a node type and can have zero or more children. New node types are defined as subclasses of the TreeNode class. These objects are immutable in nature and can be manipulated using functional transformations.

For example, suppose we have three node classes: worth, attribute, and sub (see the sketch after this list), in which:

  • worth(value: Int): a constant value
  • attribute(name: String): an attribute from an input row, e.g. “x”
  • sub(left: TreeNode, right: TreeNode): subtraction of two expressions
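Here is a minimal sketch of these node classes in Scala, assuming a simplified TreeNode base trait (Catalyst’s real TreeNode class carries much more machinery); the lowercase class names follow the text above:

```scala
// Simplified stand-in for Catalyst's TreeNode base class.
sealed trait TreeNode

// The three node classes from the text above.
case class worth(value: Int) extends TreeNode                     // a constant value
case class attribute(name: String) extends TreeNode               // an attribute from an input row
case class sub(left: TreeNode, right: TreeNode) extends TreeNode  // subtraction of two expressions

// The expression x - (2 - 1) represented as an immutable tree:
val expr: TreeNode = sub(attribute("x"), sub(worth(2), worth(1)))
```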

Rules

We can manipulate a tree using rules. A rule is a function from one tree to another tree. With a rule we can run arbitrary code on the input tree, but the most common approach is to use a pattern matching function that finds subtrees with a specific structure and replaces them. With the help of the transform function, we can recursively apply pattern matching to all the nodes of a tree, transforming each node that matches a pattern into a result, as in the sketch below.
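Continuing the earlier sketch, here is a hedged example of a rule. The transform helper below is a simplified stand-in for the method Catalyst defines on TreeNode, which likewise takes a partial function:

```scala
// A simplified recursive transform: apply the rule bottom-up to every node.
// (Catalyst's TreeNode.transform works in a similar spirit.)
def transform(tree: TreeNode)(rule: PartialFunction[TreeNode, TreeNode]): TreeNode = {
  val withNewChildren = tree match {
    case sub(left, right) => sub(transform(left)(rule), transform(right)(rule))
    case leaf             => leaf // worth and attribute have no children
  }
  rule.applyOrElse(withNewChildren, identity[TreeNode])
}

// A constant-folding rule: subtraction of two constants becomes one constant.
val foldConstants: PartialFunction[TreeNode, TreeNode] = {
  case sub(worth(a), worth(b)) => worth(a - b)
}

// x - (2 - 1) is rewritten to x - 1; nodes that match no pattern are left unchanged.
val optimized = transform(sub(attribute("x"), sub(worth(2), worth(1))))(foldConstants)
```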
