Spark SQL

Spark introduces a programming module for structured data processing called Spark SQL. It provides a programming abstraction called DataFrame and can act as a distributed SQL query engine.

Features of Spark SQL

  • Integrated: Seamlessly mix SQL queries with Spark programs. Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Python, Scala and Java. This tight integration makes it easy to run SQL queries alongside complex analytic algorithms (see the sketch after this list).
  • Unified Data Access: Load and query data from a variety of sources. Schema-RDDs provide a single interface for efficiently working with structured data, including Apache Hive tables, Parquet files and JSON files.
  • Hive Compatibility: Run unmodified Hive queries on existing warehouses. Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Simply install it alongside Hive.
  • Standard Connectivity: Connect through JDBC or ODBC. Spark SQL includes a server mode with industry-standard JDBC and ODBC connectivity.
  • Scalability: Use the same engine for both interactive and long queries. Spark SQL takes advantage of the RDD model to support mid-query fault tolerance, letting it scale to large jobs too, so there is no need for a different engine for historical data.
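
To make the first point concrete, here is a minimal sketch in Scala (using the SparkSession entry point of Spark 2.x and later) of mixing a SQL query with ordinary DataFrame transformations in one program. The file name people.json and the column names are hypothetical.

    import org.apache.spark.sql.SparkSession

    // Entry point to Spark SQL.
    val spark = SparkSession.builder()
      .appName("SparkSqlFeatures")
      .master("local[*]")                  // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Load structured data from a JSON file (hypothetical path and schema).
    val people = spark.read.json("people.json")

    // Expose the DataFrame to SQL as a temporary view.
    people.createOrReplaceTempView("people")

    // Mix plain SQL with DataFrame transformations in the same program.
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.filter($"age" < 65).show()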

Spark SQL Architecture

[Image: Spark SQL architecture diagram]

Spark SQL DataFrames

The DataFrame in Spark SQL overcomes the limitations of RDDs. The Spark DataFrame was introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R/Python. We can create a DataFrame from any of the following sources (a sketch follows the list):

  • Structured data files
  • Tables in Hive
  • External databases
  • Existing RDDs
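
A minimal sketch of these creation paths, with hypothetical file names, table names, and connection details; the Hive and JDBC variants are commented out because they need external systems:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CreateDF").master("local[*]").getOrCreate()
    import spark.implicits._

    // 1. Structured data file (hypothetical Parquet file).
    val fromFile = spark.read.parquet("events.parquet")

    // 2. Hive table (requires .enableHiveSupport() on the builder).
    // val fromHive = spark.table("warehouse_db.events")

    // 3. External database over JDBC (hypothetical connection details).
    // val fromJdbc = spark.read.format("jdbc")
    //   .option("url", "jdbc:postgresql://dbhost:5432/shop")
    //   .option("dbtable", "events")
    //   .load()

    // 4. Existing RDD of case-class objects (define the case class at
    //    top level when compiling; shown inline here, spark-shell style).
    case class Event(id: Long, kind: String)
    val rdd = spark.sparkContext.parallelize(Seq(Event(1, "click"), Event(2, "view")))
    val fromRdd = rdd.toDF()
    fromRdd.show()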

Spark SQL Datasets

The Spark Dataset is an interface added in Spark 1.6. It is a distributed collection of data that provides the benefits of RDDs along with the benefits of Spark SQL's optimized execution engine. Here, an encoder is the component that converts between JVM objects and Spark's internal tabular representation.

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, filter, etc.), as in the sketch below. The Dataset API is available in Scala and Java; it is not supported in Python. However, because of the dynamic nature of Python, many of the benefits of the Dataset API are already available there, and the same is true of R. In Scala and Java, a DataFrame is simply represented as a Dataset of rows.
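
A minimal sketch of the Dataset API in Scala; the Person case class and its fields are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DatasetDemo").master("local[*]").getOrCreate()
    import spark.implicits._   // brings encoders for common types into scope

    // The encoder for Person is derived from the case class (define it at
    // top level when compiling; shown inline here, spark-shell style).
    case class Person(name: String, age: Int)

    // Build a strongly typed Dataset from JVM objects...
    val people = Seq(Person("Ada", 36), Person("Linus", 29)).toDS()

    // ...and manipulate it with functional transformations.
    val names = people.filter(_.age > 30).map(_.name)
    names.show()

    // In Scala, a DataFrame is just Dataset[Row]:
    val df = people.toDF()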

Spark Catalyst Optimizer

The optimizer used by Spark SQL is the Catalyst optimizer. It optimizes all queries written in Spark SQL and in the DataFrame DSL. The optimizer helps queries run much faster than their RDD counterparts, which increases the performance of the system.

Spark Catalyst is a library built as a rule-based system, and each rule focuses on a specific optimization. For example, the ConstantFolding rule focuses on eliminating constant expressions from the query.
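
One way to see this rule at work is to print the query plans with explain(true); a minimal sketch, using the same setup as the earlier examples:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("CatalystDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("x")

    // The predicate contains the constant sub-expression 1 + 2.
    val q = df.filter($"x" > lit(1) + lit(2))

    // explain(true) prints the parsed, analyzed, optimized, and physical
    // plans; in the optimized plan the predicate shows up as (x > 3),
    // because ConstantFolding evaluated 1 + 2 once at planning time.
    q.explain(true)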

Uses of Apache Spark SQL

  • It executes SQL queries.
  • We can read data from an existing Hive installation using Spark SQL.
  • When we run SQL from within another programming language, we get the result back as a Dataset/DataFrame.

Functions defined by Spark SQL

a. Built-in functions

Spark SQL offers built-in functions for processing column values. We can access them by importing org.apache.spark.sql.functions, as shown below.
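
For example, a minimal sketch using two of the built-in functions (upper and round); the column names and data are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._   // the built-in column functions

    val spark = SparkSession.builder().appName("BuiltIns").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 10.0), ("bob", 20.0)).toDF("name", "amount")

    // upper() and round() operate on column values row by row.
    df.select(upper($"name").as("name"),
              round($"amount" * 1.05, 2).as("with_tax"))
      .show()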

b. User-Defined Functions (UDFs)

UDFs let you wrap your own Scala functions as column-level functions and register them for use in both the DataFrame API and SQL queries, as in the sketch below.
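
A minimal sketch of defining and registering a UDF; the function initialCap and the data are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("UdfDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("alice", "bob").toDF("name")

    // Wrap an ordinary Scala function as a column-level UDF.
    val initialCap = udf((s: String) => s.head.toUpper + s.tail)

    // Use it in the DataFrame API...
    df.select(initialCap($"name").as("name")).show()

    // ...or register it by name for use in SQL queries.
    spark.udf.register("initialCap", (s: String) => s.head.toUpper + s.tail)
    df.createOrReplaceTempView("users")
    spark.sql("SELECT initialCap(name) AS name FROM users").show()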

c. Aggregate functions

These operate on a group of rows and calculate a single return value per group.
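
A minimal sketch with hypothetical sales data: groupBy collapses each group to one output row.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("AggDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("east", 100), ("east", 200), ("west", 50))
      .toDF("region", "amount")

    // One output row per region; sum() and avg() collapse each group.
    sales.groupBy("region")
      .agg(sum("amount").as("total"), avg("amount").as("average"))
      .show()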

d. Windowed Aggregates (Windows)

These operate on a group of rows and calculate a single return value for each row in a group.
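
Contrast this with the grouped aggregate above: a window keeps every input row and attaches the computed value to each of them. A minimal sketch with the same hypothetical data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window

    val spark = SparkSession.builder().appName("WindowDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("east", 100), ("east", 200), ("west", 50))
      .toDF("region", "amount")

    // Every input row is kept; each gets its rank within its region.
    val byRegion = Window.partitionBy("region").orderBy($"amount".desc)
    sales.withColumn("rank", rank().over(byRegion)).show()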

Advantages of Spark SQL

[Image: advantages of Spark SQL]

Disadvantages of Spark SQL

[Image: disadvantages of Spark SQL]

Conclusion – Spark SQL

In conclusion, Spark SQL is a module of Apache Spark for analyzing structured data. It provides scalability and ensures high compatibility with existing Hive deployments, and it offers standard connectivity through JDBC or ODBC. Thus, it provides a natural way to express structured data processing.





