Apache Spark SQL Features
Malini Shukla
Introduction to Spark SQL
In Apache Spark, Spark SQL is the module for working with structured data. It supports distributed in-memory computation at very large scale. Unlike the basic RDD API, it exposes information about the structure of both the data and the computation being performed, and this extra information is very helpful for applying additional optimizations. We can easily execute SQL queries through it.
In addition, we can use Spark SQL to read data from an existing Hive installation. When SQL is run from within another programming language, the results come back as a Dataset/DataFrame. We can also interact with the SQL interface through the command line or over JDBC/ODBC.
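As a minimal sketch in Scala of that last point (the JSON path, the `people` view, and the column names are hypothetical), here is a SQL query run from a program, with the result coming back as an ordinary DataFrame:

```scala
import org.apache.spark.sql.SparkSession

// Create a session; "local[*]" is just for a local sketch.
val spark = SparkSession.builder()
  .appName("SparkSQLIntro")
  .master("local[*]")
  .getOrCreate()

// Expose a (hypothetical) JSON file to SQL as a temporary view.
spark.read.json("examples/people.json").createOrReplaceTempView("people")

// Run SQL from the program; the result is an ordinary DataFrame.
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
```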
Spark SQL provides three main capabilities for working with structured and semi-structured data:
- It provides a DataFrame abstraction in Scala, Java, and Python, which simplifies working with structured datasets. DataFrames are similar to tables in a relational database.
- It can read and write data in a variety of structured formats, for example Hive tables, JSON, and Parquet (see the sketch after this list).
- It lets us query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL.
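As a sketch of the second capability, continuing with the `spark` session from above (the file paths are assumptions), the same DataFrame API reads and writes several structured formats:

```scala
// Read JSON, write Parquet, read it back: one uniform API.
val events = spark.read.json("data/events.json")
events.write.mode("overwrite").parquet("data/events.parquet")
val fromParquet = spark.read.parquet("data/events.parquet")
```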
As elsewhere in Spark, developers can switch back and forth between these APIs in Spark SQL, choosing whichever is the most natural way to express a given transformation.
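For example, the following sketch (reusing the `events` DataFrame from the previous sketch; the `status` and `country` columns are hypothetical) expresses the same transformation first with the DataFrame API and then in SQL:

```scala
import org.apache.spark.sql.functions.col

// DataFrame API version.
val byApi = events.filter(col("status") === "ok").groupBy("country").count()

// SQL version of the same transformation.
events.createOrReplaceTempView("events")
val bySql = spark.sql(
  """SELECT country, COUNT(*) AS count
     FROM events
     WHERE status = 'ok'
     GROUP BY country""")
// Both produce the same result; use whichever reads more naturally.
```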
Spark SQL Features
Integrated
To integrate simply means to combine or merge. Here, Spark SQL queries are integrated with Spark programs: Spark SQL lets us query structured data inside a Spark program, using either SQL or the DataFrame API in languages such as Java and Scala.
We can also run streaming computations through it. Developers write a batch-style computation against the DataFrame/Dataset API, and Spark itself incrementalizes the computation so that it runs in a streaming fashion. The advantage for developers is that they do not have to manage state or failures on their own, nor keep the application in sync with batch jobs; instead, the streaming job always gives the same answer as a batch job on the same data.
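A minimal sketch of this, continuing with the `spark` session from earlier and assuming a directory of incoming JSON files with the schema below:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructType}

// Streaming sources need an explicit schema; this one is an assumption.
val schema = new StructType()
  .add("country", StringType)
  .add("status", StringType)

// The same batch-style DataFrame logic, applied to a streaming source.
val stream = spark.readStream.schema(schema).json("data/incoming/")
val counts = stream.filter(col("status") === "ok").groupBy("country").count()

// Spark incrementally maintains the aggregate as new files arrive.
val query = counts.writeStream
  .outputMode("complete") // emit the full updated counts table
  .format("console")
  .start()
```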
Unified Data Access
DataFrames and SQL support a common way to access a variety of data sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC. We can even join data across these sources, which turns out to be very helpful for accommodating existing users of those systems in Spark SQL.
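For illustration, a sketch that joins data from two different sources through the same DataFrame interface (the Parquet path, JDBC URL, credentials, and column names are all assumptions):

```scala
// One source: a Parquet file.
val users = spark.read.parquet("warehouse/users.parquet")

// Another source: a table read over JDBC.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", "orders")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// Join across sources as if they were ordinary tables.
val joined = users.join(orders, "user_id")
joined.show()
```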
High Compatibility
In Spark SQL, we can run unmodified Hive queries on existing warehouses. It offers full compatibility with existing Hive data, queries, and UDFs, since it reuses the Hive frontend and metastore.
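A minimal sketch, assuming a reachable Hive metastore and a hypothetical hr.employees table, of enabling Hive support and running an unmodified HiveQL query:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() connects the session to the existing metastore.
val spark = SparkSession.builder()
  .appName("HiveCompatibility")
  .enableHiveSupport()
  .getOrCreate()

// An unmodified Hive query against an existing warehouse table.
spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM hr.employees GROUP BY dept"
).show()
```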