APACHE HIVE

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.[3] Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.[4] While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).[5][6] Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
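
As a minimal sketch of the SQL-like abstraction described above, the following HiveQL creates a table and runs an aggregation that would otherwise require a hand-written MapReduce job. The page_views table and its columns are illustrative only, not part of any standard schema.

    -- Hypothetical table of web page views; table and column names are illustrative.
    CREATE TABLE page_views (
      user_id   BIGINT,
      url       STRING,
      view_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- An aggregation that would otherwise be a hand-written MapReduce job;
    -- Hive compiles it into jobs for the configured execution engine.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;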

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem and Alluxio. It provides an SQL-like query language called HiveQL[8] with schema on read and transparently converts queries to MapReduce, Apache Tez[9] and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0.[10] Other features of Hive include:

  • Different storage types such as plain text, RCFile, HBase, ORC, and others.
  • Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
  • Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs (see the sketch after this list).
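
The following sketch illustrates per-session engine selection and UDF registration, assuming the chosen engine is installed on the cluster. The JAR path and the class name com.example.hive.NormalizeUrlUDF are hypothetical, and page_views is the illustrative table from the earlier example.

    -- Choose the execution engine for the session (mr, tez, or spark).
    SET hive.execution.engine=tez;

    -- Built-in functions for dates and strings.
    SELECT upper(url), to_date(view_time)
    FROM page_views;

    -- Register a custom UDF packaged in a user-supplied JAR
    -- (the path and class name are hypothetical).
    ADD JAR /tmp/my_udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';
    SELECT normalize_url(url) FROM page_views;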

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.[11]

The first four file formats supported in Hive were plain text,[12] sequence file, optimized row columnar (ORC) format[13] and RCFile.[14] Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.[15][16] Additional Hive plugins support querying of the Bitcoin blockchain.
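
As a sketch, each of these formats is selected with a STORED AS clause at table creation time; the table and column names below are illustrative.

    -- Table and column names are illustrative; each table declares its storage format.
    CREATE TABLE views_text    (user_id BIGINT, url STRING) STORED AS TEXTFILE;
    CREATE TABLE views_seq     (user_id BIGINT, url STRING) STORED AS SEQUENCEFILE;
    CREATE TABLE views_rc      (user_id BIGINT, url STRING) STORED AS RCFILE;
    CREATE TABLE views_orc     (user_id BIGINT, url STRING) STORED AS ORC;
    CREATE TABLE views_parquet (user_id BIGINT, url STRING) STORED AS PARQUET;  -- native in 0.13+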

Comparison with traditional databases

The storage and querying operations of Hive closely resemble those of traditional databases. While HiveQL is an SQL dialect, there are many differences in the structure and behavior of Hive compared with relational databases. The differences are mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.

A schema is applied to a table in traditional databases. The table typically enforces the schema when the data is loaded into it, which lets the database ensure that the data entered follows the representation of the table as specified by the table definition. This design is called schema on write. In comparison, Hive does not verify the data against the table schema on write. Instead, it performs run-time checks when the data is read. This model is called schema on read.[22] The two approaches have their own advantages and drawbacks. Checking data against the table schema at load time adds extra overhead, which is why traditional databases take longer to load data. Quality checks are performed against the data at load time to ensure that the data is not corrupt; early detection of corrupt data enables early exception handling. Because the tables are forced to match the schema during or after the data load, they have better query-time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load, but with the drawback of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at load time, but is instead generated later dynamically.[22]
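
A minimal sketch of schema on read in HiveQL, assuming an external table over a hypothetical directory of comma-delimited files; the paths and names are illustrative.

    -- Schema on read: the table definition is applied when the data is queried,
    -- not when it is loaded. The location and file path are hypothetical.
    CREATE EXTERNAL TABLE raw_events (
      user_id BIGINT,
      url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw_events';

    -- LOAD DATA only moves the file into the table's directory;
    -- nothing is validated against the schema at this point.
    LOAD DATA INPATH '/staging/events.csv' INTO TABLE raw_events;

    -- Malformed fields surface only now, typically as NULLs in the result.
    SELECT user_id, url FROM raw_events LIMIT 10;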

Transactions are key operations in traditional databases. Like any typical RDBMS, Hive supports all four properties of transactions (ACID): Atomicity, Consistency, Isolation, and Durability. Transactions were introduced in Hive 0.13 but were limited to the partition level.[26] Hive 0.14 fully added these functions to support complete ACID properties. Hive 0.14 and later provides row-level transactions such as INSERT, UPDATE, and DELETE.[27] Enabling INSERT, UPDATE, and DELETE transactions requires setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode.
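
A hedged sketch of enabling ACID tables and issuing row-level operations follows; the table definition is illustrative, and the hive.txn.manager setting (together with the requirement that transactional tables be bucketed ORC tables) is an assumption beyond the properties named above.

    -- Session settings; the first three properties are named in the text above,
    -- while hive.txn.manager is an additional assumption for ACID support.
    SET hive.support.concurrency=true;
    SET hive.enforce.bucketing=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    -- Illustrative transactional table: bucketed, stored as ORC, flagged transactional.
    CREATE TABLE accounts (
      id      BIGINT,
      balance DECIMAL(10,2)
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Row-level operations available in Hive 0.14 and later.
    INSERT INTO TABLE accounts VALUES (1, 100.00), (2, 250.00);
    UPDATE accounts SET balance = balance + 10 WHERE id = 1;
    DELETE FROM accounts WHERE id = 2;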

