APACHE HIVE

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.[3] Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.[4] While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).[5][6] Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.
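
As a minimal sketch of the SQL-like abstraction described above, the following HiveQL creates a table and runs an aggregation that would otherwise require a hand-written MapReduce job. The page_views table and its columns are illustrative only, not part of any standard schema.

    -- Hypothetical table of web page views; table and column names are illustrative.
    CREATE TABLE page_views (
      user_id   BIGINT,
      url       STRING,
      view_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- An aggregation that would otherwise be a hand-written MapReduce job;
    -- Hive compiles it into jobs for the configured execution engine.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;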

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem and Alluxio. It provides an SQL-like query language called HiveQL[8] with schema on read and transparently converts queries to MapReduce, Apache Tez[9] and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0.[10] Other features of Hive include:

  • Different storage types such as plain text, RCFile, HBase, ORC, and others.
  • Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution.
  • Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
  • Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
  • SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs (see the sketch after this list).
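
The following sketch illustrates per-session engine selection and UDF registration, assuming the chosen engine is installed on the cluster. The JAR path and the class name com.example.hive.NormalizeUrlUDF are hypothetical, and page_views is the illustrative table from the earlier example.

    -- Choose the execution engine for the session (mr, tez, or spark).
    SET hive.execution.engine=tez;

    -- Built-in functions for dates and strings.
    SELECT upper(url), to_date(view_time)
    FROM page_views;

    -- Register a custom UDF packaged in a user-supplied JAR
    -- (the path and class name are hypothetical).
    ADD JAR /tmp/my_udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';
    SELECT normalize_url(url) FROM page_views;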

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.[11]

The first four file formats supported in Hive were plain text,[12] sequence file, optimized row columnar (ORC) format[13] and RCFile.[14] Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.[15][16] Additional Hive plugins support querying of the Bitcoin blockchain.
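
As a sketch, each of these formats is selected with a STORED AS clause at table creation time; the table and column names below are illustrative.

    -- Table and column names are illustrative; each table declares its storage format.
    CREATE TABLE views_text    (user_id BIGINT, url STRING) STORED AS TEXTFILE;
    CREATE TABLE views_seq     (user_id BIGINT, url STRING) STORED AS SEQUENCEFILE;
    CREATE TABLE views_rc      (user_id BIGINT, url STRING) STORED AS RCFILE;
    CREATE TABLE views_orc     (user_id BIGINT, url STRING) STORED AS ORC;
    CREATE TABLE views_parquet (user_id BIGINT, url STRING) STORED AS PARQUET;  -- native in 0.13+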

Comparison with traditional databases

The storage and querying operations of Hive closely resemble those of traditional databases. While HiveQL is an SQL dialect, there are many differences in the structure and behavior of Hive compared with relational databases. The differences are mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.

A schema is applied to a table in traditional databases. The table typically enforces the schema when the data is loaded into it, which lets the database ensure that the data entered follows the representation of the table as specified by the table definition. This design is called schema on write. In comparison, Hive does not verify the data against the table schema on write. Instead, it performs run-time checks when the data is read. This model is called schema on read.[22] The two approaches have their own advantages and drawbacks. Checking data against the table schema at load time adds extra overhead, which is why traditional databases take longer to load data. Quality checks are performed against the data at load time to ensure that the data is not corrupt; early detection of corrupt data enables early exception handling. Because the tables are forced to match the schema during or after the data load, they have better query-time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load, but with the drawback of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at load time, but is instead generated later dynamically.[22]
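
A minimal sketch of schema on read in HiveQL, assuming an external table over a hypothetical directory of comma-delimited files; the paths and names are illustrative.

    -- Schema on read: the table definition is applied when the data is queried,
    -- not when it is loaded. The location and file path are hypothetical.
    CREATE EXTERNAL TABLE raw_events (
      user_id BIGINT,
      url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw_events';

    -- LOAD DATA only moves the file into the table's directory;
    -- nothing is validated against the schema at this point.
    LOAD DATA INPATH '/staging/events.csv' INTO TABLE raw_events;

    -- Malformed fields surface only now, typically as NULLs in the result.
    SELECT user_id, url FROM raw_events LIMIT 10;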

Transactions are key operations in traditional databases. Like any typical RDBMS, Hive supports all four properties of transactions (ACID): Atomicity, Consistency, Isolation, and Durability. Transactions were introduced in Hive 0.13 but were limited to the partition level.[26] Hive 0.14 fully added these functions to support complete ACID properties. Hive 0.14 and later provides row-level transactions such as INSERT, UPDATE, and DELETE.[27] Enabling INSERT, UPDATE, and DELETE transactions requires setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode.
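
A hedged sketch of enabling ACID tables and issuing row-level operations follows; the table definition is illustrative, and the hive.txn.manager setting (together with the requirement that transactional tables be bucketed ORC tables) is an assumption beyond the properties named above.

    -- Session settings; the first three properties are named in the text above,
    -- while hive.txn.manager is an additional assumption for ACID support.
    SET hive.support.concurrency=true;
    SET hive.enforce.bucketing=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

    -- Illustrative transactional table: bucketed, stored as ORC, flagged transactional.
    CREATE TABLE accounts (
      id      BIGINT,
      balance DECIMAL(10,2)
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Row-level operations available in Hive 0.14 and later.
    INSERT INTO TABLE accounts VALUES (1, 100.00), (2, 250.00);
    UPDATE accounts SET balance = balance + 10 WHERE id = 1;
    DELETE FROM accounts WHERE id = 2;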

