Hive

Nivedita singh

Hr Professional

Published: March 14, 2023

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Without Hive, SQL-style applications and queries over distributed data have to be implemented by hand against the MapReduce Java API. Hive provides the SQL abstraction needed to run SQL-like queries (HiveQL) on top of that Java layer, so queries need not be written in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as the Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL with schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, Hive provided indexes, but this feature was removed in version 3.0. Other features of Hive include:

Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution.
Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
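
To make the idea of converting a SQL-like query into a MapReduce job more concrete, here is a minimal Python sketch of how a grouped count might break down into map, shuffle, and reduce phases. It is an illustration only, not the plan Hive generates: the query in the comment, the `visits` table, the sample rows, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative query (hypothetical table and column names):
#   SELECT page, COUNT(*) FROM visits GROUP BY page;

# Hypothetical sample rows standing in for records read from the table.
rows = [
    {"page": "home", "user": "a"},
    {"page": "cart", "user": "b"},
    {"page": "home", "user": "c"},
]


def map_phase(records):
    """Emit one (key, 1) pair per record, keyed by the GROUP BY column."""
    for record in records:
        yield record["page"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse the values collected for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)
```

The point of the abstraction is that a user writes only the one-line query, while the three phases are produced for them.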

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.
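
As a rough sketch of what keeping table metadata in a relational database can look like, the Python example below records a table's location and column schema in an embedded database and reads them back. SQLite is used here only as a convenient stand-in for an embedded store such as Derby. The two-table layout, the `register_table` and `describe_table` helpers, and the sample `visits` table are all assumptions for illustration and do not reflect Hive's actual metastore schema.

```python
import sqlite3

# In-memory SQLite stands in for an embedded relational store (an assumption;
# this is not Derby, and the schema below is not Hive's metastore schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tables (name TEXT PRIMARY KEY, location TEXT)")
conn.execute(
    "CREATE TABLE columns ("
    "table_name TEXT, position INTEGER, col_name TEXT, col_type TEXT)"
)


def register_table(name, location, columns):
    """Record where a table's data lives and what its columns are."""
    conn.execute("INSERT INTO tables VALUES (?, ?)", (name, location))
    conn.executemany(
        "INSERT INTO columns VALUES (?, ?, ?, ?)",
        [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )


def describe_table(name):
    """Look up a table's location and ordered column list."""
    location = conn.execute(
        "SELECT location FROM tables WHERE name = ?", (name,)
    ).fetchone()[0]
    cols = conn.execute(
        "SELECT col_name, col_type FROM columns "
        "WHERE table_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    return location, cols


# Hypothetical table definition used only for this example.
register_table("visits", "/warehouse/visits", [("page", "string"), ("user", "string")])
location, schema = describe_table("visits")
print(location, schema)
```

Because the schema lives apart from the data files, a query can be checked against it without first scanning the data.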

The first four file formats supported in Hive were plain text, sequence file, optimized row columnar (ORC) format and RCFile. Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13.

Metastore: Stores metadata for each table, such as its schema and location. It also includes the partition metadata, which helps the driver track the progress of the various data sets distributed over the cluster. The data is stored in a traditional RDBMS format. Because this metadata is crucial for keeping track of the data, a backup server regularly replicates it so that it can be retrieved in case of data loss.
Driver: Acts like a controller which receives the HiveQL statements. It starts the execution of a statement by creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of a HiveQL statement. The driver also acts as a collection point for the data or query results obtained after the Reduce operation.[16]
Compiler: Compiles the HiveQL query, converting it to an execution plan. This plan contains the tasks and steps that Hadoop MapReduce needs to perform to produce the output the query asks for. The compiler first converts the query to an abstract syntax tree (AST). After checking for compatibility and compile-time errors, it converts the AST to a directed acyclic graph (DAG). The DAG divides operators into MapReduce stages and tasks based on the input query and data.
Optimizer: Performs various transformations on the execution plan to get an optimized DAG. Transformations can be aggregated together, such as converting a pipeline of joins to a single join, for better performance. It can also split tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability. However, the transformation logic used for optimization can be modified or pipelined using another optimizer. An optimizer called YSmart is a part of Apache Hive. This is a correlated optimizer, which merges correlated MapReduce jobs into a single MapReduce job, significantly reducing the execution time.
Executor: After compilation and optimization, the executor executes the tasks. It interacts with the job tracker of Hadoop to schedule tasks to be run. It takes care of pipelining the tasks by making sure that a task with dependencies is executed only after all of its prerequisites have run.
CLI, UI, and Thrift Server: A command-line interface (CLI) provides a user interface for an external user to interact with Hive by submitting queries and instructions and by monitoring the process status. The Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.
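
The executor's rule that a task runs only after its prerequisites can be sketched with a topological ordering over the plan's DAG. The Python example below uses the standard-library `graphlib` module (Python 3.9 or later). The stage names and the `run_plan` helper are hypothetical, and the sketch only records the order in which stages would run. It does not submit anything to Hadoop.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_visits": set(),
    "scan_users": set(),
    "join": {"scan_visits", "scan_users"},
    "aggregate": {"join"},
}


def run_plan(dag):
    """Return the stages in an order that respects every dependency."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    executed = []
    while sorter.is_active():
        # get_ready() yields only stages whose prerequisites are all done.
        for stage in sorter.get_ready():
            executed.append(stage)  # a real executor would launch the task here
            sorter.done(stage)
    return executed


order = run_plan(plan)
print(order)
```

The two scan stages have no dependencies, so either may come first. The join always follows both of them, and the aggregate stage always runs last.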
