Apache Hive: A Data Warehouse Solution on Hadoop

Introduction

Apache Hive is a data warehouse infrastructure built on top of Hadoop that allows users to query and analyze large datasets using HiveQL, a SQL-like language. It enables easy data summarization, ad hoc querying, and large-scale data processing by abstracting complex MapReduce, Tez, or Spark jobs into simple SQL queries.

Hive is widely used in big data analytics for handling massive datasets stored in HDFS (Hadoop Distributed File System), making it an essential tool for data engineers and analysts.


Why Apache Hive?

Before Hive, analyzing large-scale data in Hadoop meant hand-writing complex MapReduce programs in Java or Python. Hive simplifies this by letting users express the same logic as SQL-like queries instead of hand-coding jobs.
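For example, an aggregation that would take dozens of lines as a hand-written MapReduce job is a few lines of HiveQL (the sales table here is a hypothetical example):

-- Total revenue per product; Hive compiles this into MapReduce/Tez/Spark jobs.
SELECT product, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product;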


Apache Hive Architecture

Hive follows a layered architecture, consisting of multiple components:

1. Hive Clients

  • Hive CLI (the legacy command-line interface)
  • Beeline (a JDBC-based command-line client for HiveServer2)
  • JDBC/ODBC drivers for connecting BI tools and web UIs such as Hue

2. Hive Services

  • HiveServer2: Manages client connections and query execution.
  • MetaStore: Stores metadata about tables, schemas, partitions, and data locations.
  • Compiler: Converts HiveQL queries into execution plans.
  • Execution Engine: Orchestrates query execution using MapReduce, Tez, or Spark.

3. Hive Storage

  • Stores data in HDFS (Hadoop Distributed File System).
  • Supports various file formats like Text, ORC, Parquet, and Avro.


Key Features

  • SQL-Like Interface: Query big data with HiveQL, similar to traditional SQL.
  • Schema on Read: Unlike RDBMS, schema is applied at query time rather than during data ingestion.
  • Scalability: Processes petabytes of data using Hadoop's distributed architecture.
  • Supports ACID Transactions: On transactional (ORC-backed) tables, ensures Atomicity, Consistency, Isolation, and Durability in data operations (see the sketch after this list).
  • Flexible Storage: Works with HDFS, Amazon S3, and other distributed file systems.
  • Supports Partitioning & Bucketing: Enhances query performance by logically organizing data.
  • Extensibility: Custom functions can be written using User-Defined Functions (UDFs).
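A minimal sketch of a transactional table, assuming a Hive 3.x cluster with ACID support enabled (the orders table is a hypothetical example):

-- ACID operations require an ORC table flagged as transactional.
CREATE TABLE orders (
    id INT,
    status STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level UPDATE and DELETE work only on transactional tables.
UPDATE orders SET status = 'shipped' WHERE id = 101;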


Hive MetaStore: Embedded vs. External

Hive requires a MetaStore to store table metadata, schemas, partitions, and locations.

1. Embedded MetaStore

  • Uses Derby database (default)
  • Not recommended for production
  • Only allows one connection at a time
  • Suitable for testing or small-scale usage

2. External MetaStore

  • Uses MySQL, PostgreSQL, or any RDBMS
  • Supports multiple concurrent users
  • Recommended for large-scale applications
  • Allows multiple Hive sessions to share the same metadata

Best Practice: Use an external MetaStore (MySQL/PostgreSQL) in production environments.


Hive Tables: Internal vs. External

Hive supports two types of tables:

1. Internal (Managed) Tables

  • Hive owns both data & metadata.
  • Data is stored in HDFS under /user/hive/warehouse/.
  • If a table is dropped, data is also deleted.
  • Suitable for temporary or intermediate processing.

Example:

CREATE TABLE sales_data (
    id INT,
    product STRING,
    revenue FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. External Tables

  • Hive only manages metadata, not the actual data.
  • Data can be stored in HDFS, S3, or any distributed storage.
  • If a table is dropped, data remains intact.
  • Best for raw datasets or shared data sources.

Example:

CREATE EXTERNAL TABLE sales_data (
    id INT,
    product STRING,
    revenue FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://path/to/data/';

Use case: If you need to process data without modifying the original files, use an external table.
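A hedged sketch of the drop behavior described above:

-- Dropping the external table removes only its metadata; the files under
-- hdfs://path/to/data/ remain. The same statement on the managed version
-- of sales_data would also delete the data from the warehouse directory.
DROP TABLE sales_data;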


Partitioning in Hive

Partitioning improves query performance by logically dividing data into separate directories.

Types of Partitioning

Hive supports static partitioning, dynamic partitioning, and multi-level partitioning (referred to below as subpartitioning).

1. Static Partitioning

  • The partition column value is explicitly specified when inserting data.
  • Efficient when loading into a small, known set of partitions.

Example:

CREATE TABLE sales (
    id INT,
    product STRING,
    revenue FLOAT
)
PARTITIONED BY (country STRING)
STORED AS TEXTFILE;

INSERT INTO TABLE sales PARTITION (country='India')
SELECT id, product, revenue
FROM temp_sales
WHERE country='India';

  • This command manually assigns data to the partition.
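Once loaded, filtering on the partition column lets Hive scan only the matching directory (a sketch against the sales table above):

-- Partition pruning: only the country=India directory is read.
SELECT product, revenue
FROM sales
WHERE country = 'India';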

2. Dynamic Partitioning

  • The partition column value is automatically determined based on column values in the dataset.
  • Useful for bulk data ingestion.

First, enable dynamic partitioning with these Hive properties:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Example:

INSERT INTO TABLE sales PARTITION (country) 
SELECT id, product, revenue, country FROM temp_sales;        

  • The country column dynamically determines the partition.
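You can verify which partitions the dynamic insert created (assumes the sales table above):

-- Lists one row per partition directory, e.g. country=India.
SHOW PARTITIONS sales;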


When to use dynamic partitioning?

  • When data is spread across multiple partitions.
  • When inserting large datasets without manually specifying partitions.


3. Subpartitioning in Hive

Subpartitioning allows further partitioning within a partition, reducing data scanning overhead.

Example of Multi-Level Partitioning:

CREATE TABLE sales (
    id INT,
    product STRING,
    revenue FLOAT
)
PARTITIONED BY (country STRING, year INT, month INT)
STORED AS PARQUET;

Advantages of Subpartitioning:

  • Keeps each leaf partition small, so queries scan fewer files.
  • Optimizes query performance beyond single-level partitioning.
  • Ideal for time-series data, where year and month subpartitions speed up queries (see the query sketch after this list).
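A query that filters on every partition level touches only a single leaf directory (a sketch against the multi-level table above):

-- Scans only .../country=India/year=2024/month=3/
SELECT SUM(revenue)
FROM sales
WHERE country = 'India' AND year = 2024 AND month = 3;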


Bucketing in Hive

Bucketing divides a table's data (or each partition's data) into a fixed number of files (buckets), improving query performance, especially for JOIN operations.


Why Use Bucketing?

  • Reduces the need for expensive shuffling in queries.
  • Helps optimize join operations by co-locating rows with the same key in the same bucket.
  • High-cardinality columns (such as IDs) make poor partition keys because they would create too many directories; bucketing such columns instead still delivers a performance benefit.

Set these Hive properties (the dynamic-partition settings matter only if you also insert into partitions dynamically; hive.enforce.bucketing is needed on versions before Hive 2.0, where bucketing became always enforced):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.enforce.bucketing = true;

Example: Creating a Bucketed Table

CREATE TABLE customers (
    id INT,
    name STRING,
    age INT
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

Bucketing is useful when partitioning alone is insufficient, especially when dealing with large-scale joins.
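When both sides of a join are bucketed on the join key into the same number of buckets, Hive can join matching buckets directly instead of shuffling. A hedged sketch, assuming a hypothetical orders table bucketed into 4 buckets on customer_id:

-- Enables bucket map joins where the tables' bucketing allows it.
SET hive.optimize.bucketmapjoin = true;

SELECT c.name, o.revenue
FROM customers c
JOIN orders o ON c.id = o.customer_id;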


Significance of Bucketing with Partitions

Partitioning creates separate directories for each partition, whereas bucketing divides each partition into smaller buckets.

Why use Bucketing with Partitions?

  • Improves query performance on large tables.
  • Ensures even data distribution for joins and aggregations.
  • Reduces shuffle overhead in MapReduce, Tez, and Spark jobs.

Example: Partition + Bucketing

CREATE TABLE sales (
    id INT,
    product STRING,
    revenue FLOAT
)
PARTITIONED BY (year INT, month INT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

Best practice: Use partitioning for filtering queries, and bucketing for optimized joins and aggregations.


Hash Partitioning Algorithm in Hive

Bucketing in Hive is implemented using hash partitioning, where data is distributed across a fixed number of buckets based on a hash function applied to a selected column. This technique ensures even distribution of data, reducing shuffle overhead in joins and aggregations.

How Does Hash Partitioning Work?

  1. A hash function is applied to the column specified in CLUSTERED BY().
  2. The result of the hash function is divided by the total number of buckets (N).
  3. The remainder (hash(column_value) % N) determines which bucket the data goes into.

Example: For N = 4 buckets and an id column as the bucketing key:

CREATE TABLE customers (
    id INT,
    name STRING,
    age INT
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC;

  • If id = 101, hash(101) % 4 = 1, so it goes into Bucket 1.
  • If id = 203, hash(203) % 4 = 3, so it goes into Bucket 3.

This ensures data spreads evenly across buckets, optimizing query performance.
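You can reproduce this assignment in HiveQL with the built-in hash() function; pmod() keeps the remainder non-negative when a key hashes to a negative value:

-- Shows which of the 4 buckets each customer row lands in.
SELECT id, pmod(hash(id), 4) AS bucket
FROM customers;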


How to Decide the Bucket Count in Hive?

Choosing the right number of buckets is crucial for performance optimization. A poor choice can lead to data skew (uneven bucket sizes) or small files (too many buckets).


Method 1: Deciding Bucket Count from File Size and Block Size

Pick the smallest power of two that is at least the number of HDFS blocks:

Number of buckets = 2^n >= number of blocks, where n is a natural number

  • Say the file/table size is 2.5 GB and the HDFS block size is 128 MB.
  • The number of blocks is 2560 / 128 = 20.
  • 2^4 = 16 falls short of 20, but 2^5 = 32 >= 20.
  • So 32 buckets is an optimal count.
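To get the table size for this calculation, you can inspect the table parameters (totalSize appears when statistics have been gathered; this is a hedged pointer, not guaranteed on every deployment):

DESCRIBE FORMATTED sales;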


Method 2: Deciding Bucket Count for a 1-Billion-Row Table

  • If joining with another table on user_id, estimate the distinct user count, e.g., 10 million unique users (see the query sketch after this list).
  • A good starting point is around √(10 million) ≈ 3,162 buckets, typically rounded to a nearby power of two such as 4,096.
  • Ensure the other table has the same bucket count for efficient joins.
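A hedged sketch of the distinct-count estimate (events and user_id are assumed names, not tables from this article):

-- Estimate the number of distinct join keys to size the bucket count.
SELECT COUNT(DISTINCT user_id) FROM events;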


Best Practices for Choosing Buckets

  • If bucketing for joins, use the same number of buckets across tables to avoid shuffle.
  • Avoid using too few or too many buckets; start with sqrt(total distinct values).
  • Perform a data skew analysis before deciding the bucket count (see the query sketch after this list).
  • Use bucketing with partitioning when dealing with high-cardinality columns.
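A simple skew check counts rows per candidate bucketing key so that heavily repeated keys stand out (events and user_id are assumed names):

-- A few very large counts at the top signal skew.
SELECT user_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 20;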


Conclusion

Apache Hive is a powerful data warehouse tool that simplifies querying large datasets using SQL-like queries. With features like partitioning, bucketing, ACID transactions, and external tables, Hive is a preferred choice for big data analytics in Hadoop ecosystems.

