Hadoop vs Hive

Difference Between Hadoop and Hive

Hadoop is an open-source framework for storing and processing very large datasets (Big Data) across a cluster of commodity servers. It stores data in the Hadoop Distributed File System (HDFS) and processes or queries it using the MapReduce programming model.

Hive is an application that runs on top of the Hadoop framework and provides an SQL-like interface for processing and querying that data. It was designed and developed at Facebook before becoming an Apache project. Hive queries are written in HQL (Hive Query Language, also called HiveQL). Hive organizes data into databases, tables, and columns much as an RDBMS does, and many familiar SQL commands work with little or no change. Hive also supports external tables, whose data files stay at a location you specify instead of in Hive's own warehouse directory, so the data does not have to be managed by Hive or, depending on the storage configured, stored in HDFS. Supported file formats include ORC, Avro, SequenceFile, and plain text, among others.

Hadoop’s Major Components

Figure 1: Basic architecture of Hadoop's components

Hadoop Base/Common: Hadoop Common provides the shared libraries and utilities that all the other Hadoop components rely on.

HDFS (Hadoop Distributed File System): HDFS is the storage layer of the Hadoop framework. It manages all the data in the Hadoop cluster, follows a master/slave architecture, and protects the data through replication.

Master/Slave Architecture & Replication

  • Master Node/Name Node: The name node stores the metadata of each block/file stored in HDFS. A cluster has one active name node; in a high-availability (HA) setup, a second name node runs as a standby and takes over if the active one fails.
  • Slave Node/Data Node: Data nodes contain actual data files in blocks. HDFS can have multiple Data Nodes.
  • Replication: HDFS stores data by dividing each file into blocks. The default block size is 64 MB in Hadoop 1.x and 128 MB from Hadoop 2.x onward. Each block is replicated to 3 different data nodes by default (the replication factor is configurable), so the failure of a single node is very unlikely to cause data loss.
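
The block and replication behaviour described above can be sketched in a few lines of Python. This is a toy model, not how HDFS is implemented: the function names, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS uses a rack-aware placement policy), and the 128 MB block size assumes Hadoop 2.x or later.

```python
import math

BLOCK_SIZE_MB = 128   # assumed default for Hadoop 2.x+; Hadoop 1.x used 64
REPLICATION = 3       # default replication factor

def plan_blocks(file_size_mb, data_nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [data nodes holding a replica]}.

    The returned dict plays the role of the name node's metadata.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement: a simplification for this sketch only.
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement

def surviving_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a healthy node."""
    return [
        block for block, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]

nodes = ["dn1", "dn2", "dn3", "dn4"]      # hypothetical data nodes
placement = plan_blocks(300, nodes)       # a 300 MB file
raw_storage_mb = 300 * REPLICATION        # space consumed across the cluster
```

In this sketch a 300 MB file becomes three blocks, and every block remains readable even if two of the four data nodes fail, at the cost of storing three times the file's size.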

YARN (Yet Another Resource Negotiator): YARN manages the cluster's resources and schedules users' applications onto them.

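MR (MapReduce): This is the primary programming model of Hadoop. It is used to process/query the data within the Hadoop framework: a map step turns input records into key/value pairs, and a reduce step aggregates the values for each key.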
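
To make the model concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop's API, and the function names are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: collapse all the values for one key into a single total."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

counts = word_count(["Hive runs on Hadoop", "Hadoop stores data"])
```

On a real cluster the mappers and reducers run in parallel on different nodes; the sketch keeps everything in one process so the data flow is easy to follow.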

Hive’s Major Components

Figure 2: Hive’s Architecture & Its Major Components

Hive Clients: Besides its own query language, Hive supports client applications written in languages such as Java, C, and Python through its JDBC, ODBC, and Thrift drivers. An application written in any of these languages can connect through the matching driver and run its queries in Hive.

Hive Services: Commands and queries are executed under Hive services, which have five sub-components.

  • CLI: Default command line interface provided by Hive for the execution of Hive queries/commands.
  • Hive Web Interfaces: It is a simple graphical user interface. This provides an alternative to the Hive command line and enables running queries and commands within the Hive application.
  • Hive Server: It is built on Apache Thrift, so it is also called the Thrift server. It accepts commands and queries from different clients, submits them to Hive, and returns the final result.
  • Apache Hive Driver: It receives the queries that a client submits through the CLI, the web UI, or the ODBC, JDBC, or Thrift interfaces and passes them on for processing, consulting the metastore, where all the table and file information is stored.
  • Metastore: The metastore is the repository for all Hive metadata, such as table structure, partitions, and column types.
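
As a rough illustration of the metastore's role, the sketch below keeps table metadata (columns, partition keys, location) apart from the data and uses it to interpret a raw text row. Everything here is hypothetical: the `TableMeta` class, the `page_views` table, its path, and the `parse_row` helper are invented for the example and are not Hive's actual metastore API.

```python
from dataclasses import dataclass, field

@dataclass
class TableMeta:
    columns: dict                     # column name -> type name
    partition_keys: list = field(default_factory=list)
    location: str = ""                # where the data files live

# Toy in-memory metastore: table name -> metadata only, never the rows.
metastore = {
    "page_views": TableMeta(
        columns={"user_id": "int", "url": "string"},
        partition_keys=["view_date"],
        location="/warehouse/page_views",   # hypothetical path
    )
}

# Map the type names used above to Python converters.
CASTS = {"int": int, "string": str}

def parse_row(table, raw_line, delimiter=","):
    """Use the stored column names and types to interpret one text row."""
    meta = metastore[table]
    values = raw_line.split(delimiter)
    return {
        name: CASTS[type_name](value)
        for (name, type_name), value in zip(meta.columns.items(), values)
    }

row = parse_row("page_views", "42,/home")
```

The point of the sketch is the separation: the data file holds only delimited text, while the structure needed to read it lives in the metadata.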
