Big Data:

Big data refers to datasets so large, fast-moving, or varied that traditional single-machine systems cannot store or process them efficiently; instead, the data is stored and processed in parallel across many machines.

Characteristics Of Big Data:

  • Volume: The massive amount of data generated every second.
  • Velocity: The speed at which data is generated, collected, and processed.
  • Variety: The different types of data, including structured, semi-structured, and unstructured.
  • Veracity: The quality and accuracy of the data.
  • Value: The potential insights and benefits that can be derived from analyzing the data.


Volume:

The sheer amount of data generated and stored. With the advent of the internet, social media, and IoT devices, data is being produced at an unprecedented rate. Examples include the terabytes of data generated by social media platforms daily or the data collected by sensors in smart cities.

Velocity:

The speed at which data is generated, processed, and analyzed. In today's fast-paced world, data is created in real-time or near real-time, requiring quick processing to derive actionable insights. Examples include real-time stock market data, social media updates, and sensor data from connected devices.

Variety:

The different types of data available. Data comes in many forms, including structured data (like databases), semi-structured data (like XML or JSON), and unstructured data (like text, images, and videos). This diversity requires different techniques for processing and analyzing the data.

Veracity:

The trustworthiness and quality of the data. High veracity means that the data is accurate, reliable, and trustworthy, while low veracity indicates that the data might be incomplete, inconsistent, or inaccurate. Ensuring high data veracity involves cleaning and validating the data to ensure its quality before analysis.

Value:

The potential insights and benefits that can be derived from the data. The ultimate goal of big data is to extract valuable insights that can inform decision-making, drive business strategies, and create competitive advantages. This involves using advanced analytics, machine learning, and other techniques to uncover patterns, trends, and correlations in the data.


Big Data Storage:

Cluster:

When more than one system works together to store and process data, the group is called a cluster: a set of interconnected computers (nodes) that act as a single system to provide high availability, scalability, and fault tolerance. In the context of big data, clusters distribute data and processing tasks across multiple nodes, allowing large datasets to be handled efficiently.


Vertical Scaling And Horizontal Scaling:

Vertical scaling involves adding more power (CPU, RAM, storage) to an existing machine. This increases the capacity of a single node so it can handle a heavier load.

Horizontal scaling involves adding more machines (nodes) to a system, distributing the load across multiple nodes. This approach increases the overall system capacity by leveraging multiple nodes working together.


Hadoop Architecture:


HDFS: Hadoop Distributed File System

Commodity Hardware:

  • In this architecture, DataNodes run on what is referred to as commodity hardware.
  • Commodity hardware means inexpensive, readily available machines with sufficient computational power, which keeps the cluster cost-effective to scale.


HDFS Architecture:

  • HDFS (Hadoop Distributed File System) is based on a master-slave architecture.


Master Node (NameNode):

  • The master node is called the NameNode.
  • The NameNode manages metadata, which includes details about data blocks, data nodes, and racks.


Slave Nodes (DataNodes):

  • Multiple slave nodes are called DataNodes.
  • DataNodes store data in blocks.
  • Each block has a default size of 128 MB.
  • Data is stored in replicated form, with a default replication factor of 3 (both defaults appear in the sketch below).
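
As a concrete illustration, here is a minimal Java sketch that writes a file to HDFS while setting the block size and replication factor explicitly to the defaults described above. The NameNode URI and file path are hypothetical placeholders, not values from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode URI; replace with your cluster's address.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        // The defaults described above, set explicitly for illustration:
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // 3 replicas per block

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
    }
}
```

A file larger than one block is split into 128 MB chunks, and each chunk is replicated to three DataNodes, ideally spread across more than one rack.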


Data Storage and Replication:

  • Data blocks are replicated across different racks for fault tolerance (the sketch after this list shows how a client can inspect where a file's blocks landed).
  • The NameNode stores and manages metadata about blocks, DataNodes, and racks.
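
To see replication and rack placement in practice, a client can ask HDFS where the blocks of a file physically live. A minimal sketch, reusing the hypothetical URI and file path from the previous example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.Arrays;

public class BlockPlacementInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical URI

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
            // One BlockLocation per block; each lists the DataNodes that hold a replica.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("hosts: " + Arrays.toString(block.getHosts())
                    + "  racks: " + Arrays.toString(block.getTopologyPaths()));
            }
        }
    }
}
```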


Client Interaction:

  • The client first contacts the NameNode to obtain metadata about which DataNodes should hold the blocks.
  • A network switch carries the traffic between the client, the NameNode, and the DataNodes.
  • The client then streams the data over that network directly to the DataNodes; the file contents never pass through the NameNode itself.
  • After the data is stored and replicated, an acknowledgment is sent back to the client.


Here’s a step-by-step summary of the process:

  1. The client requests data storage.
  2. The NameNode returns metadata identifying the target DataNodes.
  3. The client sends the data over the network to those DataNodes.
  4. The data is stored in blocks across DataNodes and racks, with each block replicated.
  5. The storage process completes.
  6. An acknowledgment is sent back to the client.
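
From an application's point of view, this entire exchange is hidden behind Hadoop's FileSystem API: a single open() or create() call performs the NameNode lookup, and the returned stream then talks to the DataNodes directly. A minimal read-side sketch, reusing the hypothetical URI and path from the earlier examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical URI

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            // open() consults the NameNode for block locations; read() then
            // streams bytes straight from the DataNodes holding the replicas.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```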


Newer Hadoop Versions:

In newer versions of Hadoop, a Secondary NameNode is introduced to mitigate the risk of metadata loss should the primary NameNode fail. The Secondary NameNode periodically checkpoints the metadata by merging the primary NameNode's edit log into a fresh FSImage. It is not a live hot standby, but if the primary NameNode fails, the most recent checkpoint can be used to restore the metadata, minimizing downtime and helping maintain continuity of operations. This setup improves the fault tolerance and reliability of the Hadoop Distributed File System (HDFS) architecture.



Secondary NameNode and NameNode Relationship:

  • In Hadoop architecture, a Secondary NameNode is introduced to handle metadata checkpointing and backup.
  • If the primary NameNode fails, the most recent checkpoint produced by the Secondary NameNode can be used to restore the metadata, minimizing downtime and protecting data integrity.

ZooKeeper and Journal Node:

  • ZooKeeper is used in Hadoop clusters to manage and coordinate services.
  • It plays a critical role in Hadoop's high-availability setup, particularly in detecting NameNode failures and managing failover.

Journal Node Functionality:

  • JournalNodes store the EditLogs, which contain a record of every change made to the Hadoop filesystem (HDFS). The EditLogs include details such as metadata updates and filesystem operations.

FSImage Details:

  • FSImage is a snapshot of the filesystem metadata stored on the NameNode.
  • It includes the directory tree and file-to-block mappings; live DataNode state, such as current block locations, is reported by the DataNodes at runtime rather than persisted in the FSImage.


Communication Between Components:

  • In a high-availability deployment, ZooKeeper detects the failure of the active NameNode and signals the standby (secondary) NameNode to take over.
  • The standby then restores the latest state from the FSImage checkpoint together with the EditLogs stored on the JournalNodes, so it can assume the active role with minimal loss.

Ensuring High Availability:

  • The integration of ZooKeeper, Secondary NameNode, and Journal Nodes enhances fault tolerance and ensures high availability in Hadoop clusters.
  • This setup minimizes the risk of data loss and downtime, which is crucial for maintaining uninterrupted operations in large-scale data environments; a client-side configuration sketch for such an HA setup follows.
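
To make the HA wiring concrete, here is a minimal sketch of the client-side settings for an HDFS high-availability nameservice. The property keys are the standard HDFS HA configuration names; the nameservice ID ("mycluster"), host names, and ports are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal client-side configuration for an HDFS HA nameservice.
// Nameservice ID and hosts are hypothetical examples.
public class HaClientConfig {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        // Shared edit log kept on a JournalNode quorum.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        // Automatic failover coordinated through ZooKeeper.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```

With automatic failover enabled, a ZooKeeper Failover Controller on each NameNode host handles the election, and clients using this configuration transparently retry against whichever NameNode is currently active.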


