Big Data:
Big data refers to datasets so large, fast-moving, or varied that traditional single-machine methods cannot store and process them efficiently; instead, the data is stored and processed in parallel across many machines.
Characteristics Of Big Data:
Volume:
Refers to the sheer amount of data generated and stored. With the advent of the internet, social media, and IoT devices, data is being produced at an unprecedented rate. Examples include terabytes of data generated by social media platforms daily or the data collected by sensors in smart cities.
Velocity:
The speed at which data is generated, processed, and analyzed. In today's fast-paced world, data is created in real-time or near real-time, requiring quick processing to derive actionable insights. Examples include real-time stock market data, social media updates, and sensor data from connected devices.
Variety:
The different types of data available. Data comes in many forms: structured data (such as relational database tables), semi-structured data (such as XML or JSON), and unstructured data (such as text, images, and videos). Each form calls for different processing and analysis techniques; a short sketch after the Value characteristic below illustrates the three forms.
Veracity:
The trustworthiness and quality of the data. High veracity means that the data is accurate, reliable, and trustworthy, while low veracity indicates that the data might be incomplete, inconsistent, or inaccurate. Ensuring high data veracity involves cleaning and validating the data to ensure its quality before analysis.
Value:
The potential insights and benefits that can be derived from the data. The ultimate goal of big data is to extract valuable insights that can inform decision-making, drive business strategies, and create competitive advantages. This involves using advanced analytics, machine learning, and other techniques to uncover patterns, trends, and correlations in the data.
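As a quick illustration of Variety, this minimal sketch (hypothetical records, Python standard library only) handles the same kind of information in its three forms:

```python
import csv
import io
import json

# Structured: fixed columns, like a row from a relational table.
structured = "42,Alice,2024-01-15"
row = next(csv.reader(io.StringIO(structured)))
print(row)  # ['42', 'Alice', '2024-01-15']

# Semi-structured: self-describing keys, but a flexible schema.
semi_structured = '{"id": 42, "name": "Alice", "tags": ["premium"]}'
record = json.loads(semi_structured)
print(record["tags"])  # ['premium']

# Unstructured: free text; extracting fields needs custom logic (e.g. NLP).
unstructured = "Alice (customer #42) signed up on Jan 15 and loves the product."
print("42" in unstructured)  # True, but only via ad-hoc search
```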
Big Data Storage:
Cluster:
When more than one system works together to store and process data, the group is called a cluster: interconnected computers (nodes) operating as a single system to provide high availability, scalability, and fault tolerance. In the context of big data, clusters distribute data and processing tasks across multiple nodes, allowing large datasets to be handled efficiently.
Vertical Scaling And Horizontal Scaling:
Vertical scaling involves adding more power (CPU, RAM, storage) to an existing machine, increasing the capacity of a single node so it can handle a greater load.
Horizontal scaling involves adding more machines (nodes) to a system, distributing the load across them. This increases overall capacity by having multiple nodes work together, as the sketch below illustrates.
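A minimal sketch of the idea behind horizontal scaling (hypothetical node names, Python standard library only): records are hash-partitioned across the cluster, so adding a node adds capacity.

```python
import hashlib

# Hypothetical cluster: appending a node to this list is horizontal scaling.
NODES = ["node-1", "node-2", "node-3"]

def node_for(key: str, nodes: list[str]) -> str:
    """Pick the node responsible for a key via hash partitioning."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

records = ["user:alice", "user:bob", "user:carol", "user:dave"]
for key in records:
    print(key, "->", node_for(key, NODES))

# Scaling out: a fourth node takes over some of the keys.
print(node_for("user:alice", NODES + ["node-4"]))
```

Note that naive modulo hashing remaps most keys whenever the node list changes; production systems typically use consistent hashing to limit that reshuffling.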
Hadoop Architecture:
HDFS: Hadoop Distributed File System
Commodity Hardware:
HDFS is designed to run on commodity hardware: inexpensive, standard servers rather than specialized high-end machines. Because individual commodity machines fail relatively often, HDFS treats failure as the norm and builds fault tolerance into the software through replication.
HDFS Architecture:
Master Node (NameNode):
The NameNode is the master of the cluster. It stores the filesystem metadata: the directory tree, file-to-block mappings, and the locations of each block's replicas. It does not store the file data itself.
Slave Nodes (DataNodes):
DataNodes are the workers. Each DataNode stores the actual data blocks on its local disks and serves read and write requests from clients. DataNodes regularly report their status and block inventory to the NameNode.
Data Storage and Replication:
Files are split into large blocks (128 MB by default in recent Hadoop versions), and each block is replicated across multiple DataNodes (three copies by default). If a DataNode fails, the NameNode re-replicates the lost blocks from the surviving copies, so no data is lost.
Client Interaction:
Clients never read or write file data through the NameNode. They ask the NameNode for metadata (which blocks, on which DataNodes) and then transfer the data directly to or from the DataNodes.
Here’s a step-by-step summary of the process, using a write as an example:
1. The client asks the NameNode to create the file; the NameNode checks permissions and records the new file in its metadata.
2. For each block, the NameNode returns a list of DataNodes that will hold the replicas.
3. The client streams the block to the first DataNode, which forwards it to the second, which forwards it to the third, forming a replication pipeline.
4. Acknowledgements flow back up the pipeline, and the client notifies the NameNode once the file is complete.
Reads work the same way in reverse: the client asks the NameNode for block locations, then fetches each block directly from the nearest DataNode.
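From the client's point of view, this interaction looks roughly like the sketch below, which uses the third-party hdfs Python package (a WebHDFS client). The NameNode address, user name, and paths are placeholders, and it assumes WebHDFS is enabled on the cluster; the replication pipeline itself happens transparently inside the cluster.

```python
from hdfs import InsecureClient  # pip install hdfs

# Placeholder NameNode address and user; WebHDFS listens on port 9870
# by default in Hadoop 3.x.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write: the client asks the NameNode for target DataNodes, then streams
# the bytes to them; the DataNodes replicate the block among themselves.
# replication=3 requests the usual three copies per block.
client.write("/data/example.txt", data=b"hello hdfs",
             overwrite=True, replication=3)

# Read: metadata from the NameNode, bytes directly from a DataNode.
with client.read("/data/example.txt") as reader:
    print(reader.read())
```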
Newer Hadoop Versions:
The Secondary NameNode mitigates two risks: losing metadata if the NameNode fails, and an edit log that grows without bound. It periodically checkpoints the NameNode's metadata by fetching the current FSImage and edit log, merging them into an updated FSImage, and shipping the result back to the NameNode. If the NameNode fails, its metadata can be restored from the most recent checkpoint, minimizing downtime. Note that the Secondary NameNode is a checkpointing helper, not a hot standby: automatic failover in newer Hadoop versions comes from the high-availability setup described below, built on a Standby NameNode, JournalNodes, and ZooKeeper.
Secondary NameNode and NameNode Relationship:
On a schedule (by default every hour, or sooner after a configured number of edit-log transactions), the Secondary NameNode downloads the FSImage and edit log from the NameNode, replays the edits to produce a fresh FSImage, and uploads it back. This keeps the NameNode's edit log small and its restart time short.
ZooKeeper and Journal Node:
In a high-availability deployment, an Active NameNode and a Standby NameNode run side by side. Every namespace change made by the Active NameNode is written to a quorum of JournalNodes (typically three), and the Standby continuously replays those shared edits to stay in sync. ZooKeeper, through a ZKFailoverController process running alongside each NameNode, detects failure of the Active NameNode and automatically promotes the Standby.
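The failover mechanism rests on leader election in ZooKeeper. The sketch below shows the core idea with the third-party kazoo Python client; the ZooKeeper address, election path, and node name are placeholders, and real Hadoop failover is handled by the ZKFailoverController, not application code like this.

```python
from kazoo.client import KazooClient  # pip install kazoo

# Placeholder ZooKeeper quorum address and election znode path.
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

def act_as_active_namenode():
    """Runs only while this process holds leadership (is 'Active')."""
    print("I am the Active node; serving requests.")
    # ... serve until the session is lost or the process shuts down ...

# Each NameNode-like process joins the same election; ZooKeeper grants
# leadership to exactly one at a time. If the leader's session dies,
# another participant's callback fires: automatic failover.
election = zk.Election("/hypothetical/namenode-election", "node-1")
election.run(act_as_active_namenode)
```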
FSImage Details:
The FSImage is a persistent on-disk snapshot of the entire filesystem namespace: directories, files, permissions, and file-to-block mappings. Changes made after the snapshot are appended to a separate edit log rather than rewriting the FSImage every time. On startup, or at a checkpoint, the edit log is replayed on top of the FSImage to reconstruct the current state.
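To make the checkpoint mechanics concrete, here is a toy simulation (plain Python, made-up record format, no Hadoop involved) of an FSImage plus edit log and the merge a checkpoint performs:

```python
# Toy model: the FSImage is a snapshot dict, the edit log a list of changes.
fsimage = {"/data/a.txt": {"replication": 3}}
edit_log = [
    ("create", "/data/b.txt", {"replication": 3}),
    ("delete", "/data/a.txt", None),
]

def replay(image: dict, edits: list) -> dict:
    """Apply logged edits on top of a snapshot, as a checkpoint does."""
    image = dict(image)
    for op, path, meta in edits:
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)
    return image

# Checkpointing: merge edits into a new snapshot, then truncate the log.
fsimage = replay(fsimage, edit_log)
edit_log.clear()
print(fsimage)  # {'/data/b.txt': {'replication': 3}}
```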
Communication Between Components:
DataNodes send heartbeats to the NameNode every few seconds (three seconds by default) to signal that they are alive, along with periodic block reports listing the blocks they hold. If heartbeats stop for long enough, the NameNode marks the DataNode as dead and schedules re-replication of its blocks on other nodes. Clients, as described above, contact the NameNode only for metadata and exchange the data itself directly with DataNodes.
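The heartbeat pattern can be sketched in a few lines (plain Python, made-up timeout and node names; real HDFS uses a 3-second heartbeat and a roughly 10-minute dead-node timeout):

```python
import time

# Made-up timeout for the sketch.
HEARTBEAT_TIMEOUT = 10.0

last_heartbeat: dict[str, float] = {}

def receive_heartbeat(datanode: str) -> None:
    """Called whenever a DataNode checks in."""
    last_heartbeat[datanode] = time.monotonic()

def dead_nodes() -> list[str]:
    """DataNodes whose heartbeats stopped; their blocks need re-replication."""
    now = time.monotonic()
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

receive_heartbeat("datanode-1")
receive_heartbeat("datanode-2")
print(dead_nodes())                  # [] while both are fresh
last_heartbeat["datanode-2"] -= 60   # simulate a node that went silent
print(dead_nodes())                  # ['datanode-2']
```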
Ensuring High Availability:
High availability in HDFS comes from redundancy at every layer: data blocks replicated across DataNodes, the edit log shared across a quorum of JournalNodes, an Active/Standby NameNode pair, and ZooKeeper-driven automatic failover. With this setup, the failure of any single machine, including the NameNode, no longer takes the filesystem down.