HDFS

HDFS stands for Hadoop Distributed File System. It is the distributed file system at the core of Hadoop, designed to store and manage very large datasets across a cluster of machines. It is used in big data processing, where it provides fault-tolerant, high-throughput storage.
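To make the idea concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are placeholders for illustration, and in a real deployment fs.defaultFS would normally come from core-site.xml rather than being set in code.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS splits larger files into blocks automatically.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print it to stdout.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```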

Features of HDFS

  • Fault-tolerant: HDFS detects failures on individual machines and recovers from them automatically by re-replicating the affected blocks.
  • High throughput: HDFS is optimized for streaming reads and writes over very large datasets.
  • Cost-effective: HDFS can be built on commodity hardware, which is inexpensive and easily available.
  • Parallel processing: Because data is spread in blocks across many nodes, it can be processed in parallel with optimized storage.
  • Data locality: HDFS moves the processing to the data, rather than the data to the processing unit, which reduces network traffic.

Components of HDFS

  • Name Node: The central controller of HDFS. It manages the file system namespace and regulates access to files.
  • Data Node: A slave node that stores the actual data blocks on its local file system (for example ext3 or ext4).

HDFS divides files into large, fixed-size blocks (128 MB by default in current Hadoop releases) and stores each block, along with its replicas, on Data Nodes.
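The block layout of a file can be inspected through the same Java API. The sketch below reuses the placeholder NameNode address and assumes a hypothetical file at /user/demo/big.dat that already exists in the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/big.dat");       // hypothetical existing file

        FileStatus status = fs.getFileStatus(path);
        System.out.println("File length : " + status.getLen());
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication());

        // Each BlockLocation describes one block and the Data Nodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i + " on hosts: "
                    + String.join(", ", blocks[i].getHosts()));
        }

        fs.close();
    }
}
```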

Nodes: An HDFS cluster typically follows a master-slave architecture.

  1. Name Node (Master Node): Manages all the slave nodes and assigns work to them. It executes file system namespace operations such as opening, closing, and renaming files and directories. It should be deployed on reliable, high-specification hardware, not on commodity hardware.
  2. Data Node (Slave Node): The actual worker nodes, which do the real work of reading, writing, and processing data. They also perform block creation, deletion, and replication upon instruction from the master. They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes that run in the background.

  • NameNode: Runs on the master node. Stores metadata (data about data) such as file paths, the number of blocks, and block IDs. It requires a large amount of RAM because it keeps the metadata in memory for fast retrieval, i.e. to reduce seek time, although a persistent copy is also kept on disk (see the sketch after this list).
  • DataNodes: Run on the slave nodes. They require a large amount of disk space, since the actual data is stored here.
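Directory listings are a simple way to see the metadata the NameNode serves. This is a minimal sketch, reusing the placeholder NameNode address and a hypothetical /user/demo directory; the listing is answered from the NameNode's metadata without reading any block contents from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListNamespace {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);

        // Print the per-file metadata tracked by the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.printf("%s  len=%d  repl=%d  blockSize=%d%n",
                    status.getPath(), status.getLen(),
                    status.getReplication(), status.getBlockSize());
        }

        fs.close();
    }
}
```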

Terms related to HDFS:

  • Heartbeat: The signal that every DataNode continuously sends to the NameNode. If the NameNode stops receiving heartbeats from a DataNode, it considers that node dead.
  • Balancing: If a DataNode crashes, the blocks stored on it are lost, and those blocks become under-replicated compared to the rest. The NameNode then signals the DataNodes holding the surviving replicas of those blocks to replicate them again, so that the overall distribution of blocks stays balanced.
  • Replication: Carried out by the DataNodes, on instruction from the NameNode (see the sketch below the note).

Note: No two replicas of the same block are ever placed on the same DataNode.
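The replication factor of an existing file can be read and changed through the FileSystem API. This is a sketch under the same assumptions as before (placeholder NameNode address, hypothetical file path); setReplication only records the new target, and the actual copying or deletion of block replicas happens in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/big.dat");       // hypothetical existing file

        short current = fs.getFileStatus(path).getReplication();
        System.out.println("Current replication factor: " + current);

        // Ask for three replicas per block; the NameNode schedules the extra
        // copies and the DataNodes carry out the actual block replication.
        boolean accepted = fs.setReplication(path, (short) 3);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}
```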

Features:

  • Distributed data storage.
  • Splitting files into blocks reduces seek time.
  • The data is highly available, since the same block is present on multiple DataNodes.
  • Even if several DataNodes go down, the data can still be accessed, which makes HDFS highly reliable.
  • High fault tolerance.

Limitations: Although HDFS provides many features, there are some areas where it does not work well.

  • Low latency data access: Applications that need low-latency access to data, i.e. in the range of milliseconds, do not work well with HDFS, because HDFS is designed for high throughput of data even at the cost of latency.
  • Small file problem: Having lots of small files results in lots of seeks and lots of hops from one DataNode to another to retrieve each small file, which is a very inefficient data access pattern. It also inflates the NameNode's in-memory metadata, since every file and block has to be tracked there. A common mitigation is to pack many small files into one large container file, as sketched below.
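One common way to work around the small file problem is to pack many small files into a single Hadoop SequenceFile, so the NameNode tracks one large file instead of thousands of tiny ones. The sketch below is illustrative only: the local input directory and the HDFS output path are hypothetical, and the NameNode address is the same placeholder as before.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        Path output = new Path("/user/demo/packed.seq");  // hypothetical output file
        File inputDir = new File("/tmp/small-files");     // hypothetical local directory

        File[] smallFiles = inputDir.listFiles();
        if (smallFiles == null) {
            throw new IllegalStateException("Input directory not found: " + inputDir);
        }

        // Each small file becomes one (filename, contents) record in the SequenceFile.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : smallFiles) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```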
