Best Guide to Features and Design Principles of Hadoop
1. Objective
In this Hadoop tutorial we will discuss the features, characteristics, and design principles of Hadoop. To learn what Hadoop is and get an introduction to it, follow this guide.
To install and configure Hadoop, follow this installation guide.
2. Hadoop Features and Characteristics
Apache Hadoop is one of the most popular and powerful big data tools. It provides a highly reliable storage layer – HDFS, a batch processing engine – MapReduce, and a resource management layer – YARN. Below are the important features of Hadoop:
- Open source – Apache Hadoop is an open-source project, which means its source code can be modified to suit business requirements.
- Distributed Processing – Because data is stored in a distributed manner in HDFS across the cluster, it is processed in parallel on a cluster of nodes.
- Fault Tolerance – By default, Hadoop stores three replicas of each block across the cluster, and this replication factor can be changed as needed (see the configuration sketch after this list). If a node goes down, the data it held can easily be recovered from the replicas on other nodes, and the framework recovers failed nodes and tasks automatically. This is how Hadoop achieves fault tolerance.
- Reliability – Because data is replicated across the cluster, it is stored reliably on the cluster of machines despite machine failures. Even if your machine goes down, your data remains safely stored.
- High Availability – Data is highly available and accessible despite hardware failures because multiple copies of it exist. If a machine or a piece of hardware crashes, the data can still be reached through another path.
- Scalability – Hadoop is highly scalable: new hardware can easily be added to existing nodes. It also scales horizontally, meaning new nodes can be added on the fly without any downtime.
- Economic – Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no specialized machines are needed. It also provides large cost savings because it is easy to add nodes on the fly, so as requirements grow you can add nodes without downtime and without much pre-planning.
- Easy to use – The client does not need to deal with distributed computing; the framework takes care of all of that, so Hadoop is easy to use.
- Data Locality – Hadoop works on the data locality principle, which says to move the computation to the data instead of the data to the computation. When a client submits a job, the code is moved to the data in the cluster rather than bringing the data to the location where the job was submitted and processing it there (a minimal job driver illustrating this is sketched below).
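To make the replication point concrete, here is a minimal sketch using the Hadoop Java API. It is an illustration, not part of the original tutorial: it assumes a reachable HDFS cluster whose configuration files are on the classpath, and the path /user/data/sample.txt is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml / hdfs-site.xml found on the classpath.
    Configuration conf = new Configuration();

    // Cluster-wide default replication factor (3 unless overridden in hdfs-site.xml).
    System.out.println("Default replication: " + conf.get("dfs.replication", "3"));

    FileSystem fs = FileSystem.get(conf);

    // Change the replication factor of one (hypothetical) file to 2.
    // HDFS then adds or removes block replicas in the background to match.
    Path file = new Path("/user/data/sample.txt");
    boolean accepted = fs.setReplication(file, (short) 2);
    System.out.println("Replication change accepted: " + accepted);

    fs.close();
  }
}
```

The cluster-wide default is normally set through the dfs.replication property in hdfs-site.xml; setReplication only changes the factor for the given file.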
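The data locality principle is easiest to see in a MapReduce job driver: the client packages and submits only the code, and YARN then schedules map tasks on or near the DataNodes that hold the input blocks. Below is a standard word-count sketch (input and output paths come from the command line), included here only as an illustration of how the computation is shipped to the data.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: runs on (or near) the DataNode that stores its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce task: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    // Only this jar (the computation) is shipped to the cluster;
    // the input blocks stay where HDFS stored them.
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```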
3. Hadoop Assumptions
Hadoop is written with large clusters of computers in mind and is built around the following assumptions:
- Hardware may fail (since commodity hardware can be used).
- Processing will be run in batches, so the emphasis is on high throughput rather than low latency.
- Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size.
- Applications need a write-once-read-many access model (see the sketch after this list).
- Moving computation is cheaper than moving data.
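As a small illustration of the write-once-read-many model, the sketch below writes a file to HDFS once and then reads it back; existing bytes cannot be modified in place (appending to the end is the only supported mutation). This is an assumption-laden example: the path /user/data/notes.txt is hypothetical and a reachable cluster configuration on the classpath is assumed.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/data/notes.txt"); // hypothetical path

    // Write the file once. HDFS does not allow in-place updates:
    // to "modify" a file you rewrite it (or append to the end).
    try (FSDataOutputStream out = fs.create(path, /* overwrite = */ true)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back as many times as needed.
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}
```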