YARN & MapR, YARN Requirements and YARN Frameworks
Abhishek Singh


Continuing from my previous article, A " " Big Data Architecture - For Solutions Architect, this article covers YARN and the frameworks built on it, in the context of the Hadoop MRv1 and MRv2 architectures.

Hadoop was built with scalability in mind from the start, but it has gone through at least four phases of development. Before MRv1 there was Hadoop on Demand (HOD), which allowed users to access and consume shared cluster resources. MRv1 enabled users to share cluster resources at the same time. MRv2 is the latest framework, and it achieves loose coupling between the application frameworks (built on YARN), HDFS, and resource management.

The Apache Hadoop ecosystem continues to grow beyond the simple MapReduce job. Although MapReduce remains at the core of many Hadoop 1.0 tasks, the introduction of YARN has expanded the capability of a Hadoop environment beyond the basic MapReduce process. The basic structure of Hadoop with Apache Hadoop MapReduce version 1 (MRv1) can be seen in the figure below. The two core services, the Hadoop Distributed File System (HDFS) and MapReduce, form the basis for almost all Hadoop functionality. All other components are built around these services and must use MapReduce to run Hadoop jobs.

Apache Hadoop provides a basis for large-scale MapReduce processing and has spawned a Big Data ecosystem of tools, applications, and vendors. MapReduce methods enable users to focus on the problem at hand rather than on the underlying processing mechanics.


Benefits of MRv1 -

1. Lower latency

2. Improved cluster utilization due to lower overhead

3. Multi-tenancy to support multiple users on the same cluster.


Benefits of MRv2 -

1. The scalability limits that some attribute to the shared-cluster design of v1 are removed by this design.

2. Improved reliability and availability, because this design adds a separate layer above HDFS that sits in front of the underlying hardware layer.

3. This design helps implement a flexible resource model by allowing individual nodes to be configured dynamically.

 

Now let's look at the infrastructure-level readiness required when architecting YARN in a Hadoop environment. To summarize, YARN needs to meet the following requirements (Murthy et al., 2014):

 

1.     Scalability

The next-generation platform should scale horizontally to tens of thousands of nodes and concurrent applications.

 

2.     Serviceability

The next-generation platform should enable the evolution of cluster software to be completely decoupled from users’ applications.

3.     Multitenancy

The next-generation platform should support multiple tenants coexisting on the same cluster and enable fine-grained sharing of individual nodes among different tenants.

4.     Locality Awareness

The next-generation platform should support locality awareness— moving computation to the data is a major win for many applications.

 

5.     High Cluster Utilization

The next-generation platform should enable high utilization of the underlying physical resources.

6.     Secure and Auditable Operation

The next-generation platform should continue to enable secure and auditable usage of cluster resources.

7.     Reliability and Availability

The next-generation platform should provide very reliable user interaction and support high availability.

8.     Support for Programming Model Diversity

The next-generation platform should enable diverse programming models and evolve beyond just being MapReduce-centric.

9.     Flexible Resource Model

The next-generation platform should enable dynamic resource configurations on individual nodes and a flexible resource model.

10.  Backward Compatibility

The next-generation platform should maintain complete backward compatibility of existing MapReduce applications.
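Requirement 4 (locality awareness) is easier to picture with a small example. The sketch below is plain Java and does not use any YARN API; the `Locality` levels, the `Node` class, and the host and rack names are all hypothetical names chosen for illustration. It assumes a scheduler that prefers a free node holding the data, then a free node in the same rack, and only then any other node.

```java
import java.util.Arrays;
import java.util.List;

final class LocalitySketch {
    /** Illustrative locality levels, ordered from best to worst (not a YARN type). */
    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    /** Hypothetical description of a cluster node: its host name and its rack. */
    static final class Node {
        final String host;
        final String rack;

        Node(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    /** Classifies how close a candidate node is to the nearest replica of the data. */
    static Locality classify(Node candidate, List<Node> replicas) {
        Locality best = Locality.OFF_SWITCH;
        for (Node replica : replicas) {
            if (replica.host.equals(candidate.host)) {
                return Locality.NODE_LOCAL; // the data is on this very node
            }
            if (replica.rack.equals(candidate.rack)) {
                best = Locality.RACK_LOCAL; // the data is one switch hop away
            }
        }
        return best;
    }

    /** Picks the free node with the best locality, or null if no node is free. */
    static Node pick(List<Node> freeNodes, List<Node> replicas) {
        Node chosen = null;
        Locality chosenLevel = null;
        for (Node node : freeNodes) {
            Locality level = classify(node, replicas);
            // Enum order encodes preference: a smaller ordinal means closer to the data.
            if (chosenLevel == null || level.compareTo(chosenLevel) < 0) {
                chosen = node;
                chosenLevel = level;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> replicas = Arrays.asList(new Node("host-a", "rack-1"));
        List<Node> free = Arrays.asList(
                new Node("host-c", "rack-2"),
                new Node("host-b", "rack-1"));
        System.out.println(pick(free, replicas).host); // prints "host-b"
    }
}
```

Here `host-a` holds the data but is busy, so the sketch falls back to `host-b` in the same rack rather than `host-c` in another rack.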

 Apache Tez

Of the frameworks covered here, Apache Tez exploits YARN the most. Many Hadoop jobs are really a directed acyclic graph (DAG) of tasks that run as separate MapReduce stages. Tez spreads these tasks across stages and allows them to run as a single job. The output of one reduce stage can be fed directly into another reduce stage without a pass-through map task, which results in faster processing. This also moves Hadoop from purely batch processing toward query-level workloads. Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache Pig.
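The idea of feeding one reduce stage directly into another can be sketched in plain Java. This is not the Tez API; the class and method names (`StageDagSketch`, `countWords`, `histogram`, `runAsSingleJob`) are hypothetical and only illustrate the data flow, assuming a word count followed by a second aggregation over the counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class StageDagSketch {
    /** Map plus first reduce stage: split lines into words and count each word. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    /** Second reduce stage: for each count, how many distinct words have that count. */
    static Map<Long, Long> histogram(Map<String, Long> wordCounts) {
        return wordCounts.values().stream()
                .collect(Collectors.groupingBy(count -> count, TreeMap::new, Collectors.counting()));
    }

    /**
     * Runs both stages as one pipeline. The second reduce consumes the output of
     * the first directly; there is no intermediate job and no identity map step.
     */
    static Map<Long, Long> runAsSingleJob(List<String> lines) {
        return histogram(countWords(lines));
    }

    public static void main(String[] args) {
        // Word counts are a=2, b=2, c=1, so one word occurs once and two words occur twice.
        System.out.println(runAsSingleJob(Arrays.asList("a b a", "b c"))); // prints {1=1, 2=2}
    }
}
```

With classic MapReduce, the second aggregation would typically be a second job whose map step only passes records through; the sketch shows the shape of the pipeline once that step is dropped.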

For more information, see https://tez.incubator.apache.org/, https://hortonworks.com/hadoop/tez/, and https://hortonworks.com/labs/stinger/.


Apache Giraph

Apache Giraph is a highly scalable, iterative graph processing system. It is based on Google’s Pregel, which is used for calculating PageRank for websites. Facebook, LinkedIn, and Twitter use it to create social graphs of users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of computation. Giraph was originally written to run on standard Hadoop MRv1, but that approach was inefficient. The native Giraph implementation under YARN provides the user with an iterative processing model not directly available with MapReduce.

 


The figure above illustrates an execution of a single-source shortest paths algorithm in Giraph. The input is a chain graph with three vertices (black) and two edges (green). The values of the edges are 1 and 3 respectively. The algorithm computes distances from the leftmost vertex. The initial values of the vertices are 0, ∞ and ∞ (top row). Distance upper bounds are sent as messages (blue), resulting in updates to vertex values (successive rows going down). The execution lasts three supersteps (separated by red lines).

In other words, the distance bounds a vertex sends in superstep t are applied by its neighbors in superstep t+1, and the shortest distance found so far is carried forward from one iteration to the next.
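The execution described above can be reproduced with a small simulation. The sketch below is plain Java and does not use the Giraph API; `BspShortestPathsSketch`, `Edge`, and `Result` are hypothetical names. It assumes the two edges of the chain are directed from left to right and that the computation stops once a superstep sends no messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BspShortestPathsSketch {
    /** Hypothetical directed edge with a weight. */
    static final class Edge {
        final int target;
        final double weight;

        Edge(int target, double weight) {
            this.target = target;
            this.weight = weight;
        }
    }

    /** Final vertex values plus the number of supersteps that were executed. */
    static final class Result {
        final double[] distances;
        final int supersteps;

        Result(double[] distances, int supersteps) {
            this.distances = distances;
            this.supersteps = supersteps;
        }
    }

    private static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    static Result run(List<List<Edge>> outEdges, int source) {
        int n = outEdges.size();
        double[] dist = new double[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY); // initial values: 0, ∞, ∞, ...
        dist[source] = 0.0;

        List<List<Double>> inbox = emptyBoxes(n); // messages visible in the current superstep
        int superstep = 0;
        boolean messagesInFlight = true;
        while (messagesInFlight) {
            List<List<Double>> outbox = emptyBoxes(n); // delivered only in the next superstep
            messagesInFlight = false;
            for (int v = 0; v < n; v++) {
                // The source announces its distance in superstep 0; every other
                // vertex only reacts to incoming distance upper bounds.
                boolean changed = (superstep == 0 && v == source);
                for (double bound : inbox.get(v)) {
                    if (bound < dist[v]) {
                        dist[v] = bound;
                        changed = true;
                    }
                }
                if (changed) {
                    for (Edge e : outEdges.get(v)) {
                        outbox.get(e.target).add(dist[v] + e.weight);
                        messagesInFlight = true;
                    }
                }
            }
            inbox = outbox; // synchronization barrier between supersteps
            superstep++;
        }
        return new Result(dist, superstep);
    }

    public static void main(String[] args) {
        // Chain graph from the figure: v0 -(1)-> v1 -(3)-> v2.
        List<List<Edge>> chain = Arrays.asList(
                Arrays.asList(new Edge(1, 1.0)),
                Arrays.asList(new Edge(2, 3.0)),
                new ArrayList<Edge>());
        Result r = run(chain, 0);
        System.out.println(Arrays.toString(r.distances) + " in " + r.supersteps + " supersteps");
        // prints "[0.0, 1.0, 4.0] in 3 supersteps"
    }
}
```

Messages written to `outbox` only become visible after the barrier, which is what makes this a BSP computation rather than an ordinary loop over edges.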

 

An example ---

………
// Input format for the shortest paths example.
// Type parameters: vertex id, vertex value, edge value.
public static class SimpleShortestPathsVertexInputFormat extends
        TextVertexInputFormat<LongWritable, DoubleWritable, FloatWritable> {
    @Override
    public VertexReader<LongWritable, DoubleWritable, FloatWritable>
            createVertexReader(InputSplit split,
                               TaskAttemptContext context)
                               throws IOException {
        // Wrap the text record reader in the example's own vertex reader.
        return new SimpleShortestPathsVertexReader(
            textInputFormat.createRecordReader(split, context));
    }
}
………………………………

 

For more information, see https://giraph.apache.org/.


 Hadoop MapReduce

MapReduce was the first YARN framework and shaped many of YARN's requirements. It does well in many cases, with minor exceptions.

The two-step process for MapReduce is given below:

1. The mapper ingests the input data and emits <key, value> pairs.

2. The reducer then aggregates the map output for each key as part of the batch job.

This is simple, but chaining several reduce functions requires creating separate jobs, unlike with Apache Tez.
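The two steps above can be sketched in plain Java with a word count. This is not the Hadoop MapReduce API; `MapReduceSketch` and its methods are hypothetical names, and the explicit `shuffle` step is added here only to show how map output is grouped by key before it reaches the reducer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceSketch {
    /** Map step: emit a <word, 1> pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle: group all emitted values by key, sorted by key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: aggregate all values that share one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map, shuffle, and reduce over the input lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("yarn tez yarn", "tez giraph")));
        // prints {giraph=1, tez=2, yarn=2}
    }
}
```

A second aggregation over this result would need another full pass through `run`-style map, shuffle, and reduce steps, which is the extra job mentioned above.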

 Hoya: HBase on YARN

Hoya creates dynamic and elastic Apache HBase clusters on top of YARN. It requires a client that sets up the HBase cluster XML files and then asks YARN to create an ApplicationMaster. YARN copies the required files onto the selected server and executes the command to start the Hoya ApplicationMaster. Hoya starts an HBase Master on the local machine and, in parallel, requests from YARN a number of containers matching the number of HBase region servers it needs. Hoya then provides the commands to start the region servers, and YARN does the rest. From there HBase takes over: the HBase processes find each other through Apache ZooKeeper, so neither Hoya nor YARN gets involved.

For more information, see https://hortonworks.com/blog/introducing-hoyahbase-on-yarn/.


Conclusion:

We have compared the two MapReduce models, MRv1 and MRv2, and seen their pros and cons. Tez, Hoya, Giraph, and MapReduce all have advantages and disadvantages, but Apache Tez is so far the most efficient model on YARN, and it is fast.

 

 
