YARN & MapR, YARN Requirements and YARN Frameworks

Abhishek Singh

Senior Engineering Manager DevOps @ Razorpay | Cloud Infrastructure, Automation

Published: December 31, 2017

Continuing from my previous article, A " " Big Data Architecture - For Solutions Architect, this article covers YARN and its various frameworks in the context of the Hadoop MRv1 and MRv2 architectures.

Hadoop was always built with scalability in mind, but it has gone through at least four phases of development. Before MRv1 there was Hadoop on Demand (HOD), which allowed users to access and consume shared cluster resources. MRv1 enabled multiple users to share cluster resources at the same time. MRv2 is the latest framework; it achieves loose coupling between the application frameworks (built on YARN), HDFS, and resource management.

The Apache Hadoop ecosystem continues to grow beyond the simple MapReduce job. Although MapReduce remains at the core of many Hadoop 1.0 tasks, the introduction of YARN has expanded the capability of a Hadoop environment to move beyond the basic MapReduce process.

The basic structure of Hadoop with Apache Hadoop MapReduce version 1 (MRv1) can be seen in the figure below. The two core services, the Hadoop Distributed File System (HDFS) and MapReduce, form the basis for almost all Hadoop functionality. All other components are built around these services and must use MapReduce to run Hadoop jobs. Apache Hadoop provides a basis for large-scale MapReduce processing and has spawned a Big Data ecosystem of tools, applications, and vendors. MapReduce methods let users focus on the problem at hand rather than the underlying processing mechanics, but requiring every job to go through MapReduce limits the kinds of problems that fit the framework.

Benefits of MRv1:

1. Lower latency.

2. Improved cluster utilization due to lower overhead.

3. Multi-tenancy to support multiple users on the same cluster.

Benefits of MRv2:

1. It removes the scalability problem that, some would argue, the v1 design had because of its shared cluster.

2. Improved reliability and availability, because this design adds another layer above HDFS that sits in front of the underlying hardware layer.

3. It helps implement a flexible resource model by dynamically configuring individual nodes.

Now let's look at the basic infrastructure-level readiness required when architecting YARN in a Hadoop environment. To summarize the requirements for YARN, we need the following features (Murthy et al., 2014):

1. Scalability

The next-generation platform should scale horizontally to tens of thousands of nodes and concurrent applications.

2. Serviceability

The next-generation platform should enable the evolution of cluster software to be completely decoupled from users' applications.

3. Multitenancy

The next-generation platform should support multiple tenants coexisting on the same cluster and enable fine-grained sharing of individual nodes among different tenants.

4. Locality Awareness

The next-generation platform should support locality awareness: moving computation to the data is a major win for many applications.

5. High Cluster Utilization

The next-generation platform should enable high utilization of the underlying physical resources.

6. Secure and Auditable Operation

The next-generation platform should continue to enable secure and auditable usage of cluster resources.

7. Reliability and Availability

The next-generation platform should offer very reliable user interaction and support high availability.

8. Support for Programming Model Diversity

The next-generation platform should enable diverse programming models and evolve beyond just being MapReduce-centric.

9. Flexible Resource Model

The next-generation platform should enable dynamic resource configurations on individual nodes and a flexible resource model.

10. Backward Compatibility

The next-generation platform should maintain complete backward compatibility of existing MapReduce applications.

Apache Tez

Apache Tez arguably makes the most of the YARN framework. Many Hadoop jobs are really a directed acyclic graph (DAG) of tasks that are implemented as separate MapReduce stages. Tez lets these tasks, spread across stages, run as a single job. The output of one reduce stage can be fed directly into another reduce stage without a pass-through map task, which results in faster processing. This also opens a new direction, moving from processing jobs in batches toward query-level features. Tez can be used as a MapReduce replacement for projects such as Apache Hive and Apache Pig.
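To make the reduce-feeds-reduce idea concrete, here is a minimal in-memory sketch in plain Java. It does not use the Tez API; the class name ReduceChainSketch and its methods are hypothetical and exist only for illustration. It counts words (map, then a first reduce) and then counts how many distinct words share each frequency (a second reduce), with the second reduce consuming the first reduce's output directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of a map -> reduce -> reduce dataflow; not the Tez API. */
class ReduceChainSketch {

    /** Map stage: split each input line into words. */
    static List<String> mapStage(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    /** First reduce stage: count the occurrences of each word. */
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    /** Second reduce stage: count how many distinct words occur n times. */
    static Map<Integer, Integer> countFrequencies(Map<String, Integer> counts) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int n : counts.values()) {
            histogram.merge(n, 1, Integer::sum);
        }
        return histogram;
    }

    /** All stages run as one "job": each stage's output feeds the next stage directly. */
    static Map<Integer, Integer> runAsSingleJob(List<String> lines) {
        return countFrequencies(countWords(mapStage(lines)));
    }
}
```

With classic MapReduce, the second reduce would normally be a second job with its own map phase that simply passes the data through. In the sketch, the stages are simply chained.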

For more information, see https://tez.incubator.apache.org/, https://hortonworks.com/hadoop/tez/, and https://hortonworks.com/labs/stinger/.

Apache Giraph

Apache Giraph is a highly scalable, iterative graph processing system. It is based on Google's Pregel, which is used for computations such as the PageRank of websites. Facebook, LinkedIn, and Twitter use it to create social graphs of users. Both Giraph and Pregel are based on the Bulk Synchronous Parallel (BSP) model of computation. Giraph was originally written on standard Hadoop MRv1, but that approach was inefficient. The native Giraph implementation under YARN provides the user with an iterative processing model that is not directly available with MapReduce.

The figure above illustrates the execution of a single-source shortest paths algorithm in Giraph. The input is a chain graph with three vertices (black) and two edges (green). The values of the edges are 1 and 3 respectively. The algorithm computes distances from the leftmost vertex. The initial values of the vertices are 0, ∞ and ∞ (top row). Distance upper bounds are sent as messages (blue), resulting in updates to vertex values (successive rows going down). The execution lasts three supersteps (separated by red lines).

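In each superstep t, a vertex computes its shortest known path from the bounds it has received; any bounds it sends are delivered in superstep t+1, and the process repeats iteratively until no vertex value changes.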
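The superstep behaviour described above can be sketched as a small simulation in plain Java. This is not the Giraph API: the class BspShortestPathsSketch and its nested types are hypothetical, and the edges are assumed to be directed from left to right. As a simplification, the source adopts its distance of 0 from a seed message in the first superstep instead of starting with it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java simulation of BSP-style shortest paths; not the Giraph API. */
class BspShortestPathsSketch {

    /** Directed, weighted edge to a target vertex. */
    static final class Edge {
        final int target;
        final double weight;

        Edge(int target, double weight) {
            this.target = target;
            this.weight = weight;
        }
    }

    /** Final vertex values plus the number of supersteps that were executed. */
    static final class Result {
        final double[] distances;
        final int supersteps;

        Result(double[] distances, int supersteps) {
            this.distances = distances;
            this.supersteps = supersteps;
        }
    }

    /** The chain graph from the text: v0 -(1)-> v1 -(3)-> v2 (edge direction is assumed). */
    static List<List<Edge>> chainGraph() {
        List<List<Edge>> graph = new ArrayList<>();
        graph.add(Arrays.asList(new Edge(1, 1.0)));
        graph.add(Arrays.asList(new Edge(2, 3.0)));
        graph.add(new ArrayList<>());
        return graph;
    }

    static Result run(List<List<Edge>> graph, int source) {
        int n = graph.size();
        double[] value = new double[n];
        Arrays.fill(value, Double.POSITIVE_INFINITY);

        // inbox[v] holds the smallest distance bound delivered to v for the current superstep.
        // The source is seeded with a bound of 0, which it adopts in the first superstep.
        double[] inbox = new double[n];
        Arrays.fill(inbox, Double.POSITIVE_INFINITY);
        inbox[source] = 0.0;

        int supersteps = 0;
        boolean messagesSent = true;
        while (messagesSent) {
            messagesSent = false;
            double[] outbox = new double[n];
            Arrays.fill(outbox, Double.POSITIVE_INFINITY);
            for (int v = 0; v < n; v++) {
                // A vertex only acts when a received bound improves its current value.
                if (inbox[v] < value[v]) {
                    value[v] = inbox[v];
                    for (Edge e : graph.get(v)) {
                        outbox[e.target] = Math.min(outbox[e.target], value[v] + e.weight);
                        messagesSent = true;
                    }
                }
            }
            // Synchronization barrier: messages sent now are visible only in the next superstep.
            inbox = outbox;
            supersteps++;
        }
        return new Result(value, supersteps);
    }
}
```

Calling BspShortestPathsSketch.run(BspShortestPathsSketch.chainGraph(), 0) yields the distances 0, 1 and 4 after three supersteps, which matches the figure's description under the directed-edge assumption.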

An example:

    ...

    public static class SimpleShortestPathsVertexInputFormat extends
            TextVertexInputFormat<LongWritable, DoubleWritable, FloatWritable> {
        // Type parameters: vertex ID, vertex value, edge value.
        @Override
        public VertexReader<LongWritable, DoubleWritable, FloatWritable>
                createVertexReader(InputSplit split, TaskAttemptContext context)
                throws IOException {
            // Wrap the plain-text record reader for this split in the example's vertex reader.
            return new SimpleShortestPathsVertexReader(
                    textInputFormat.createRecordReader(split, context));
        }

    ...

For more information, see https://giraph.apache.org/.

Hadoop MapReduce

As we all know, MapReduce was the first YARN framework, and it fulfilled many of the requirements listed above. It is quite capable and performs well, with minor exceptions.

The two-step process for MapReduce is as follows:

1. The data is ingested by the "mapper", which outputs <key, value> pairs.

2. The "reducer" then runs as a batch job and aggregates the map output for each key.

This looks neat, but chaining several reduce functions requires a separate job for each one, unlike with Apache Tez.
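The two steps can be illustrated with a small in-memory word count in plain Java. This is a sketch of the programming model only, not the Hadoop MapReduce API; the class MapReduceSketch and its methods are hypothetical. The grouping of pairs by key between the two steps is what the framework normally does on the user's behalf.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map and reduce steps; not the Hadoop API. */
class MapReduceSketch {

    /** Map step: emit a <word, 1> pair for every word in the input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle: group all emitted values by key, as the framework does between the steps. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: aggregate the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Run map over every line, group by key, then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }
}
```

For the input lines "to be or" and "not to be", run returns the counts be=2, not=1, or=1, to=2.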

Hoya: HBase on YARN

Hoya creates dynamic and elastic Apache HBase clusters on top of YARN. A client sets up the HBase cluster XML files and then asks YARN to create an ApplicationMaster. YARN copies the required information onto the selected server and executes the command to start the ApplicationMaster. Hoya then starts the HBase Master on the local machine and, in parallel, asks YARN for containers matching the number of HBase region servers it needs. Hoya provides the commands to start the region servers, and YARN does the rest. This works because Apache ZooKeeper is responsible for letting the HBase processes find each other, so neither Hoya nor YARN gets involved in that.

For more information, see https://hortonworks.com/blog/introducing-hoyahbase-on-yarn/.

Conclusion:

We have compared the MapReduce models MRv1 and MRv2 and looked at the pros and cons of each. Tez, Hoya, Giraph and MapReduce all have advantages and disadvantages, but in my view Apache Tez is so far the most efficient model on YARN, and it is fast.
