Cloudera - Replacing the MapReduce execution framework with Spark - A potential misstep!!!
Durga Gadiraju
Founder @ ITVersity | GVP Data and Analytics @ Infolob | Cloud Transformation, Data Services | Agentic AI Evangelist & Thought Leader
Cloudera is seriously pursuing replacing the MapReduce execution framework with Spark. Having been in the industry for more than a decade, with 4+ years of experience in Big Data, I feel this could be a serious misstep.
Cloudera is already recommending high-end hardware over the commodity hardware that allowed Big Data to gain traction in small and mid-size enterprises over the last four years. Lately, in pursuit of memory-intensive technologies like Impala and Spark, Cloudera has been recommending enterprise-grade hardware instead of commodity hardware.
For most applications:
* Performance is not the only criterion, yet Big Data vendors increasingly push their products in the name of performance alone. Apache Spark is being promoted on the same basis.
* Many of the use cases for which Hadoop is considered could be handled by legacy enterprise technologies like Oracle or Teradata, but most enterprises want to move to Hadoop for cost effectiveness (which includes the cost of hardware and maintenance). The question is: does it matter if my ETL process takes a few hours to load the recommendation-engine database instead of a few minutes on these memory-intensive, expensive platforms?
* Re-engineering is not cheap. Many enterprises have struggled to move their use cases to Hadoop; now they would have to re-engineer again, from hardware all the way up to software.
* Neither functional programming languages nor SQL-based interfaces are an easy transition for individuals. Explaining subtle differences such as Hive's sort by versus order by is itself a big challenge (see the sketch after this list).
* Skill development - training from most of the Big Data vendors is itself flawed; it is either too generic or too specific.
* Most application issues cannot be solved by the power of software (such as running in memory with Spark) or by expensive hardware. They are solved by proper design that understands the nuances of whatever technology we have chosen to use.
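To make the point about subtle semantics concrete, here is a minimal Python sketch (not from the article; the data and reducer count are made up) that simulates why Hive's SORT BY, which only orders rows within each reducer, can surprise someone expecting the total ordering that ORDER BY provides through a single reducer.

```python
# Simulate Hive's ORDER BY vs SORT BY with plain Python lists standing in
# for reducers. The rows and the number of reducers are illustrative only.

rows = [("u7", 42), ("u1", 13), ("u4", 99), ("u2", 7), ("u9", 55), ("u3", 21)]

# ORDER BY value: Hive funnels all rows through a single reducer,
# so the final output is globally sorted.
order_by_result = sorted(rows, key=lambda r: r[1])

# SORT BY value: rows are first partitioned across reducers,
# then each reducer sorts only its own partition.
num_reducers = 2
partitions = [[] for _ in range(num_reducers)]
for row in rows:
    partitions[hash(row[0]) % num_reducers].append(row)
sort_by_result = [sorted(p, key=lambda r: r[1]) for p in partitions]

print("ORDER BY (total order):", order_by_result)
print("SORT BY  (per-reducer order only):", sort_by_result)
```

Running this shows one globally sorted list for ORDER BY, but two independently sorted partitions for SORT BY; explaining that distinction to newcomers is exactly the kind of training burden described above.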
Open source software is not cheap or free; it carries many hidden costs unless it solves a real business problem. In my view, by preferring Spark over the MapReduce framework, Cloudera is pushing a solution to a problem that does not exist for most enterprises. The same happened with YARN: most companies do not have thousands of nodes in their Hadoop cluster, yet it became the default and forced users to upgrade their existing clusters.
A lightweight open source product that can perform ETL between any source and any target, leveraging the capabilities of the underlying technology, is the need of the hour.
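As a rough illustration of the kind of tool I have in mind, here is a minimal Python sketch. All names, the config format, and the connectors are assumptions for illustration, not an existing product: sources and targets are registered as pluggable connectors, and the core simply wires a source to a target.

```python
# Hypothetical sketch of a lightweight, pluggable ETL step: read from any
# registered source, apply a transform, write to any registered target.
import csv
import json
from typing import Callable, Dict, Iterable

Record = Dict[str, str]

def read_csv(path: str) -> Iterable[Record]:
    """Stream records from a CSV file, one dict per row."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def write_jsonl(records: Iterable[Record], path: str) -> None:
    """Write records as one JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Connector registries: new sources/targets plug in without touching the core.
SOURCES: Dict[str, Callable[[str], Iterable[Record]]] = {"csv": read_csv}
TARGETS: Dict[str, Callable[[Iterable[Record], str], None]] = {"jsonl": write_jsonl}

def run_etl(config: Dict[str, str], transform: Callable[[Record], Record]) -> None:
    """Wire the configured source to the configured target through a transform."""
    records = SOURCES[config["source_type"]](config["source_path"])
    TARGETS[config["target_type"]](map(transform, records), config["target_path"])

if __name__ == "__main__":
    # Build a tiny sample input so the example runs end to end.
    with open("orders.csv", "w", newline="") as f:
        f.write("order_id,amount\n1,25\n2,40\n")
    run_etl(
        {"source_type": "csv", "source_path": "orders.csv",
         "target_type": "jsonl", "target_path": "orders.jsonl"},
        transform=lambda rec: {**rec, "amount_cents": str(int(rec["amount"]) * 100)},
    )
```

The point of the design is that the heavy lifting stays in whatever technology backs each connector; the tool itself remains a thin layer of configuration and wiring rather than another memory-hungry execution engine.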
Please subscribe to my channel to learn more about Big Data, Hadoop, Cloud, etc.: https://www.youtube.com/channel/UCakdSIPsJqiOLqylgoYmwQg
Comments:
Solution Architect | Data Engineer | AI Engineer (9 years ago): Do you have a link from Cloudera that substantiates this statement? Thanks.
Vice President of Engineering at Salesforce, working on Machine Learning and Gen AI (9 years ago): Spark also uses map reduce.