Cloudera - Replacing the MapReduce execution framework with Spark - A potential misstep!!!
Durga Gadiraju
Founder @ ITVersity | GVP Data and Analytics @ Infolob | Cloud Transformation, Data Services | Agentic AI Evangelist & Thought Leader
Cloudera is seriously pursuing replacing the MapReduce execution framework with Spark. Having been in the industry for more than a decade, with 4+ years of experience in Big Data, I feel this could be a serious misstep.
Cloudera is already recommending high-end hardware over the commodity hardware that allowed Big Data to gain traction in small and mid-size enterprises over the last four years. Lately, in pursuit of memory-intensive technologies like Impala and Spark, Cloudera has been recommending enterprise-grade hardware instead of commodity hardware.
For most applications:
* Performance is not the only criterion, yet Big Data vendors increasingly push their products in the name of performance alone. Apache Spark is being promoted on the same basis.
* Many of the use cases for which Hadoop is considered could be handled by legacy enterprise technologies like Oracle or Teradata, but most enterprises want to move to Hadoop for cost effectiveness (which includes the cost of hardware and maintenance). The question is: does it matter if my ETL process takes a few hours to load the recommendation-engine database instead of a few minutes on these memory-intensive, expensive platforms?
* Re-engineering is not cheap. Many enterprises have struggled to move their use cases to Hadoop; now they would have to re-engineer again, from hardware all the way up to software.
* Neither functional programming languages nor SQL-based interfaces are an easy transition for individuals. Explaining subtle differences such as Hive's sort by versus order by is itself a big challenge (see the sketch after this list).
* Skill development - training from most of the Big Data vendors is itself flawed; it is either too generic or too specific.
* Most application issues cannot be solved by the power of software (such as running in memory with Spark) or by expensive hardware. They are solved by proper design that understands the nuances of whatever technology we have chosen to use.
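To make the point about subtle semantics concrete, here is a minimal Python sketch (not from the article; the data and reducer count are made up) that simulates why Hive's SORT BY, which only orders rows within each reducer, can surprise someone expecting the total ordering that ORDER BY provides through a single reducer.

```python
# Simulate Hive's ORDER BY vs SORT BY with plain Python lists standing in
# for reducers. The rows and the number of reducers are illustrative only.

rows = [("u7", 42), ("u1", 13), ("u4", 99), ("u2", 7), ("u9", 55), ("u3", 21)]

# ORDER BY value: Hive funnels all rows through a single reducer,
# so the final output is globally sorted.
order_by_result = sorted(rows, key=lambda r: r[1])

# SORT BY value: rows are first partitioned across reducers,
# then each reducer sorts only its own partition.
num_reducers = 2
partitions = [[] for _ in range(num_reducers)]
for row in rows:
    partitions[hash(row[0]) % num_reducers].append(row)
sort_by_result = [sorted(p, key=lambda r: r[1]) for p in partitions]

print("ORDER BY (total order):", order_by_result)
print("SORT BY  (per-reducer order only):", sort_by_result)
```

Running this shows one globally sorted list for ORDER BY, but two independently sorted partitions for SORT BY; explaining that distinction to newcomers is exactly the kind of training burden described above.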
Open source software is not cheap or free; it carries many hidden costs unless it solves a real business problem. In my view, by preferring Spark over the MapReduce framework, Cloudera is pushing a solution to a problem that does not exist for most enterprises. The same happened with YARN: most companies do not have thousands of nodes in their Hadoop cluster, yet it became the default and forced users to upgrade their existing clusters.
A lightweight open source product that can perform ETL between any source and any target, leveraging the capabilities of the underlying technology, is the need of the hour.
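As a rough illustration of the kind of tool I have in mind, here is a minimal Python sketch. All names, the config format, and the connectors are assumptions for illustration, not an existing product: sources and targets are registered as pluggable connectors, and the core simply wires a source to a target.

```python
# Hypothetical sketch of a lightweight, pluggable ETL step: read from any
# registered source, apply a transform, write to any registered target.
import csv
import json
from typing import Callable, Dict, Iterable

Record = Dict[str, str]

def read_csv(path: str) -> Iterable[Record]:
    """Stream records from a CSV file, one dict per row."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def write_jsonl(records: Iterable[Record], path: str) -> None:
    """Write records as one JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Connector registries: new sources/targets plug in without touching the core.
SOURCES: Dict[str, Callable[[str], Iterable[Record]]] = {"csv": read_csv}
TARGETS: Dict[str, Callable[[Iterable[Record], str], None]] = {"jsonl": write_jsonl}

def run_etl(config: Dict[str, str], transform: Callable[[Record], Record]) -> None:
    """Wire the configured source to the configured target through a transform."""
    records = SOURCES[config["source_type"]](config["source_path"])
    TARGETS[config["target_type"]](map(transform, records), config["target_path"])

if __name__ == "__main__":
    # Build a tiny sample input so the example runs end to end.
    with open("orders.csv", "w", newline="") as f:
        f.write("order_id,amount\n1,25\n2,40\n")
    run_etl(
        {"source_type": "csv", "source_path": "orders.csv",
         "target_type": "jsonl", "target_path": "orders.jsonl"},
        transform=lambda rec: {**rec, "amount_cents": str(int(rec["amount"]) * 100)},
    )
```

The point of the design is that the heavy lifting stays in whatever technology backs each connector; the tool itself remains a thin layer of configuration and wiring rather than another memory-hungry execution engine.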
Please subscribe to my channel to learn more about Big Data, Hadoop, Cloud, etc.: https://www.youtube.com/channel/UCakdSIPsJqiOLqylgoYmwQg
Comments:
Solution Architect | Data Engineer | AI Engineer (9 years ago): Do you have a link from Cloudera that substantiates this statement? Thanks.
Vice President of Engineering at Salesforce, working on Machine Learning and Gen AI (9 years ago): Spark also uses map reduce.