ELEPHANT WORKS BETTER WHEN SPARKED
Arun Pandey
It might be another few years before analytics is everywhere, but even today it is picking up really well. It is becoming one of the core tools for deriving a competitive advantage rather than a supplementary activity. Some young setups are scaling very well, but for others it is a big challenge to keep pace with constant technology disruption: you have just become comfortable with MapReduce and suddenly there is Spark; your technology team has just mastered Scala/Python, but Julia is around the corner, and by the way you haven't yet seen Jugaad analytics from Yottaasys.

A similar question surfaced recently while we were pitching one of our core products, Next Product to Buy (NPTB), to a financial customer. The concern raised was: "We just finished our POC using MapReduce on Hadoop, and now Spark is being recommended. Will Spark replace Hadoop, and how does Spark compare with Hadoop?" One of my young guns came up with a very smart answer: "The elephant stays in the room but only performs better when sparked."
That was one of the smartest answers, and it summarizes Spark and its relationship with Hadoop very well: he was right that the elephant stays in the room. Spark is a complementary framework that works alongside Hadoop for big data and analytics products/solutions. The component that gets replaced is MapReduce, and there are very valid reasons why more and more data science products are built on the combination of Spark, Hadoop, and Python. In this blog I will share some of the salient features of this combo and why Spark, not MapReduce.
Spark is complementary to HDFS, not a substitute: Spark runs on top of the existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It supports deploying Spark applications on an existing Hadoop v1 cluster or a Hadoop v2 YARN cluster, and it also works well with Apache Mesos.
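As a minimal sketch of this point, the snippet below attaches a PySpark application to an existing YARN cluster and reads the same HDFS files that MapReduce jobs already process. It assumes Spark is installed alongside the Hadoop cluster with HADOOP_CONF_DIR pointing at its configuration; the application name and HDFS path are hypothetical.

```python
# Minimal sketch: run Spark on an existing Hadoop/YARN cluster.
# Assumes HADOOP_CONF_DIR is set; the HDFS path below is hypothetical.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-on-existing-hdfs")  # hypothetical app name
        .setMaster("yarn"))                    # reuse the existing YARN cluster

sc = SparkContext(conf=conf)

# Spark reads directly from the same HDFS the MapReduce jobs already use.
lines = sc.textFile("hdfs:///data/events.log")  # hypothetical path
print(lines.count())

sc.stop()
```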
Spark is a substitute for MapReduce: MapReduce has been around for quite some time, essentially since Google released its original white paper. It was a great beginning for the early days and remains a good solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase, and you need to convert every use case into this MapReduce pattern to leverage the solution.
The job output between each step has to be written back to disk (HDFS) before the next step can begin, and this approach tends to be slow due to replication and disk I/O. Hadoop solutions also typically involve clusters that are hard to set up and manage, and they require integrating several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing).
If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely.
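To make the rigid one-Map, one-Reduce structure concrete, here is a word-count job in the Hadoop Streaming style, written in Python. This is an illustrative sketch, not anyone's production code; the script name and invocation are hypothetical.

```python
# wordcount.py - a Hadoop Streaming style word count sketch.
# One job = exactly one Map phase plus one Reduce phase.
import sys

def mapper():
    # Map phase: emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: Hadoop sorts map output by key, so counts for the
    # same word arrive contiguously and can be summed in a single pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Used as e.g. `python wordcount.py map` for the mapper and
    # `python wordcount.py reduce` for the reducer in a
    # hadoop-streaming invocation (illustrative only).
    (mapper if sys.argv[1] == "map" else reducer)()
```

Anything beyond this single pass (say, sorting words by their counts) would be a second job that reads the first job's output from HDFS, which is exactly the chaining overhead described above.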
How Spark outperforms MapReduce
- Spark implements the in-memory paradigm to its core: any data generated in intermediate steps is kept in shared memory rather than written to disk.
- Stores data in memory for iterative functions and algorithms, spilling over to disk only once the memory limit is reached (demonstrated in the sketch after this list).
- Optimizes arbitrary operator graphs, rather than forcing every computation into a Map-then-Reduce shape.
- Spark supports lazy evaluation of big data queries, which helps optimize the overall data processing workflow: large-volume computations are not executed eagerly, which keeps space usage efficient for big data activities.
- Spark was written in Scala and provides concise, consistent APIs in Scala, Java, and Python. It offers interactive shells for Scala and Python; no such shell is available for Java yet.
- Spark, HDFS, and Python/Scala make one of the best combinations for any decision sciences processing. Julia could be a killer substitute for Python/Scala thanks to its low-level execution speed, but a stable Julia is still some time away…
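The sketch below illustrates the caching and lazy-evaluation points from the list above in PySpark. The HDFS path and record layout are hypothetical: transformations only build the operator graph, an action triggers execution, and the persisted intermediate result stays in memory, spilling to local disk only if it does not fit.

```python
# Minimal sketch of lazy evaluation and in-memory caching in PySpark.
# Paths and the record layout are hypothetical.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="lazy-eval-demo")

raw = sc.textFile("hdfs:///data/transactions.csv")  # hypothetical path

# Transformations are lazy: these lines only build the operator graph.
parsed = raw.map(lambda line: line.split(","))
valid = parsed.filter(lambda fields: len(fields) == 3)

# Keep the intermediate result in memory, spilling to disk only if it
# does not fit - so the iterations below avoid re-reading HDFS.
valid.persist(StorageLevel.MEMORY_AND_DISK)

# Actions trigger execution of the whole graph.
for threshold in (10, 100, 1000):
    count = valid.filter(lambda f: float(f[2]) > threshold).count()
    print(threshold, count)

sc.stop()
```

Each pass over `valid` here is the kind of multi-pass, iterative access pattern where MapReduce would have written intermediate results back to HDFS between jobs.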