ELEPHANT WORKS BETTER WHEN SPARKED

It may be a few more years before analytics is everywhere, but even today it is picking up really well. It is becoming one of the core tools for deriving a competitive advantage rather than a supplementary activity. Some young setups are scaling very well, but for others the big challenge is keeping pace with constant technology disruption. You have just become comfortable with MapReduce and suddenly there is Spark; your technology team has just mastered Scala or Python, but Julia is around the corner; and, by the way, you haven't yet seen Jugaad analytics from Yottaasys.

A similar question surfaced recently while we were pitching one of our core products, Next Product to Buy (NPTB), to a financial customer. Their concern was: "We have just finished our POC using MapReduce on Hadoop, and now Spark is being recommended. Will Spark replace Hadoop, and how does Spark compare with Hadoop?" One of my young guns came up with a very smart answer: “The elephant stays in the room but only performs better when sparked.”

That was one of the smartest answers I have heard, because it sums up Spark and its relationship with Hadoop so well. He was right: the elephant stays in the room. Spark is a complementary framework that works alongside Hadoop for big data and analytics products and solutions. The component that gets replaced is MapReduce, and there are very valid reasons why more and more data science products are built on a combination of Spark, Hadoop and Python. In this blog I will share some salient features of this combination and explain why Spark, and not MapReduce.

Spark is complementary to HDFS and not a substitute: Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. Spark applications can be deployed in an existing Hadoop v1 cluster or a Hadoop v2 YARN cluster. Spark also works well with Apache Mesos.

Spark is a substitute for MapReduce: MapReduce has been around for quite some time, ever since Google published its original MapReduce paper. It was a great beginning for the early days and remains a good solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase, and you need to convert any use case into the MapReduce pattern to leverage this solution.
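To make the "one Map phase, one Reduce phase" pattern concrete, here is a minimal plain-Python sketch of a word count forced into that shape. It is only an illustration of the pattern, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three stages in order (hypothetical helper for this sketch)."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["spark and hadoop", "hadoop and hdfs"]))
```

Anything that does not naturally look like "emit key/value pairs, then aggregate per key" has to be bent into this form, which is the conversion effort described above.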

The job output between steps has to be written back to disk (HDFS) before the next step can begin, and this approach tends to be slow because of replication and disk I/O. Also, Hadoop solutions typically include clusters that are hard to set up and manage, and they require the integration of several tools for different big data use cases (such as Mahout for machine learning and Storm for streaming data processing).

If you want to do something complicated, you have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs is high-latency, and none can start until the previous job has finished completely.
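The cost of chaining jobs through disk can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Hadoop or Spark code: local JSON files stand in for HDFS, and the names run_with_disk, run_in_memory, square and add_one are hypothetical.

```python
import json
import os
import tempfile


def square(values):
    return [v * v for v in values]


def add_one(values):
    return [v + 1 for v in values]


def run_with_disk(values, steps, workdir):
    """MapReduce-style chaining: every step reads its input from disk
    and writes its output back before the next step may start."""
    path = os.path.join(workdir, "step_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    writes = 1
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            data = json.load(f)
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(data), f)
        writes += 1
    with open(path) as f:
        return json.load(f), writes


def run_in_memory(values, steps):
    """In-memory chaining: intermediate results never touch the disk."""
    data = values
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk([1, 2, 3], [square, add_one], workdir))
    print(run_in_memory([1, 2, 3], [square, add_one]))
```

Both versions compute the same result; the disk-backed one simply pays a write and a read for every step in the chain, and on a real cluster each of those writes is also replicated.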

How Spark overtakes MapReduce

  • Spark implements the in-memory paradigm at its core: data generated in intermediate steps is kept in shared memory rather than written to disk.
  • It stores data in memory for iterative functions and algorithms, and spills data over to disk only once the memory limit is reached.
  • It optimizes arbitrary operator graphs.
  • Spark supports lazy evaluation of big data queries, which helps optimize the overall data processing workflow: large-volume computations are not run eagerly, and the resulting space savings matter for big data workloads.
  • Spark is written in Scala and provides concise and consistent APIs in Scala, Java and Python. It offers an interactive shell for Scala and Python; one is not yet available for Java.
  • Spark, HDFS and Python/Scala form one of the best combinations for decision sciences processing. Julia could become a killer alternative to Python/Scala because of its low-level (GL) execution, but a stable Julia is still some time away.
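
The lazy evaluation point in the list above can be illustrated with a small plain-Python sketch. It is deliberately not Spark's API: LazyDataset and traced_double are hypothetical names for this example. Transformations are only recorded, and nothing runs until collect() is called, at which point the whole chain is executed in one streaming pass.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Record the transformation; do not run it yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Record the transformation; do not run it yet.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # The "action": build one iterator pipeline from the recorded
        # operations and pull every element through it in a single pass.
        stream = iter(self._source)
        for kind, fn in self._ops:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


if __name__ == "__main__":
    calls = []

    def traced_double(x):
        calls.append(x)  # remember when the function really runs
        return x * 2

    ds = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 4)
    print(calls)         # still empty: nothing has been computed
    print(ds.collect())  # the recorded chain executes only now
```

Because the full chain is known before anything runs, an engine built this way has room to plan the work as a whole and to avoid materializing intermediate results, which is the benefit the list attributes to Spark.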

Arun Pandey