ELEPHANT WORKS BETTER WHEN SPARKED
Arun Pandey
It might be another few years before analytics is everywhere, but even today it is picking up really well. It is becoming one of the core tools for deriving a competitive advantage rather than a supplementary activity. Some young setups are scaling very well, but for others it is a big challenge to keep pace with constant technology disruption: you have just become comfortable with MapReduce and suddenly there is Spark; your technology team has just mastered Scala/Python, but Julia is around the corner, and by the way you haven't yet seen Jugaad analytics from Yottaasys.

A similar question surfaced recently while we were pitching one of our core products, Next Product to Buy (NPTB), to a financial customer. The concern raised was: "We just finished our POC using MapReduce on Hadoop, and now Spark is being recommended. Will Spark replace Hadoop, and how does Spark compare with Hadoop?" One of my young guns came up with a very smart answer: "The elephant stays in the room but only performs better when sparked."
That was one of the smartest answers, and it summarizes Spark and its relationship with Hadoop very well: he was right that the elephant stays in the room. Spark is a complementary framework that works alongside Hadoop for big data and analytics products/solutions. The component that gets replaced is MapReduce, and there are very valid reasons why more and more data science products are built on the combination of Spark, Hadoop, and Python. In this blog I will share some of the salient features of this combo and why Spark, not MapReduce.
Spark is complementary to HDFS, not a substitute: Spark runs on top of the existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It supports deploying Spark applications on an existing Hadoop v1 cluster or a Hadoop v2 YARN cluster, and it also works well with Apache Mesos.
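As a minimal sketch of this point, the snippet below attaches a PySpark application to an existing YARN cluster and reads the same HDFS files that MapReduce jobs already process. It assumes Spark is installed alongside the Hadoop cluster with HADOOP_CONF_DIR pointing at its configuration; the application name and HDFS path are hypothetical.

```python
# Minimal sketch: run Spark on an existing Hadoop/YARN cluster.
# Assumes HADOOP_CONF_DIR is set; the HDFS path below is hypothetical.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("spark-on-existing-hdfs")  # hypothetical app name
        .setMaster("yarn"))                    # reuse the existing YARN cluster

sc = SparkContext(conf=conf)

# Spark reads directly from the same HDFS the MapReduce jobs already use.
lines = sc.textFile("hdfs:///data/events.log")  # hypothetical path
print(lines.count())

sc.stop()
```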
Spark is a substitute for MapReduce: MapReduce has been around for quite some time, essentially since Google released its original white paper. It was a great beginning for the early days and remains a good solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase, and you need to convert every use case into this MapReduce pattern to leverage the solution.
The job output between each step has to be written back to disk (HDFS) before the next step can begin, and this approach tends to be slow due to replication and disk I/O. Hadoop solutions also typically involve clusters that are hard to set up and manage, and they require integrating several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing).
If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely.
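To make the rigid one-Map, one-Reduce structure concrete, here is a word-count job in the Hadoop Streaming style, written in Python. This is an illustrative sketch, not anyone's production code; the script name and invocation are hypothetical.

```python
# wordcount.py - a Hadoop Streaming style word count sketch.
# One job = exactly one Map phase plus one Reduce phase.
import sys

def mapper():
    # Map phase: emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: Hadoop sorts map output by key, so counts for the
    # same word arrive contiguously and can be summed in a single pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Used as e.g. `python wordcount.py map` for the mapper and
    # `python wordcount.py reduce` for the reducer in a
    # hadoop-streaming invocation (illustrative only).
    (mapper if sys.argv[1] == "map" else reducer)()
```

Anything beyond this single pass (say, sorting words by their counts) would be a second job that reads the first job's output from HDFS, which is exactly the chaining overhead described above.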
How Spark outperforms MapReduce
- Spark implements the in-memory paradigm to its core: any data generated in intermediate steps is kept in shared memory rather than written to disk.
- Stores data in memory for iterative functions and algorithms, spilling over to disk only once the memory limit is reached (demonstrated in the sketch after this list).
- Optimizes arbitrary operator graphs, rather than forcing every computation into a Map-then-Reduce shape.
- Spark supports lazy evaluation of big data queries, which helps optimize the overall data processing workflow: large-volume computations are not executed eagerly, which keeps space usage efficient for big data activities.
- Spark was written in Scala and provides concise, consistent APIs in Scala, Java, and Python. It offers interactive shells for Scala and Python; no such shell is available for Java yet.
- Spark, HDFS, and Python/Scala make one of the best combinations for any decision sciences processing. Julia could be a killer substitute for Python/Scala thanks to its low-level execution speed, but a stable Julia is still some time away…
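The sketch below illustrates the caching and lazy-evaluation points from the list above in PySpark. The HDFS path and record layout are hypothetical: transformations only build the operator graph, an action triggers execution, and the persisted intermediate result stays in memory, spilling to local disk only if it does not fit.

```python
# Minimal sketch of lazy evaluation and in-memory caching in PySpark.
# Paths and the record layout are hypothetical.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="lazy-eval-demo")

raw = sc.textFile("hdfs:///data/transactions.csv")  # hypothetical path

# Transformations are lazy: these lines only build the operator graph.
parsed = raw.map(lambda line: line.split(","))
valid = parsed.filter(lambda fields: len(fields) == 3)

# Keep the intermediate result in memory, spilling to disk only if it
# does not fit - so the iterations below avoid re-reading HDFS.
valid.persist(StorageLevel.MEMORY_AND_DISK)

# Actions trigger execution of the whole graph.
for threshold in (10, 100, 1000):
    count = valid.filter(lambda f: float(f[2]) > threshold).count()
    print(threshold, count)

sc.stop()
```

Each pass over `valid` here is the kind of multi-pass, iterative access pattern where MapReduce would have written intermediate results back to HDFS between jobs.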