
The Big 'Big Data' Question: Hadoop or Spark?

Bernard Marr

Internationally Best-selling #Author, #KeynoteSpeaker, #Futurist, #Business, #Tech & #Strategy Advisor

Published: July 19, 2015

One question my clients have asked me a lot recently is: Should we go for Hadoop or Spark as our big data framework? Spark has overtaken Hadoop as the most active open source Big Data project. While they are not directly comparable products, they share many of the same uses.

In order to shed some light onto the issue of “Spark versus Hadoop” I thought an article explaining the essential differences and similarities of each might be useful. As always, I have tried to keep it accessible to anyone, including those without a background in computer science.

Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks.

Hadoop, for many years, was the leading open source Big Data framework but recently the newer and more advanced Spark has become the more popular of the two Apache Software Foundation tools.

However, they do not perform exactly the same tasks, and they are not mutually exclusive, as they are able to work together. Although Spark is reported to work up to 100 times faster than Hadoop's MapReduce in certain circumstances, it does not provide its own distributed storage system.

Distributed storage is fundamental to many of today’s Big Data projects as it allows vast multi-petabyte datasets to be stored across a very large number of everyday computer hard drives, rather than on hugely costly custom machinery which would hold it all on one device. These systems are scalable, meaning that more drives can be added to the network as the dataset grows in size.
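To make the idea concrete, here is a minimal sketch of how a dataset might be cut into blocks and spread over several machines. It is a toy model only: the function name `place_blocks`, the node names and the round-robin rule are assumptions made for illustration, and real systems such as HDFS use more sophisticated placement policies.

```python
from collections import defaultdict


def place_blocks(data, block_size, nodes, replicas=2):
    """Split `data` into fixed-size blocks and assign each block to
    `replicas` different nodes, walking round-robin over the node list.

    Hypothetical helper for illustration; not any real system's API.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = (len(data) + block_size - 1) // block_size
    placement = defaultdict(list)
    for block_index in range(block_count):
        for copy in range(replicas):
            # Consecutive copies land on consecutive (distinct) nodes.
            node = nodes[(block_index + copy) % len(nodes)]
            placement[node].append(block_index)
    return dict(placement)


# Ten bytes in blocks of three bytes gives four blocks (0 to 3).
placement = place_blocks(b"x" * 10, block_size=3,
                         nodes=["node-a", "node-b", "node-c"])
for node, blocks in sorted(placement.items()):
    print(node, blocks)
```

Because every block lives on two machines in this sketch, losing any single node still leaves a copy of each block elsewhere, and adding a node to the list spreads the same blocks more thinly.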

As I mentioned, Spark does not include its own system for organizing files in a distributed way (the file system), so it requires one provided by a third party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

What really gives Spark the edge over Hadoop is speed. Spark handles most of its operations “in memory”, copying the data from the distributed physical storage into far faster RAM. This reduces the amount of time-consuming reading and writing to and from slow, clunky mechanical hard drives that is needed under Hadoop’s MapReduce system.
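The difference can be sketched with a small pipeline that is run in two ways. This is an illustrative toy, not how either framework is implemented: the function names `run_disk_style` and `run_memory_style` and the three sample steps are assumptions, and a temporary JSON file stands in for the distributed disk.

```python
import json
import os
import tempfile


def run_disk_style(data, steps):
    """Persist the intermediate result to a file after every step and
    read it back before the next one (the pattern described above)."""
    round_trips = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.json")
        for step in steps:
            data = step(data)
            with open(path, "w") as handle:
                json.dump(data, handle)
            with open(path) as handle:
                data = json.load(handle)
            round_trips += 1
    return data, round_trips


def run_memory_style(data, steps):
    """Keep the intermediate result in RAM between steps."""
    for step in steps:
        data = step(data)
    return data, 0


steps = [
    lambda xs: [x * 2 for x in xs],            # double every value
    lambda xs: [x for x in xs if x % 3 == 0],  # keep multiples of three
    lambda xs: [sum(xs)],                      # add up what is left
]
numbers = list(range(10))
disk_result, disk_trips = run_disk_style(numbers, steps)
mem_result, mem_trips = run_memory_style(numbers, steps)
print(disk_result, disk_trips)
print(mem_result, mem_trips)
```

Both versions produce the same answer; the disk-style run simply pays for one write and one read per step, which is the cost that grows painful when the data is large and the job has many steps.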

MapReduce writes all of the data back to the physical storage medium after each operation. This was originally done to ensure a full recovery could be made in case something goes wrong, as data held electronically in RAM is more volatile than that stored magnetically on disks. However, Spark arranges data in what are known as Resilient Distributed Datasets, which keep a record of the operations used to build them, so a lost piece of data can be recomputed following a failure.
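The recovery idea can be illustrated with a toy class that remembers the steps used to derive its data. This is a sketch under stated assumptions: `ToyDataset` and its methods are hypothetical names invented here and are not Spark's RDD API, and the source partitions are assumed to sit safely on durable storage.

```python
class ToyDataset:
    """Toy stand-in for a dataset that can rebuild lost pieces from a
    recorded list of transformations. Not Spark's actual API."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # assumed durable input
        self.lineage = tuple(lineage)               # ordered per-row functions
        self.cache = {}                             # in-memory results

    def map(self, fn):
        # Record the step instead of running it straight away.
        return ToyDataset(self.source_partitions, self.lineage + (fn,))

    def compute(self, index):
        # Rebuild the partition from the source if it is not in memory.
        if index not in self.cache:
            rows = self.source_partitions[index]
            for fn in self.lineage:
                rows = [fn(row) for row in rows]
            self.cache[index] = rows
        return self.cache[index]

    def lose_partition(self, index):
        # Simulate a machine failure that wipes one in-memory partition.
        self.cache.pop(index, None)


source = [[1, 2], [3, 4], [5, 6]]
dataset = ToyDataset(source).map(lambda x: x + 1).map(lambda x: x * 10)

before = dataset.compute(1)
dataset.lose_partition(1)
after = dataset.compute(1)   # recomputed from the recorded steps
print(before, after)
```

Only the lost partition is recomputed; the other partitions are untouched, which is why this approach can avoid writing every intermediate result to disk.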

Spark’s functionality for handling advanced data processing tasks such as real time stream processing and machine learning is way ahead of what is possible with Hadoop alone. This, along with the gain in speed provided by in-memory operations, is the real reason, in my opinion, for its growth in popularity. Real-time processing means that data can be fed into an analytical application the moment it is captured, and insights immediately fed back to the user through a dashboard, to allow action to be taken. This sort of processing is increasingly being used in all sorts of Big Data applications, for example recommendation engines used by retailers, or monitoring the performance of industrial machinery in the manufacturing industry.
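As a rough illustration of the machinery-monitoring case, the sketch below processes readings one at a time as they arrive and raises an alert when a rolling average crosses a limit. Everything here is assumed for the example: the `monitor` function, the window of three readings, the threshold of 80 and the sample temperatures. It does not use any Spark streaming API.

```python
from collections import deque


def monitor(readings, window=3, threshold=80.0):
    """Yield (value, rolling_average, alert) for each incoming reading.

    The average covers only the most recent `window` readings, so each
    result is available the moment its reading arrives.
    """
    recent = deque(maxlen=window)
    for value in readings:
        recent.append(value)
        average = sum(recent) / len(recent)
        yield value, average, average > threshold


temperatures = [70, 75, 90, 95, 100, 60]
alerts = [value for value, average, alert in monitor(temperatures) if alert]
print(alerts)
```

Note that the final reading of 60 still triggers an alert, because the average of the last three readings remains above the limit; a dashboard built on this would show the machine as running hot until cooler readings accumulate.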

Machine learning means creating algorithms which can “think” for themselves, allowing them to improve and “learn” through a process of statistical modelling and simulation until an ideal solution to a proposed problem is found. It is an area of analytics which is well suited to the Spark platform, thanks to its speed and ability to handle streaming data. This sort of technology lies at the heart of the latest advanced manufacturing systems used in industry, which can predict when parts will go wrong and when to order replacements, and will also lie at the heart of the driverless cars and ships of the near future. Spark includes its own machine learning library, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example Apache Mahout.
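One reason speed matters here is that many learning algorithms pass over the same data again and again. The sketch below fits a single number by repeated small corrections; it is a toy written for this explanation, with a made-up function name `fit_slope` and made-up data, and it does not use MLlib or Mahout.

```python
def fit_slope(xs, ys, learning_rate=0.01, passes=200):
    """Fit y = w * x by gradient descent on the mean squared error.

    Every pass reads the whole dataset once, so the data is scanned
    `passes` times in total.
    """
    w = 0.0
    for _ in range(passes):
        # Gradient of mean((w * x - y) ** 2) with respect to w.
        gradient = 2 * sum(x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with a true slope of 2
slope = fit_slope(xs, ys)
print(round(slope, 6))
```

With 200 passes over the data, a system that rereads its input from disk on every pass pays that cost 200 times, while one that keeps the data in memory pays it once.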

The reality is, although the existence of the two Big Data frameworks is often pitched as a battle for dominance, that isn’t really the case. There is some crossover of function, but both are non-commercial products so it isn’t really “competition” as such, and the corporate entities which do make money from providing support and installation of these free-to-use systems will often offer both services, allowing the buyer to pick and choose which functionality they require from each framework.

Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data simply consists of a huge amount of very structured data (e.g. customer names and addresses) you may have no need for the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and its security and support infrastructure is not as advanced.

The increasing amount of Spark activity taking place (when compared to Hadoop activity) in the open source community is, in my opinion, a further sign that everyday business users are finding increasingly innovative uses for their stored data. The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other: vendors can sell both (or rather, provide installation and support services for both), based on what their customers actually need in order to extract maximum value from their data.

Thank you for reading my post. Here at LinkedIn and at Forbes I regularly write about management, technology and the mega-trend that is Big Data. If you would like to read my regular posts then please click 'Follow' and feel free to also connect via Twitter, Facebook and The Advanced Performance Institute.

You might also be interested in my new big data case study collection, which you can download for free from here: Big Data Case Study Collection: 7 Amazing Companies That Really Get Big Data.

Here are some other recent articles I have written:

What is Big Data - A complete overview
How is Big Data Used In Practice? 10 Use Cases Everyone Must Read
How To Make A Billion Dollars From Big Data
Walmart: The Big Data Skills Crisis and Recruiting Analytics Talent
Big Data-As-A-Service Is Next Big Thing
How Big Data Is Changing Healthcare
Where Big Data Projects Fail

About: Bernard Marr is a globally recognized expert in big data, analytics and enterprise performance. He helps companies improve decision-making and performance using data. His new book is Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. You can read a free sample chapter here.

Photo: Shutterstock.com

Diana K. Michael

Quality Control Manager at AT&T Mobility

8 years ago

I too had no clue about either Hadoop or Spark. Thanks ever so much Bernard Marr for explaining the differences between the two!!

LaVette Gordon

Certified Reiki Master | Birth Doula | Lead Scrum Master |Product Owner Delegate | Certified SAFe 6.0 Practitioner

9 years ago

I was not too informed about Spark until reading this article. It is very informative and I appreciate the comparison with Hadoop (which I know more about).

Victor Pancras

9 years ago

Thanks. Informative.

Pawankumar Thakare

Vice President at Barclays

9 years ago

Good article.

Narayanan CK

9 years ago

Excellent article, more so from the IoT perspective.
