Comparison between Hadoop, Spark and Storm

Real-time business intelligence (RTBI) is the need of the hour: it lets geographically dispersed parts of a business stay synchronized in real time, increasing the efficiency of your resources. Information originating from any part of the business is made available to all access points immediately. RTBI is gradually replacing traditional business intelligence, thanks to its ability to analyze data as it arrives and then distribute the results. It should be adopted based on a company's actual business needs, or it can become an expensive investment.

To cater to the emerging needs of businesses, a number of platforms have grown up in recent years. Prominent among them are Apache Hadoop, Apache Storm and Apache Spark, all open source frameworks designed for processing huge volumes of data very quickly.

Apache Hadoop

  • The most basic of these platforms, Hadoop is used to store large data sets and run analytic jobs over various segments of that data.
  • It suits many organizations because it is low-cost yet can store massive volumes of data and analyze them effectively within a stipulated time. Its robust architecture and data-warehousing capability are further reasons it is widely chosen.
  • Hadoop's network architecture is robust: large data applications continue to run even when individual nodes or servers in the cluster fail.
  • A major disadvantage of Hadoop MapReduce is its poor real-time performance, because Hadoop processes data in batches, one job at a time.
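The batch model behind these points is the classic map/shuffle/reduce pipeline. A minimal sketch in plain Python (not the real Hadoop Java API — the data and function names here are illustrative) shows how a word count flows through the three phases:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# The whole batch must be available before the job starts -- this is
# exactly why plain MapReduce struggles with real-time workloads.
batch = ["big data moves fast", "data beats opinions"]
counts = reduce_phase(shuffle_phase(map_phase(batch)))
```

Because the reduce phase cannot start until every map output has been shuffled, latency is bounded below by the size of the batch, which is the trade-off the last bullet describes.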

Apache Spark

  • Spark can be termed an advanced version of Hadoop with in-memory, data-parallel execution.
  • Its workflow model is derived from Hadoop MapReduce, but it processes continuous data as a series of independent small batches over short time intervals.
  • For streaming, Spark does not depend on Hadoop's MapReduce engine; it provides its own application programming interface (API) for stream processing. With these features, there are workloads where it turns out to be up to 100 times faster than Hadoop.
  • Spark loses points for not having a distributed storage system of its own; it typically relies on HDFS or another external store.

Apache Storm

  • This open source distributed computing system runs tasks in parallel, minimizing the queue of jobs and so producing faster computations.
  • It organizes work as a topology: a directed acyclic graph (DAG) through which data flows independently between processing nodes.
  • Storm processes do not run on Hadoop clusters; Storm uses ZooKeeper for cluster coordination and its own worker processes for execution.
  • It can read files from and write files to HDFS.
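In Storm's terminology the DAG is built from spouts (stream sources) and bolts (processing steps). A toy sketch in plain Python generators (the sentences and function names are illustrative, and a real topology runs each node in parallel across workers) shows tuples flowing one at a time through such a graph:

```python
def sentence_spout():
    """Spout: the source of the stream (here a fixed list;
    a real spout reads from a queue or socket and never ends)."""
    for sentence in ["storm processes streams", "streams never end"]:
        yield sentence

def split_bolt(tuples):
    """Bolt: split each incoming sentence tuple into word tuples."""
    for sentence in tuples:
        yield from sentence.split()

def count_bolt(tuples):
    """Bolt: maintain a running count per word as tuples arrive."""
    counts = {}
    for word in tuples:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the DAG: spout -> split bolt -> count bolt.
# Each tuple moves through the graph as soon as it is emitted.
counts = count_bolt(split_bolt(sentence_spout()))
```

Because every tuple is handled the moment it arrives, rather than waiting for a batch boundary, this per-event model is what gives Storm its millisecond latency.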

 

How the three compare in practice:

1.   Performance

  • Hadoop: MapReduce lags a little, since each job must be scheduled, run and torn down per batch. However, because processes are killed as soon as a task completes, Hadoop can run alongside other services that demand resources with only a slight degradation in performance.
  • Spark: It processes data in memory, loading datasets into RAM and caching them there; its performance degrades when it runs on top of Hadoop YARN alongside other services competing for resources. It is therefore best for workloads that fit entirely in memory.
  • Storm: Storm produces results with millisecond latency and is required when latency must be minimized without data loss.

2.   Data handling Topology

  • Hadoop: best suited to batch processing; it cannot handle big data applications that require real-time operation.
  • Spark: designed for high performance, so it can be used for both batch and (micro-batch) real-time processing of data. Using a single platform for everything avoids the overhead of maintaining separate systems.
  • Storm: a stream processing engine that handles events one at a time with very low latency; micro-batching is also supported (through its Trident API).

3.   Development

  • Hadoop: MapReduce jobs are written in Java, with Apache Pig available as a higher-level layer for implementing them; SQL compatibility can be established by using Hive over Hadoop.
  • Spark: implemented in Scala, it leans on Scala tuples, which can be a bit awkward to work with from Java.
  • Storm: it models computation as DAGs on every node, and data transfer between them is done through Storm tuples.

Choosing the best framework for your business matters. The choice should be made after weighing a multitude of factors: performance, scalability, cost of development, data processing model, message delivery guarantees, latency, and fault tolerance.
