Befriending Apache Hadoop and Spark
If you've ever been part of a discussion involving Big Data, then Apache Spark and Hadoop, two of the most popular projects of the Apache Software Foundation, have almost certainly come up in the same breath. So, let's take a look at both of them in a nutshell.
1) Doing things differently: Even though they both sit under the umbrella of Big Data frameworks, they don't exactly serve the same purpose. Hadoop, at large scale, brings distributed data infrastructure: it spreads massive data across multiple nodes of a cluster built on commodity hardware. Apache Spark, on the other side, does not provide data storage capabilities; it focuses purely on improving the data-processing part.
Key takeaway: Hadoop brings HDFS (storage) plus processing; Spark brings processing, which may or may not sit on top of HDFS.
2) Use either one or both, up to you: Hadoop couples HDFS (Hadoop Distributed File System) with the MapReduce model for data crunching, whereas Apache Spark does not come with its own file system. So, when using Hadoop you do not strictly need Apache Spark's processing capabilities. Conversely, since Spark has no file system of its own, it can operate on HDFS as its data source, as sketched below.
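Here is a minimal PySpark sketch of that pairing: Spark does the processing while HDFS does the storage. The NameNode address (namenode:9000) and the file paths are hypothetical placeholders, not anything prescribed by either project.

```python
from pyspark.sql import SparkSession

# Spark supplies the processing engine; storage comes from HDFS.
spark = SparkSession.builder.appName("spark-on-hdfs").getOrCreate()

# Read a text file straight out of HDFS (hypothetical NameNode address and path).
lines = spark.read.text("hdfs://namenode:9000/data/events.log")

# Crunch the data in Spark, then write the result back into HDFS.
errors = lines.filter(lines.value.contains("ERROR"))
errors.write.mode("overwrite").text("hdfs://namenode:9000/data/errors")

spark.stop()
```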
3) Apache Spark, the horse, and Apache Hadoop, the elephant: As you can rightly guess, Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. A typical Hadoop MapReduce job looks like this: read the data from the cluster, perform some operation, and write the results back to the cluster; then start again from point zero, read the updated data, perform another operation, and write the results again (batch processing). A typical Apache Spark job reads the data into memory once (its in-memory feature), processes it, and writes the full result out at the end. This in-memory computation is what gives Spark its dramatic edge in speed, as the sketch below illustrates.
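A minimal sketch of why in-memory computation matters: once the data is cached, several operations can reuse it without another disk round-trip, which is exactly the repeated read-process-write cycle MapReduce would incur. The input path is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input; in MapReduce, every pass would re-read this from disk.
logs = spark.read.text("hdfs://namenode:9000/data/events.log")

# Pull the dataset into memory once...
logs.cache()

# ...then run several operations against it without touching disk again.
error_count = logs.filter(logs.value.contains("ERROR")).count()
warn_count = logs.filter(logs.value.contains("WARN")).count()
total = logs.count()

print(error_count, warn_count, total)
spark.stop()
```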
4) Do you really need the 100x faster Apache Spark?: Ask yourself on the basis of your use case. If your requirements suit batch processing, choose Hadoop; if your requirements demand analysing streaming data, then Apache Spark is your call. Typical use cases for Spark are things like real-time advertisement tracking, cyber security, and online recommendations; a small streaming sketch follows below.
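To make the streaming case concrete, here is a minimal Spark Structured Streaming sketch, assuming a text stream arriving on a local socket (localhost:9999, hypothetical; you could feed it with `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a live text stream from a socket (hypothetical host/port).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Maintain a running word count -- continuous processing that batch MapReduce isn't built for.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console every time the stream delivers new data.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```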
5) Fault recovery: which one bounces back faster?
In Apache Hadoop, data is written to disk after every operation, so it is considered highly resilient, and recovery is further helped by the replication factor in HDFS. Spark, even though it works in memory, provides similar fault recovery by virtue of Resilient Distributed Datasets (RDDs): each RDD keeps track of the lineage of operations that produced it, so lost partitions can be recomputed on another node (see the sketch below).
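A minimal sketch of that lineage idea: toDebugString() prints the chain of transformations Spark records for an RDD, which is what it replays to rebuild partitions lost when a node fails. The RDD itself is just illustrative data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD through a chain of transformations (illustrative data, 8 partitions).
numbers = sc.parallelize(range(1, 1001), 8)
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# toDebugString() shows the lineage Spark keeps for the RDD; if an executor
# holding some partitions dies, Spark replays this chain to rebuild them.
print(evens.toDebugString().decode("utf-8"))
print(evens.count())

spark.stop()
```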