Hadoop Vs Spark

Hadoop Defined

Hadoop is an Apache Software Foundation project: a software library and framework for the distributed processing of large data sets (big data) across clusters of computers using simple programming models. Hadoop can scale from a single machine up to thousands of commodity systems, each offering local storage and compute power. In essence, Hadoop is the ubiquitous 800-lb gorilla of the big data analytics space.

Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • Hadoop YARN
  • Hadoop MapReduce

Although the above four modules comprise Hadoop’s core, there are several other modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend Hadoop’s power and reach into big data applications and large data set processing.

Many companies that work with big data sets and analytics use Hadoop, and it has become the de facto standard in big data applications. Hadoop was originally designed to crawl and search billions of web pages and collect their information into a database. That effort produced Hadoop’s HDFS and its distributed processing engine, MapReduce.

Hadoop is useful to companies when data sets become so large or so complex that their current solutions cannot process the information in what the data’s users consider a reasonable amount of time.

MapReduce is an excellent text-processing engine, and rightly so, since crawling and searching the web (its first job) are both text-based tasks.
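To make the MapReduce model concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration. On a real cluster, the map and reduce steps would run in parallel on many nodes, with the framework handling the grouping in between.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    # Hypothetical sample input standing in for crawled web pages.
    pages = ["big data big clusters", "data on commodity clusters"]
    print(word_count(pages))
```

The sketch keeps everything in memory for readability. In Hadoop MapReduce, the intermediate results between phases are written to disk, which is the behavior the later comparison with Spark turns on.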

Spark Defined

The Apache Spark developers bill it as “a fast and general engine for large-scale data processing.” By comparison, and sticking with the analogy, if Hadoop’s big data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.

Although critics of Spark’s in-memory processing admit that Spark is very fast (up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it also runs up to ten times faster on disk. Spark can perform batch processing as well; however, it really excels at streaming workloads, interactive queries, and machine learning.

Spark’s big claim to fame is its real-time data processing capability as compared to MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules. In fact, on Hadoop’s project page, Spark is listed as a module.

Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run both as a Hadoop module and as a standalone solution makes the two tricky to compare and contrast directly. However, as time goes on, some big data scientists expect Spark to diverge from Hadoop and perhaps replace it, especially in instances where faster access to processed data is critical.

Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its own distributed filesystem, but can use HDFS.

Spark processes data in memory and can also use disk, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which are covered in more detail under the Fault Tolerance section.
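The idea behind RDDs can be illustrated with a toy, single-machine sketch in plain Python. This is not Spark's API: the `ToyRDD` class, its `evict` method, and the `computations` counter are hypothetical names invented here. The sketch assumes only the general behavior of RDDs, namely that each dataset remembers the chain of transformations that produced it (its lineage), can keep its result in memory, and can rebuild a lost result by replaying that lineage.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD; not Spark's API."""

    def __init__(self, compute: Callable[[], List[Any]], lineage: str = "source"):
        self._compute = compute            # recipe for (re)building the data
        self._persist = False              # keep the result in memory?
        self._cache: Optional[List[Any]] = None
        self.lineage = lineage             # human-readable transformation chain
        self.computations = 0              # how many times we had to recompute

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Lazy: nothing runs until collect() is called on the result.
        return ToyRDD(lambda: [fn(x) for x in self.collect()],
                      f"{self.lineage} -> map")

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)],
                      f"{self.lineage} -> filter")

    def cache(self) -> "ToyRDD":
        """Ask for the computed result to be kept in memory."""
        self._persist = True
        return self

    def evict(self) -> None:
        """Simulate losing the in-memory copy, for example when a node fails."""
        self._cache = None

    def collect(self) -> List[Any]:
        if self._cache is not None:
            return list(self._cache)       # served from memory
        self.computations += 1
        data = self._compute()             # replay the lineage
        if self._persist:
            self._cache = data
        return list(data)


if __name__ == "__main__":
    numbers = ToyRDD(lambda: [1, 2, 3, 4])
    even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
    print(even_squares.lineage)
    print(even_squares.collect())   # computed once, then held in memory
    even_squares.evict()            # in-memory copy is lost
    print(even_squares.collect())   # rebuilt by replaying the lineage
```

The design choice to notice is that the toy dataset recovers from a lost in-memory copy without having written intermediate results to disk, in contrast to the MapReduce approach of relying on persistent storage between steps.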
