#15 What is Spark

Apache Spark is an open-source cluster computing framework designed for large-scale data processing, including real-time streaming data.

  • Spark was built on top of the Hadoop MapReduce model.
  • It is optimized to run in memory, whereas alternatives like Hadoop's MapReduce write intermediate data to and from disk (see the sketch below).
  • As a result, Spark processes data much faster than these alternatives.
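Here is a minimal PySpark sketch of the in-memory idea, assuming PySpark is installed and running in local mode: cache() keeps a computed dataset in executor memory, so later actions reuse it instead of recomputing it from source.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InMemoryDemo").getOrCreate()

# Distribute a small dataset across the cluster as an RDD
nums = spark.sparkContext.parallelize(range(1, 1001))

# Mark the transformed RDD to be kept in memory after first computation
squares = nums.map(lambda x: x * x).cache()

# The first action computes and caches the data;
# the second action is served from the in-memory cache
print(squares.count())
print(squares.sum())

spark.stop()
```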

Features of Apache Spark

  • Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • Easy to Use - It facilitates writing applications in Java, Scala, Python, R, and SQL, and offers more than 80 high-level operators (see the sketch after this list).
  • Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
  • Lightweight - It is a lightweight unified analytics engine for large-scale data processing.
  • Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
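To illustrate the "Easy to Use" point, here is a small sketch (with hypothetical sample data and column names dept/salary) showing the same aggregation expressed twice: once with the high-level DataFrame API and once with Spark SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("HighLevelOps").getOrCreate()

# Hypothetical sample data: (department, salary)
df = spark.createDataFrame(
    [("eng", 100), ("eng", 120), ("sales", 90)],
    ["dept", "salary"],
)

# DataFrame API: high-level operators like groupBy and agg
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# The same query expressed in Spark SQL
df.createOrReplaceTempView("employees")
spark.sql(
    "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept"
).show()

spark.stop()
```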

Before starting to learn Apache Spark, it's recommended to understand some basic concepts about Big Data and Hadoop. If you haven't explored these concepts yet, go through the link below for the basic fundamentals.

https://www.dhirubhai.net/feed/update/urn:li:activity:7172098113011134464/

Follow Mohammad Azzam for more such content on Spark and Data Engineering Concepts.
