Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Spark comes packed with a wide range of libraries for machine learning (ML) algorithms and graph algorithms. Not just that, it also supports real-time streaming and SQL applications via Spark Streaming and Spark SQL (the successor to Shark), respectively. The best part about using Spark is that you can write Spark apps in Java, Scala, or even Python, and these apps can run up to ten times faster on disk and up to 100 times faster in memory than MapReduce apps.

Apache Spark History

Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in early 2010. Many of the ideas behind the system were presented in various research papers over the years.

After being released, Spark grew into a broad developer community and moved to the Apache Software Foundation in 2013. Today, the project is developed collaboratively by a community of hundreds of developers from hundreds of organizations.

Spark Architecture Overview

Apache Spark has a well-defined layered architecture where all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions:

  • Resilient Distributed Dataset (RDD)
  • Directed Acyclic Graph (DAG)
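
To make these two abstractions a little more concrete before going further, here is a minimal PySpark sketch (the numbers and names are illustrative assumptions, not from any particular workload): transformations such as filter and map only record lineage in the DAG, and nothing actually runs until an action is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-dag-sketch")  # local master, just for illustration

numbers = sc.parallelize(range(1, 1001), numSlices=4)   # an RDD split into 4 partitions
evens = numbers.filter(lambda x: x % 2 == 0)            # transformation: only recorded in the DAG
squares = evens.map(lambda x: x * x)                    # transformation: extends the DAG, still nothing runs

# Only this action makes the scheduler turn the DAG into stages and tasks and execute them.
print(squares.count())

sc.stop()
```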

But before diving any deeper into the Spark architecture, let me explain a few fundamental concepts of Spark, such as the Spark ecosystem and RDDs. This will help you gain better insights.

Let me first walk you through the Spark ecosystem, component by component:

  1. Spark Core: Spark Core is the base engine for large-scale parallel and distributed data processing. Additional libraries built on top of the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for memory management and fault recovery, for scheduling, distributing and monitoring jobs on a cluster, and for interacting with storage systems.
  2. Spark Streaming: Spark Streaming is the component of Spark used to process real-time streaming data. It is a useful addition to the core Spark API, enabling high-throughput, fault-tolerant stream processing of live data streams.
  3. Spark SQL: Spark SQL is the module that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMSs, Spark SQL will be an easy transition from your earlier tools, letting you extend the boundaries of traditional relational data processing.
  4. GraphX: GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD abstraction with the Resilient Distributed Property Graph, a directed multigraph with properties attached to each vertex and edge.
  5. MLlib (Machine Learning): MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
  6. SparkR: SparkR is an R package that provides a distributed data frame implementation. It supports operations like selection, filtering, and aggregation, but on large datasets.

As you can see, Spark comes packed with high-level libraries and supports R, SQL, Python, Scala, Java, and more. These standard libraries make it easier to build complex workflows in which MLlib, GraphX, SQL + DataFrames, streaming services, and other components integrate seamlessly, extending Spark's capabilities.
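
For instance, here is a minimal sketch of the SQL + DataFrames library in action (PySpark; the file name and column names are assumptions used purely for illustration): the same data can be queried through the DataFrame API or with plain SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").master("local[*]").getOrCreate()

# Assumed sample file with one JSON record per line; replace with a real dataset.
orders = spark.read.json("orders.json")

# DataFrame API
orders.groupBy("country").count().show()

# The same query expressed in SQL
orders.createOrReplaceTempView("orders")
spark.sql("SELECT country, COUNT(*) AS cnt FROM orders GROUP BY country").show()

spark.stop()
```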

Spark Eco-System

To recap, the Spark ecosystem is composed of various components like Spark SQL, Spark Streaming, MLlib, GraphX, and the Core API component.


Working of Spark Architecture

Now that you have seen the basic architectural overview of Apache Spark, let's dive deeper into how it works.

On the master node, you have the driver program, which drives your application. The code you write behaves as the driver program, or, if you are using the interactive shell, the shell acts as the driver program.

Inside the driver program, the first thing you do is create a SparkContext. Think of the SparkContext as a gateway to all Spark functionality. It is similar to a database connection: any command you execute in your database goes through the database connection, and likewise, anything you do in Spark goes through the SparkContext.

Now, this SparkContext works with the cluster manager to manage various jobs. The driver program and SparkContext take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes. Whenever an RDD is created in the SparkContext, it can be distributed across various nodes and cached there.

Worker nodes are the slave nodes whose job is to execute the tasks. The tasks run on the partitioned RDDs in the worker nodes and return the results to the SparkContext.

The SparkContext takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main SparkContext.
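
A minimal sketch of this flow (PySpark; the master URL and partition count are assumptions for illustration): the script below is the driver program, the SparkContext it creates talks to the cluster manager named in the master URL, and each action is broken into one task per partition that runs on the workers.

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("architecture-sketch")
        .setMaster("spark://master-host:7077"))  # assumed standalone cluster manager URL

sc = SparkContext(conf=conf)  # the gateway created inside the driver program

# The RDD is split into 8 partitions that are distributed across the worker nodes.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# sum() is an action: the job is broken into 8 tasks (one per partition), the tasks
# run on the workers, and the partial results are combined back in the driver.
print(rdd.map(lambda x: x * 2).sum())

sc.stop()
```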

Applications of Apache Spark

As the adoption of Spark across industries continues to rise steadily, it is giving birth to unique and varied Spark applications. These Spark applications are being successfully implemented and executed in real-world scenarios. Let’s take a look at some of the most exciting Spark applications of our time!

1. Processing Streaming Data

The most wonderful aspect of Apache Spark is its ability to process streaming data. Every second, an unprecedented amount of data is generated globally. This pushes companies and businesses to process data in large bulks and analyze it in real-time. The Spark Streaming feature can efficiently handle this function. By unifying disparate data processing capabilities, Spark Streaming allows developers to use a single framework to accommodate all their processing requirements.
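
As a small illustration, here is a minimal Spark Streaming sketch (PySpark DStream API; the host, port, and batch interval are assumptions): it counts words arriving on a socket in 10-second micro-batches.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")  # at least 2 local threads: one receiver, one for processing
ssc = StreamingContext(sc, batchDuration=10)       # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # assumed source, e.g. fed by `nc -lk 9999`
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```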

2. Machine Learning

Spark has commendable machine learning abilities. It is equipped with an integrated framework for performing advanced analytics that allows you to run repeated queries on datasets, which is essentially what machine learning algorithms require. The Machine Learning Library (MLlib) is one of Spark's most potent ML components.

This library can perform clustering, classification, dimensionality reduction, and much more. With MLlib, Spark can be used for many Big Data functions such as sentiment analysis, predictive intelligence, customer segmentation, and recommendation engines, among other things.
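
To give a flavour of MLlib, here is a minimal sketch (PySpark ML API; the tiny in-line dataset and column names are made up for illustration) that trains a logistic regression classifier:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").master("local[*]").getOrCreate()

# A made-up toy dataset: a label plus two numeric features.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.3), (1.0, 3.4, 2.1), (0.0, 0.8, 0.5), (1.0, 2.9, 1.8)],
    ["label", "f1", "f2"],
)

# Assemble the feature columns into a single vector column, then fit the model.
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)

model.transform(features).select("label", "prediction").show()

spark.stop()
```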

3. Fog Computing

The concept of fog computing is deeply entwined with the Internet of Things (IoT). IoT thrives on the idea of embedding objects and devices with sensors that can communicate with each other and with the user, creating an interconnected web of devices and users. As more and more users adopt IoT platforms and join this web of interconnected devices, the amount of data generated grows beyond comprehension.

Apache Spark Features

1. Lightning-fast processing speed

Big Data processing is all about processing large volumes of complex data. Hence, when it comes to Big Data processing, organizations and enterprises want frameworks that can process massive amounts of data at high speed. As we mentioned earlier, Spark apps can run up to 100x faster in memory and 10x faster on disk in Hadoop clusters.

It relies on Resilient Distributed Datasets (RDDs), which allow Spark to transparently store data in memory and read/write it to disk only when needed. This helps reduce most of the disk read and write time during data processing.
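
A hedged sketch of what that looks like in practice (PySpark; the input path is an assumption): persist() keeps the RDD partitions in executor memory and spills to disk only if they do not fit, so repeated actions avoid re-reading the source.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-sketch")

logs = sc.textFile("hdfs:///logs/access.log")       # assumed input path
errors = logs.filter(lambda line: "500" in line)
errors.persist(StorageLevel.MEMORY_AND_DISK)        # keep in memory, fall back to disk if needed

# Both actions reuse the persisted partitions instead of re-reading and re-filtering the file.
print(errors.count())
print(errors.filter(lambda line: "/checkout" in line).count())

sc.stop()
```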

2. Ease of use

Spark allows you to write scalable applications in Java, Scala, Python, and R. So, developers get the scope to create and run Spark applications in their preferred programming languages. Moreover, Spark is equipped with a built-in set of over 80 high-level operators. You can use Spark interactively to query data from Scala, Python, R, and SQL shells.

3. It offers support for sophisticated analytics

Not only does Spark support simple “map” and “reduce” operations, but it also supports SQL queries, streaming data, and advanced analytics, including ML and graph algorithms. It comes with a powerful stack of libraries such as SQL & DataFrames, MLlib (for ML), GraphX, and Spark Streaming. What’s fascinating is that Spark lets you combine the capabilities of all these libraries within a single workflow/application.

4. Real-time stream processing

Spark is designed to handle real-time data streaming. While MapReduce is built to handle and process the data that is already stored in Hadoop clusters, Spark can do both and also manipulate data in real-time via Spark Streaming.

Unlike many other streaming solutions, Spark Streaming can recover lost work and deliver exactly-once semantics out of the box without requiring extra code or configuration. Plus, it lets you reuse the same code for batch and stream processing, and even for joining streaming data to historical data.
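
Here is a minimal sketch of that reuse, written with the newer Structured Streaming API rather than the classic DStream API (PySpark; the paths and the "level" column are assumptions): the same transformation function is applied to a batch read and to a streaming read.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-and-stream-sketch").master("local[*]").getOrCreate()

def level_counts(df):
    # The shared logic: count log records per level ("level" is an assumed column).
    return df.groupBy("level").agg(F.count("*").alias("cnt"))

# Batch: historical JSON logs (assumed path).
batch_df = spark.read.json("hdfs:///logs/history/")
level_counts(batch_df).show()

# Streaming: the very same function applied to files arriving in a directory (assumed path).
stream_df = spark.readStream.schema(batch_df.schema).json("hdfs:///logs/incoming/")
query = (level_counts(stream_df)
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```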

5. It is flexible

Spark can run independently in cluster mode, and it can also run on Hadoop YARN, Apache Mesos, Kubernetes, and even in the cloud. Furthermore, it can access diverse data sources. For instance, Spark can run on the YARN cluster manager and read any existing Hadoop data. It can read from data sources such as HBase, HDFS, Hive, and Cassandra. This aspect of Spark makes it an ideal tool for migrating pure Hadoop applications, provided the apps’ use-case is Spark-friendly.
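
For instance, here is a hedged sketch of running on YARN and reading existing Hadoop data (PySpark; the table and path names are assumptions, and in practice the master is usually supplied via spark-submit --master yarn with HADOOP_CONF_DIR pointing at the cluster configuration):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-sketch")
         .master("yarn")              # assumes the Hadoop/YARN client configuration is available
         .enableHiveSupport()         # lets Spark query existing Hive tables
         .getOrCreate())

hdfs_df = spark.read.parquet("hdfs:///warehouse/events/")   # assumed HDFS path
hive_df = spark.sql("SELECT * FROM sales.orders LIMIT 10")  # assumed Hive table

hdfs_df.show(5)
hive_df.show()

spark.stop()
```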

6. Active and expanding community

Developers from over 300 companies have contributed to designing and building Apache Spark. Since 2009, more than 1200 developers have actively contributed to making Spark what it is today! Naturally, Spark is backed by an active community of developers who work to continually improve its features and performance. To reach out to the Spark community, you can use the mailing lists for any queries, and you can also attend Spark meetup groups and conferences.

Benefits of Apache Spark:

1. Speed:

When it comes to Big Data, processing speed always matters. Apache Spark is wildly popular with data scientists because of its speed. Spark can be up to 100x faster than Hadoop for large-scale data processing, because Spark uses an in-memory (RAM) computing system whereas Hadoop MapReduce writes intermediate data to disk. Spark can handle multiple petabytes of data clustered across more than 8,000 nodes at a time.

2. Ease of Use:

Apache Spark carries easy-to-use APIs for operating on large datasets. It offers over 80 high-level operators that make it easy to build parallel apps.

3. Advanced Analytics:

Spark not only supports ‘map’ and ‘reduce’ operations; it also supports machine learning (ML), graph algorithms, streaming data, SQL queries, and more.

4. Dynamic in Nature:

With Apache Spark, you can easily develop parallel applications. Spark offers you over 80 high-level operators.

5. Multilingual:

Apache Spark supports many languages for code writing such as Python, Java, Scala, etc.

6. Apache Spark is powerful:

Apache Spark can handle many analytics challenges because of its low-latency in-memory data processing capability. It has well-built libraries for graph analytics algorithms and machine learning.

7. Increased access to Big data:

Apache Spark is opening up various opportunities for big data and making big data more accessible. IBM, for example, has announced that it will educate more than 1 million data engineers and data scientists on Apache Spark.

8. Demand for Spark Developers:

Apache Spark not only benefits your organization but you as well. Spark developers are so in demand that companies offer attractive benefits and flexible work timings just to hire experts skilled in Apache Spark. As per PayScale, the average salary for a Data Engineer with Apache Spark skills is $100,362. People who want to make a career in big data technology can learn Apache Spark. You will find various ways to bridge the skills gap for data-related jobs, but the best way is to take formal training that provides hands-on work experience and lets you learn through hands-on projects.

Cons of Apache Spark:

1. No automatic optimization process:

In the case of Apache Spark, you need to optimize the code manually since it doesn’t have any automatic code optimization process. This will turn into a disadvantage when all the other technologies and platforms are moving towards automation.

2. File Management System:

Apache Spark doesn’t come with its own file management system. It depends on some other platforms like Hadoop or other cloud-based platforms.

3. Fewer Algorithms:

Spark's machine learning library, MLlib, offers comparatively few algorithms; it lags behind some alternatives in terms of the number of available algorithms.

4. Small Files Issue:

Another drawback of Apache Spark is the issue with small files. Developers run into it when using Apache Spark along with Hadoop, because the Hadoop Distributed File System (HDFS) is designed for a limited number of large files rather than a large number of small files.

5. Window Criteria:

In Spark Streaming, data is divided into small batches of a predefined time interval, so Spark does not support record-based window criteria; rather, it offers time-based window criteria.
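
For illustration, here is a minimal sketch of what time-based windowing looks like (PySpark DStream API; the host, port, and durations are assumptions): a 30-second window re-evaluated every 10 seconds, with no way to ask for "the last N records".

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "window-sketch")
ssc = StreamingContext(sc, batchDuration=10)      # data is divided into 10-second micro-batches
ssc.checkpoint("checkpoint-dir")                  # windowed operations require checkpointing (assumed path)

lines = ssc.socketTextStream("localhost", 9999)   # assumed source

# Count the records seen in the last 30 seconds, recomputed every 10 seconds.
# The window is defined purely by time, never by a number of records.
windowed_counts = lines.countByWindow(windowDuration=30, slideDuration=10)
windowed_counts.pprint()

ssc.start()
ssc.awaitTermination()
```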

6. Doesn’t suit for a multi-user environment:

Yes, Apache Spark doesn’t fit for a multi-user environment. It is not capable of handling more users concurrency.

To sum up, weighing the good against the bad, Spark remains a compelling tool. We have seen drastic improvements in performance and a decrease in failures across various projects executed in Spark, and many applications are being moved to Spark for the efficiency it offers developers. Using Apache Spark can give any business a boost and help foster its growth, and learning it can open up a bright future for you as well!

