Apache Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Spark comes packed with a wide range of libraries for Machine Learning (ML) algorithms and graph algorithms. Beyond that, it also supports real-time streaming and SQL applications via Spark Streaming and Spark SQL (originally Shark), respectively. The best part about using Spark is that you can write Spark apps in Java, Scala, or Python, and these apps can run up to ten times faster on disk and up to 100 times faster in memory than MapReduce apps.
Apache Spark History
Apache Spark started as a research project at the UC Berkeley AMPLab in 2009 and was open-sourced in early 2010. Many of the ideas behind the system were presented in various research papers over the years.
After being released, Spark grew into a broad developer community and moved to the Apache Software Foundation in 2013. Today, the project is developed collaboratively by a community of hundreds of developers from hundreds of organizations.
Spark Architecture Overview
Apache Spark has a well-defined layered architecture in which all the Spark components and layers are loosely coupled. This architecture is further integrated with various extensions and libraries. The Apache Spark architecture is based on two main abstractions: the Resilient Distributed Dataset (RDD) and the Directed Acyclic Graph (DAG).
But before diving any deeper into the Spark architecture, let me explain a few fundamental concepts of Spark, like the Spark ecosystem and RDDs. This will help you gain better insights.
Let me first explain what the Spark ecosystem is.
Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc. These standard libraries allow for seamless integration in complex workflows. On top of this, various services like MLlib, GraphX, SQL + DataFrames, and streaming services can integrate with it to extend its capabilities.
Spark Eco-System
The Spark ecosystem is composed of various components: Spark SQL, Spark Streaming, MLlib, GraphX, and the Core API component.
Working of Spark Architecture
Now that you have seen the basic architectural overview of Apache Spark, let's dive deeper into how it works.
On your master node, you have the driver program, which drives your application. The code you write behaves as the driver program, or, if you are using the interactive shell, the shell acts as the driver program.
Inside the driver program, the first thing you do is create a Spark context. Think of the Spark context as a gateway to all Spark functionality. It is similar to a database connection: any command you execute against your database goes through the database connection, and likewise, anything you do in Spark goes through the Spark context.
Now, this Spark context works with the cluster manager to manage various jobs. The driver program and Spark context take care of job execution within the cluster. A job is split into multiple tasks, which are distributed over the worker nodes. Whenever an RDD is created in the Spark context, it can be distributed across various nodes and cached there.
Worker nodes are the nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, which then return the results to the Spark context.
The Spark context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform the operations, and return the results to the Spark context.
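The driver/worker flow above can be sketched in plain Python (this is a conceptual illustration, not the Spark API): the data is split into partitions, one task runs per partition, and the driver merges the partial results.

```python
# Conceptual sketch (plain Python, not the Spark API): the driver splits
# a job into one task per partition and farms the tasks out to workers.
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))
num_partitions = 4
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def task(partition):
    # Each "worker" runs the same task on its own partition.
    return sum(x * x for x in partition)

with ThreadPoolExecutor(max_workers=num_partitions) as workers:
    partial_results = list(workers.map(task, partitions))

# The driver collects the partial results, as the Spark context does.
result = sum(partial_results)
print(result)
```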
Applications of Apache Spark
As the adoption of Spark across industries continues to rise, it is giving rise to unique and varied Spark applications. These applications are being successfully implemented in real-world scenarios. Let's take a look at some of the most exciting Spark applications of our time!
1. Processing Streaming Data
The most wonderful aspect of Apache Spark is its ability to process streaming data. Every second, an unprecedented amount of data is generated globally. This pushes companies and businesses to process data in large volumes and analyze it in real time. The Spark Streaming feature can handle this workload efficiently. By unifying disparate data processing capabilities, Spark Streaming allows developers to use a single framework to accommodate all their processing requirements.
2. Machine Learning
Spark has commendable Machine Learning abilities. It is equipped with an integrated framework for performing advanced analytics that allows you to run repeated queries on datasets, which, in essence, is what Machine Learning algorithms do. The Machine Learning Library (MLlib) is one of Spark's most potent ML components.
This library can perform clustering, classification, dimensionality reduction, and much more. With MLlib, Spark can be used for many Big Data functions such as sentiment analysis, predictive intelligence, customer segmentation, and recommendation engines, among other things.
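To make the clustering idea concrete, here is a tiny one-dimensional k-means in plain Python. This is the kind of iterative workload MLlib runs at scale; the code below is a conceptual sketch, not the MLlib API, and the data points are invented for illustration.

```python
# Conceptual 1-D k-means with k=2 (plain Python, not the MLlib API).
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(2), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = sorted(kmeans_1d(points, [0.0, 10.0]))
print(centers)  # [1.0, 9.0]
```

MLlib applies the same assign-then-update loop, but with the assignment step distributed across the partitions of an RDD or DataFrame.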
3. Fog Computing
The concept of Fog Computing is deeply entwined with the Internet of Things (IoT). IoT thrives on the idea of embedding objects and devices with sensors that can communicate with each other and with the user, creating an interconnected web of devices and users. As more users adopt IoT platforms and join this web of interconnected devices, the amount of data generated grows beyond comprehension. Fog computing pushes processing out to the edge of the network, close to where that data is generated, and Spark's streaming and low-latency processing capabilities make it a strong fit for this decentralized model.
Apache Spark Features
1. Lightning-fast processing speed
Big Data processing is all about processing large volumes of complex data. Hence, when it comes to Big Data processing, organizations and enterprises want frameworks that can process massive amounts of data at high speed. As mentioned earlier, Spark apps can run up to 100x faster in memory and 10x faster on disk in Hadoop clusters.
Spark relies on the Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and read/write it to disk only when needed. This helps to reduce most of the disk read and write time during data processing.
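The benefit of keeping data in memory can be sketched in plain Python (a conceptual illustration, not the Spark API): an expensive result is computed once, cached, and served from memory on every later access instead of being recomputed or re-read from disk.

```python
# Conceptual sketch of RDD-style caching (plain Python, not the Spark API).
class CachedDataset:
    def __init__(self, compute):
        self._compute = compute   # the "lineage": how to (re)build the data
        self._cache = None        # in-memory copy after first evaluation
        self.computations = 0     # counts how often we actually compute

    def collect(self):
        if self._cache is None:
            self.computations += 1
            self._cache = self._compute()
        return self._cache

ds = CachedDataset(lambda: [x * x for x in range(5)])
print(ds.collect())      # first access: computed
print(ds.collect())      # second access: served from memory
print(ds.computations)   # 1
```

A real RDD keeps the lineage around as well, so a lost in-memory partition can be rebuilt from it, which is how caching coexists with fault tolerance.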
2. Ease of use
Spark allows you to write scalable applications in Java, Scala, Python, and R. So, developers get the scope to create and run Spark applications in their preferred programming languages. Moreover, Spark is equipped with a built-in set of over 80 high-level operators. You can use Spark interactively to query data from Scala, Python, R, and SQL shells.
3. It offers support for sophisticated analytics
Not only does Spark support simple "map" and "reduce" operations, but it also supports SQL queries, streaming data, and advanced analytics, including ML and graph algorithms. It comes with a powerful stack of libraries such as SQL & DataFrames, MLlib (for ML), GraphX, and Spark Streaming. What's fascinating is that Spark lets you combine the capabilities of all these libraries within a single workflow or application.
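The "map" and "reduce" pattern that Spark generalizes can be shown with a classic word count in plain Python (a conceptual sketch, not the Spark API; the input lines are made up for illustration):

```python
# Word count in the map/reduce style that Spark generalizes
# (plain Python, not the Spark API).
from functools import reduce

lines = ["spark is fast", "spark is flexible"]

# Map phase: split each line into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce phase: sum the counts per key.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, pairs, {})
print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'flexible': 1}
```

In Spark the same pipeline would be a `flatMap` followed by `reduceByKey`, with the reduce step running in parallel per partition.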
4. Real-time stream processing
Spark is designed to handle real-time data streaming. While MapReduce is built to handle and process the data that is already stored in Hadoop clusters, Spark can do both and also manipulate data in real-time via Spark Streaming.
Unlike other streaming solutions, Spark Streaming can recover lost work and deliver exactly-once semantics out of the box, without requiring extra code or configuration. Plus, it also lets you reuse the same code for batch and stream processing, and even for joining streaming data with historical data.
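The batch/stream code-reuse idea can be sketched in plain Python (a conceptual illustration, not the Spark API): one transformation function is applied to a complete batch and, record by record, to a stream, yielding identical results.

```python
# Conceptual sketch of reusing one transformation for batch and streaming
# (plain Python, not the Spark API).
def transform(record):
    return record * 2  # the shared business logic

batch = [1, 2, 3, 4]
batch_result = [transform(r) for r in batch]      # batch job

def stream(records):
    for r in records:        # records arrive one at a time
        yield transform(r)   # same code path as the batch job

stream_result = list(stream(iter(batch)))         # streaming job
print(batch_result == stream_result)  # True
```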
5. It is flexible
Spark can run independently in cluster mode, and it can also run on Hadoop YARN, Apache Mesos, Kubernetes, and even in the cloud. Furthermore, it can access diverse data sources. For instance, Spark can run on the YARN cluster manager and read existing Hadoop data from sources like HBase, HDFS, and Hive, as well as Cassandra. This makes Spark an ideal tool for migrating pure Hadoop applications, provided the apps' use case is Spark-friendly.
6. Active and expanding community
Developers from over 300 companies have contributed to the design and development of Apache Spark. Since 2009, more than 1,200 developers have actively contributed to making Spark what it is today! Naturally, Spark is backed by an active community of developers who work continually to improve its features and performance. To reach out to the Spark community, you can use the mailing lists for any queries, and you can also attend Spark meetup groups and conferences.
Benefits of Apache Spark:
1. Speed:
When it comes to Big Data, processing speed always matters. Apache Spark is wildly popular with data scientists because of its speed: Spark can be up to 100x faster than Hadoop for large-scale data processing. Apache Spark uses an in-memory (RAM) computing system, whereas Hadoop MapReduce reads and writes data to disk. Spark can handle multiple petabytes of data across clusters of more than 8,000 nodes at a time.
2. Ease of Use:
Apache Spark offers easy-to-use APIs for operating on large datasets. It provides over 80 high-level operators that make it easy to build parallel apps.
3. Advanced Analytics:
Spark not only supports "map" and "reduce" operations; it also supports Machine Learning (ML), graph algorithms, streaming data, SQL queries, etc.
4. Dynamic in Nature:
With Apache Spark, you can easily develop parallel applications, since it offers over 80 high-level operators.
5. Multilingual:
Apache Spark supports many languages for code writing such as Python, Java, Scala, etc.
6. Apache Spark is powerful:
Apache Spark can handle many analytics challenges because of its low-latency in-memory data processing capability. It has well-built libraries for graph analytics algorithms and machine learning.
7. Increased access to Big data:
Apache Spark is opening up various opportunities for Big Data and making it more accessible. For instance, IBM has announced that it will educate more than 1 million data engineers and data scientists on Apache Spark.
8. Demand for Spark Developers:
Apache Spark benefits not only your organization but you as well. Spark developers are so in demand that companies are offering attractive benefits and flexible work timings just to hire experts skilled in Apache Spark. According to PayScale, the average salary for a Data Engineer with Apache Spark skills is $100,362. People who want to build a career in Big Data can learn Apache Spark. You will find various ways to bridge the skills gap for data-related jobs, but the best way is to take formal training that provides hands-on experience through real projects.
Cons of Apache Spark:
1. No automatic optimization process:
In the case of Apache Spark, you need to optimize your code manually, since it doesn't have an automatic code optimization process. This becomes a disadvantage as other technologies and platforms move towards automation.
2. File Management System:
Apache Spark doesn't come with its own file management system. It relies on other platforms, such as Hadoop HDFS or cloud-based storage.
3. Fewer Algorithms:
Spark's Machine Learning library, MLlib, has fewer algorithms than some dedicated ML frameworks; it lags behind in the number of available algorithms.
4. Small Files Issue:
One more complaint about Apache Spark is its issue with small files. Developers run into small-file problems when using Apache Spark along with Hadoop, because the Hadoop Distributed File System (HDFS) is designed to store a limited number of large files rather than a large number of small files.
5. Window Criteria:
Data in Apache Spark is divided into small batches of a predefined time interval. As a result, Spark does not support record-based window criteria; rather, it offers time-based window criteria.
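Time-based windowing can be sketched in plain Python (a conceptual illustration, not the Spark API; the timestamps and values are invented): records carry timestamps and are grouped into fixed intervals, rather than into windows of a fixed record count.

```python
# Conceptual sketch of time-based windows (plain Python, not the Spark API).
window_seconds = 10
events = [(1, "a"), (4, "b"), (12, "c"), (19, "d"), (23, "e")]  # (ts, value)

windows = {}
for ts, value in events:
    # Each event falls into the window covering its timestamp.
    window_start = (ts // window_seconds) * window_seconds
    windows.setdefault(window_start, []).append(value)

print(windows)  # {0: ['a', 'b'], 10: ['c', 'd'], 20: ['e']}
```

Note that the grouping key is derived purely from time, so a window may hold any number of records, which is exactly why a "last N records" style window cannot be expressed this way.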
6. Not suited for a multi-user environment:
Yes, Apache Spark is not a good fit for a multi-user environment, as it is not capable of handling high user concurrency.
To sum up, weighing the good against the bad, Spark remains a compelling tool when viewed as a whole. Projects executed in Spark have seen drastic improvements in performance and decreases in failures. Many applications are being moved to Spark for the efficiency it offers developers. Using Apache Spark can give any business a boost and help foster its growth, and learning it can do the same for your career!