Spark: The most popular big data processing framework
Here is another article of mine on big data and cloud technologies. In this one, I am going to talk about Spark, the distributed big data processing framework. There is plenty of material available on the internet about Spark's architecture and internal workings, so I will not go into those details. Instead, I will highlight why Spark is so popular, and what the key considerations and challenges are if you want to use it.
After Python and SQL, Spark is my favorite technology and I really like using it for data processing workloads. Using the power of Python and Spark, I was able to solve a very challenging issue my client was facing, and that is when I started liking Spark. This article is purely based on my own experience, and the views are personal.
Before Spark, Hadoop's MapReduce was the dominant processing framework for big data workloads. But MapReduce has a few limitations: it is disk based and operates only in batch mode. These limitations paved the way for Spark. Spark is memory based (it uses disk as well) and supports batch as well as real-time processing and analytics. Spark is commonly cited as being up to 100x faster than MapReduce for in-memory computation and up to 10x faster for disk-based computation. Spark's performance is really good. If you are using DataFrames, a lot of optimization is done by Spark internally (through an optimizer called Catalyst). You will be able to process millions of records within seconds. You will be amazed by the speed of Spark. Many people say that Spark has lightning speed, and I agree; this is not exaggerated.
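To make this concrete, here is a minimal PySpark sketch (the data and column names are made up for illustration) showing the kind of DataFrame code that Catalyst optimizes; explain() prints the plan it generates:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (assumes PySpark is installed)
spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# A tiny in-memory DataFrame; real workloads would read millions of rows
df = spark.createDataFrame(
    [(1, "retail", 120.0), (2, "retail", 80.0), (3, "online", 200.0)],
    ["order_id", "channel", "amount"],
)

# A typical filter + aggregate; Catalyst rewrites this into an optimized plan
result = (
    df.filter(F.col("amount") > 100)
      .groupBy("channel")
      .agg(F.sum("amount").alias("total"))
)

# Show the logical and physical plans that Catalyst generated, then the result
result.explain(True)
result.show()
```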
Please note one important point: Spark is also part of the Hadoop ecosystem. It is a data processing framework and does not have its own storage. You can use other Hadoop frameworks alongside Spark to enhance its capabilities and use it effectively, such as HDFS for storage, ZooKeeper as a coordination and synchronization service, and YARN for resource management.
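For illustration only (the HDFS path and the YARN master setting below are assumptions, not from the article), this is roughly how a PySpark job reads data from HDFS on a YARN-managed cluster:

```python
from pyspark.sql import SparkSession

# "yarn" as the master is usually supplied via spark-submit; shown inline only as a sketch
spark = (
    SparkSession.builder
    .appName("hdfs-read-demo")
    .master("yarn")  # resource management handled by YARN
    .getOrCreate()
)

# Hypothetical HDFS location; Spark itself provides no storage layer
df = spark.read.parquet("hdfs:///data/sales/2021/")
df.printSchema()
```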
Spark is an open source framework and it has huge community support. The Spark library is getting richer every day. Spark's core APIs are available in Java, Scala, R and Python, so you can choose your favorite programming language's API to work with Spark. I am a Python person, so I use the Python API called PySpark. Additionally, Spark ships with higher-level libraries such as GraphX, MLlib, Spark SQL and Spark Streaming.
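As one small taste of these libraries, here is a minimal Structured Streaming sketch (the built-in "rate" source is used here only as a stand-in for a real stream such as Kafka):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The 'rate' source generates rows continuously; swap in Kafka or files for real use
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Write each micro-batch to the console; the query runs until stopped
query = stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination(30)  # let the sketch run for ~30 seconds
```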
Spark's main data abstraction is called the resilient distributed dataset (RDD). Later, the DataFrame and Dataset abstractions were released on top of RDDs. A DataFrame is similar to a table and is the most widely used. A DataFrame has a schema definition (similar to DDL). DataFrames make coding easy because you feel like you are working with tables; if you are from a data warehousing background, you will like it. If you prefer to work using SQL, just register the DataFrame as a table and write Spark SQL to manipulate and transform the data, as in the sketch below. Even then you need some basic knowledge of a programming language, but if you want to fully exploit the power of Spark, expertise in Python, Scala or Java (any one of them) will be very useful. This brings me to the conclusion that if you combine the power of SQL and a programming language, you can solve very complex problems and achieve wonderful results.
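A minimal sketch of that SQL-first style (the table and column names are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "retail", 120.0), (2, "online", 200.0), (3, "retail", 80.0)],
    ["order_id", "channel", "amount"],
)

# Register the DataFrame as a temporary view so it can be queried like a table
orders.createOrReplaceTempView("orders")

# Pure SQL on top of the DataFrame; handy if you come from a data warehousing background
top_channels = spark.sql("""
    SELECT channel, SUM(amount) AS total_amount
    FROM orders
    GROUP BY channel
    ORDER BY total_amount DESC
""")
top_channels.show()
```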
For every job, the Spark driver builds a logical flow of operations that can be represented as a directed acyclic graph, called the DAG. If a partition of data is lost, Spark can replay this lineage to recompute it, and that is what makes Spark fault-tolerant.
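You can actually look at this lineage yourself; as a small sketch, toDebugString() on an RDD prints the chain of operations Spark would replay to rebuild a lost partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

# A short chain of transformations; nothing executes until an action is called
rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)

# The lineage (DAG) Spark keeps so it can recompute lost partitions after a failure
print(rdd.toDebugString().decode("utf-8"))
print(rdd.collect())
```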
Spark supports user defined functions (UDFs). You don't always have to depend on the standard library methods; you can write your own custom functions in Python, Scala, Java, etc. and call them to transform the data. This gives you the ability to develop your own functions and libraries to achieve complex functionality. If you are dealing with dynamic, semi-structured or unstructured data, UDFs are a very powerful tool for handling it. The programming language gives you immense power to control data transformations, manipulations and flow while working with Spark jobs.
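A minimal PySpark UDF sketch (the cleanup logic and column names here are hypothetical, just to show the mechanics):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

df = spark.createDataFrame(
    [("  alice SMITH ",), ("BOB jones",)],
    ["raw_name"],
)

# Plain Python function with custom logic that the built-in functions don't cover
def clean_name(value):
    return value.strip().title() if value else None

# Wrap it as a UDF and declare the return type
clean_name_udf = udf(clean_name, StringType())

df.withColumn("name", clean_name_udf("raw_name")).show()
```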
We can conclude that the performance and speed, the ability to handle a variety of data, and the strong community support make Spark a very popular choice for data processing.
But wait, everything is not greener on the other side of the fence. There are many things you need to take care of while working with Spark. It is like a kitchen that has all the raw materials available, but you need to choose the right ingredients to cook a tasty meal; if the wrong ingredients are picked, the food may taste terrible. You should know how to distribute data evenly, allocate the proper amount of memory, allocate the right number of executors, and assign the right number of cores to each executor. Spark has many configuration variables and parameters, and they need to be set and adjusted properly. If they are not, there will be many performance issues and errors, and you may have a tough time fixing them.
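For illustration (every number below is an assumption, not a recommendation, and must be tuned to your own data and cluster), these are the kinds of knobs I am talking about:

```python
from pyspark.sql import SparkSession

# Example values only; the right settings depend entirely on your workload and cluster
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config("spark.executor.instances", "4")        # how many executors
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions created after shuffles
    .getOrCreate()
)

df = spark.range(1_000_000)

# Repartitioning controls how evenly the data is spread across executors
df = df.repartition(64)
print(df.rdd.getNumPartitions())
```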
You enjoy driving a manual car more than an automatic if you know how to properly coordinate the clutch, brake and gears; a small mistake in coordination can stall the engine. Spark is similar to a manual car. If you know how to strike the right balance between partitions, memory, executors and cores, you will enjoy working with it.
I will end this article here. What is your take on Spark? Do let me know in the comments section. Happy Spark coding and learning!