Spark: The most popular big data processing framework

Here is another of my articles on big data and cloud technologies. In this one, I am going to talk about Spark, the distributed big data processing framework. There is plenty of material available on the internet about Spark's architecture and internal workings, so I will not go into those details. Instead, I will highlight why Spark is so popular and what the key considerations and challenges are if you want to use it.

After Python and SQL, Spark is my favorite technology, and I really like using it for data processing workloads. Using the combined power of Python and Spark, I was once able to solve a very challenging issue my client was facing, and that is when I started liking Spark. This article is based purely on my own experience, and the views are personal.

Before Spark, Hadoop's MapReduce used to be the dominant processing framework for big data workloads. But MapReduce has a few limitations: it is disk based and operates only in batch mode. These limitations paved the way for Spark. Spark is memory based (it uses disk too) and supports batch as well as real-time processing/analytics. Spark is claimed to run up to 100x faster than MapReduce for in-memory computation and up to 10x faster on disk. Spark's performance is really good. If you are using the DataFrame API, a lot of optimization is done internally by Spark's optimizer, called Catalyst. You will be able to process millions of records within seconds, and you will be amazed at the speed. Many people say that Spark has lightning speed, and I agree; it is not an exaggeration.
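
To make the in-memory point concrete, here is a minimal sketch using Spark's Python API (more on language APIs below). The file path is a placeholder; cache() simply keeps the data in memory after the first action, so later passes avoid re-reading from disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Placeholder path: any dataset would do here.
df = spark.read.parquet("data/events.parquet")

# cache() keeps the DataFrame in memory after the first action,
# so repeated passes over the same data avoid re-reading from disk.
df.cache()
print(df.count())  # first action: reads from storage and fills the cache
print(df.count())  # second action: served from memory
```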

Please note one important point: Spark is also part of the Hadoop ecosystem. It is a data processing framework and doesn't have its own storage. You can use other Hadoop frameworks with Spark to enhance its capabilities, such as HDFS for storage, ZooKeeper for synchronization services, and YARN for resource management.

Spark is an open source framework with huge community support, and its library is getting richer every day. Spark's core APIs are available in Java, Scala, R, and Python, so you can choose your favorite programming language to work with it. I am a Python person, so I use the Python API, called PySpark. Additionally, Spark ships higher-level libraries such as GraphX, MLlib, Spark SQL, and Spark Streaming.
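
As a taste of PySpark, here is a minimal sketch that starts a session and reads a file into a DataFrame. The app name and file path are placeholders for the example.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; "demo-app" is just an illustrative name.
spark = SparkSession.builder.appName("demo-app").getOrCreate()

# Read a CSV file into a DataFrame, inferring column types from the data.
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
```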

Spark's main data abstraction is the resilient distributed dataset (RDD). Later, the DataFrame and Dataset abstractions were released on top of RDDs. A DataFrame is similar to a table and is the most widely used: it has a schema definition (similar to DDL), and it makes coding easy because you feel as if you are working with tables. If you come from a data warehousing background, you will like it. If you prefer to work with SQL, just register the DataFrame as a table and write Spark SQL to manipulate and transform the data; that way you need only basic knowledge of the programming language. But if you want to fully exploit the power of Spark, expertise in any one of Python, Scala, or Java will be very useful. This brings me to the conclusion that if you combine the power of SQL and a programming language, you can solve very complex problems and achieve wonderful results.
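
Here is a small illustrative sketch of that SQL-on-DataFrame workflow; the table name, columns, and values are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# A tiny in-memory DataFrame; the columns and values are made up.
orders = spark.createDataFrame(
    [("c1", 120.0), ("c2", 75.5), ("c1", 30.0)],
    ["customer_id", "amount"],
)

# Register the DataFrame as a temporary view, then query it with plain SQL.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
""").show()
```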

For every job, the Spark driver builds a logical flow of operations represented as a directed acyclic graph (DAG). Because the lineage of every dataset is recorded in this graph, Spark can recompute lost partitions after a failure; that is what makes Spark fault-tolerant.
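
You can actually peek at the plan Spark builds. Reusing the orders DataFrame from the previous sketch: transformations are lazy and only extend the plan, while explain() prints the physical plan derived from it, and nothing executes until an action is called.

```python
# Transformations are lazy: Spark only extends the plan until an action runs.
filtered = orders.filter(orders.amount > 50).groupBy("customer_id").count()
filtered.explain()  # prints the physical plan Spark derived from the DAG
filtered.show()     # the action that actually triggers the computation
```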

Spark supports user-defined functions (UDFs), so you don't always have to depend on the standard library methods. You can write your own custom functions in Python, Scala, Java, etc., and call them to transform the data. This gives you the ability to develop your own functions and libraries for complex functionality. If you are dealing with dynamic, semi-structured, or unstructured data, UDFs are a very powerful tool for handling it. The programming language gives you immense power to control data transformations, manipulations, and flow within your Spark jobs.
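
Here is a small, illustrative PySpark UDF; the normalization logic and the sample data are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([(" alice ",), ("BOB",)], ["name"])

# A custom Python function wrapped as a Spark UDF.
@udf(returnType=StringType())
def normalize_name(name):
    return name.strip().title() if name else None

df.withColumn("clean_name", normalize_name(df["name"])).show()
```

One caveat worth knowing: Python UDFs run outside the Catalyst optimizer, so prefer Spark's built-in functions (here, trim and initcap would do the same job) whenever they exist.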

We can conclude that its performance, its ability to handle a variety of data, and its community support make Spark a very popular choice for data processing.

But wait, the grass is not always greener on the other side of the fence. There are many things you need to take care of while working with Spark. It is like a kitchen that has all the raw materials available, but you need to choose the right ingredients to cook a tasty meal; pick the wrong ones and the food may taste terrible. You should know how to distribute data evenly, allocate proper memory, allocate the right number of executors, and assign the right number of cores to each executor. Spark has many configuration parameters, and they need to be set and adjusted properly. If they are not, there will be performance issues and errors, and you may have a tough time fixing them.
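
For illustration, here is one way to set a few of those knobs when building a session. All values below are placeholders; the right numbers depend entirely on your cluster and data volume, and on a real cluster they are usually passed to spark-submit rather than hard-coded.

```python
from pyspark.sql import SparkSession

# Placeholder values only; tune these for your own cluster and workload.
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.instances", "10")       # how many executors
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions after a shuffle
    .getOrCreate()
)
```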

You enjoy driving a manual car more than an automatic if you know how to coordinate the clutch, brake, and gears; a small mistake in coordination can stall the engine. Spark is similar to a manual car: if you know how to coordinate partitions, memory, executors, and cores, you will enjoy working with it.

I will end this article here. What is your take on Spark? Do let me know in the comments section. Happy Spark coding and learning!

