Spark: The most popular big data processing framework
Here is another article of mine on big data and cloud technologies. In this one, I am going to talk about Spark, the distributed big data processing framework. There is plenty of material available on the internet about Spark's architecture and internal workings, so I will not go into those details. Instead, I will highlight why Spark is so popular, and what the key considerations and challenges are if you want to use it.
After Python and SQL, Spark is my favorite technology and I really like using it for data processing workloads. Using the power of Python and Spark, I was able to solve a very challenging issue my client was facing, and that is when I started liking Spark. This article is purely based on my own experience, and the views are personal.
Before Spark, Hadoop's MapReduce was the dominant processing framework for big data workloads. But MapReduce has a few limitations: it is disk based and operates only in batch mode. These limitations paved the way for Spark. Spark is memory based (it uses disk as well) and supports batch as well as real-time processing and analytics. Spark is commonly cited as being up to 100x faster than MapReduce for in-memory computation and up to 10x faster for disk-based computation. Spark's performance is really good. If you are using DataFrames, a lot of optimization is done by Spark internally (through an optimizer called Catalyst). You will be able to process millions of records within seconds. You will be amazed by the speed of Spark. Many people say that Spark has lightning speed, and I agree; this is not exaggerated.
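To make this concrete, here is a minimal PySpark sketch (the data and column names are made up for illustration) showing the kind of DataFrame code that Catalyst optimizes; explain() prints the plan it generates:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (assumes PySpark is installed)
spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# A tiny in-memory DataFrame; real workloads would read millions of rows
df = spark.createDataFrame(
    [(1, "retail", 120.0), (2, "retail", 80.0), (3, "online", 200.0)],
    ["order_id", "channel", "amount"],
)

# A typical filter + aggregate; Catalyst rewrites this into an optimized plan
result = (
    df.filter(F.col("amount") > 100)
      .groupBy("channel")
      .agg(F.sum("amount").alias("total"))
)

# Show the logical and physical plans that Catalyst generated, then the result
result.explain(True)
result.show()
```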
Please note one important point: Spark is also part of the Hadoop ecosystem. It is a data processing framework and does not have its own storage. You can use other Hadoop frameworks alongside Spark to enhance its capabilities and use it effectively, such as HDFS for storage, ZooKeeper as a coordination and synchronization service, and YARN for resource management.
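For illustration only (the HDFS path and the YARN master setting below are assumptions, not from the article), this is roughly how a PySpark job reads data from HDFS on a YARN-managed cluster:

```python
from pyspark.sql import SparkSession

# "yarn" as the master is usually supplied via spark-submit; shown inline only as a sketch
spark = (
    SparkSession.builder
    .appName("hdfs-read-demo")
    .master("yarn")  # resource management handled by YARN
    .getOrCreate()
)

# Hypothetical HDFS location; Spark itself provides no storage layer
df = spark.read.parquet("hdfs:///data/sales/2021/")
df.printSchema()
```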
Spark is an open source framework and it has huge community support. The Spark library is getting richer every day. Spark's core APIs are available in Java, Scala, R and Python, so you can choose your favorite programming language's API to work with Spark. I am a Python person, so I use the Python API called PySpark. Additionally, Spark ships with higher-level libraries such as GraphX, MLlib, Spark SQL and Spark Streaming.
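As one small taste of these libraries, here is a minimal Structured Streaming sketch (the built-in "rate" source is used here only as a stand-in for a real stream such as Kafka):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The 'rate' source generates rows continuously; swap in Kafka or files for real use
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Write each micro-batch to the console; the query runs until stopped
query = stream.writeStream.format("console").outputMode("append").start()
query.awaitTermination(30)  # let the sketch run for ~30 seconds
```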
Spark's main data abstraction is called the resilient distributed dataset (RDD). Later, the DataFrame and Dataset abstractions were released on top of RDDs. A DataFrame is similar to a table and is the most widely used. A DataFrame has a schema definition (similar to DDL). DataFrames make coding easy because you feel like you are working with tables; if you are from a data warehousing background, you will like it. If you prefer to work using SQL, just register the DataFrame as a table and write Spark SQL to manipulate and transform the data, as in the sketch below. Even then you need some basic knowledge of a programming language, but if you want to fully exploit the power of Spark, expertise in Python, Scala or Java (any one of them) will be very useful. This brings me to the conclusion that if you combine the power of SQL and a programming language, you can solve very complex problems and achieve wonderful results.
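A minimal sketch of that SQL-first style (the table and column names are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "retail", 120.0), (2, "online", 200.0), (3, "retail", 80.0)],
    ["order_id", "channel", "amount"],
)

# Register the DataFrame as a temporary view so it can be queried like a table
orders.createOrReplaceTempView("orders")

# Pure SQL on top of the DataFrame; handy if you come from a data warehousing background
top_channels = spark.sql("""
    SELECT channel, SUM(amount) AS total_amount
    FROM orders
    GROUP BY channel
    ORDER BY total_amount DESC
""")
top_channels.show()
```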
For every job, the Spark driver builds a logical flow of operations that can be represented as a directed acyclic graph, called the DAG. If a partition of data is lost, Spark can replay this lineage to recompute it, and that is what makes Spark fault-tolerant.
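You can actually look at this lineage yourself; as a small sketch, toDebugString() on an RDD prints the chain of operations Spark would replay to rebuild a lost partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

# A short chain of transformations; nothing executes until an action is called
rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)

# The lineage (DAG) Spark keeps so it can recompute lost partitions after a failure
print(rdd.toDebugString().decode("utf-8"))
print(rdd.collect())
```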
Spark supports user defined functions (UDFs). You don't always have to depend on the standard library methods; you can write your own custom functions in Python, Scala, Java, etc. and call them to transform the data. This gives you the ability to develop your own functions and libraries to achieve complex functionality. If you are dealing with dynamic, semi-structured or unstructured data, UDFs are a very powerful tool for handling it. The programming language gives you immense power to control data transformations, manipulations and flow while working with Spark jobs.
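A minimal PySpark UDF sketch (the cleanup logic and column names here are hypothetical, just to show the mechanics):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

df = spark.createDataFrame(
    [("  alice SMITH ",), ("BOB jones",)],
    ["raw_name"],
)

# Plain Python function with custom logic that the built-in functions don't cover
def clean_name(value):
    return value.strip().title() if value else None

# Wrap it as a UDF and declare the return type
clean_name_udf = udf(clean_name, StringType())

df.withColumn("name", clean_name_udf("raw_name")).show()
```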
We can conclude that the performance and speed, the ability to handle a variety of data, and the strong community support make Spark a very popular choice for data processing.
But wait, everything is not greener on the other side of the fence. There are many things you need to take care of while working with Spark. It is like a kitchen that has all the raw materials available, but you need to choose the right ingredients to cook a tasty meal; if the wrong ingredients are picked, the food may taste terrible. You should know how to distribute data evenly, allocate the proper amount of memory, allocate the right number of executors, and assign the right number of cores to each executor. Spark has many configuration variables and parameters, and they need to be set and adjusted properly. If they are not, there will be many performance issues and errors, and you may have a tough time fixing them.
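For illustration (every number below is an assumption, not a recommendation, and must be tuned to your own data and cluster), these are the kinds of knobs I am talking about:

```python
from pyspark.sql import SparkSession

# Example values only; the right settings depend entirely on your workload and cluster
spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config("spark.executor.instances", "4")        # how many executors
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions created after shuffles
    .getOrCreate()
)

df = spark.range(1_000_000)

# Repartitioning controls how evenly the data is spread across executors
df = df.repartition(64)
print(df.rdd.getNumPartitions())
```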
You enjoy driving a manual car more than an automatic if you know how to properly coordinate the clutch, brake and gears; a small mistake in coordination can stall the engine. Spark is similar to a manual car. If you know how to strike the right balance between partitions, memory, executors and cores, you will enjoy working with it.
I will end this article here. What is your take on Spark? Do let me know in the comments section. Happy Spark coding and learning!