Anybody Can Learn Spark

Lots of people want to explore Spark, but most are not aware that it is easy to learn and work with if you follow a well-curated approach. Let me start by answering a few basic questions on your behalf, the very questions that probably made you click this article.

Can I also learn Spark?

Yes, anybody can learn it. Whoever you are: a software testing professional, a DBA, an ETL developer, a programmer in any language, or a fresh college grad.

How much time will it take me to learn Spark?

Well, this applies to every technology, not just Spark. Commit 20 hours of your time to reading blogs and tutorials or watching videos, and 20 hours to practice (hands-on coding). I am sure you will be in a fantastic position if you manage to do that.

Can I call myself a Spark developer after spending just 40 hours learning it?

Yes, absolutely, because by then you will have started talking about Spark and building data pipelines in it. At this point you may be a bad Spark developer, but that is the case for everybody. Your first 1,000 lines of code will be a disaster, the next 1,000 will be okay, and the rest will be history. Nobody writes bad code after 10,000+ lines. Are you that bad? No, right?

Okay, but what about those who say you have to spend at least six months, maybe more, to learn Spark?

They are also correct, but why wait until we become perfect? We start with 10-15% expertise and add to it as we progress, day by day, week by week, month by month. Technologies are moving at a rapid pace, so is our life, and new people are entering the field every year. We can't afford to just wait and watch. We need to learn quickly and stick our necks out to face the real challenges, and trust me, things get easy once we overcome the initial inertia.

Correct, but what about all the other buzzwords around Spark: Hadoop, the different programming languages (Scala / Python), SQL, streaming sources, this and that?

:) Yes, these will all be part of our learning process. Remember, we are investing 40 hours of initial learning. Those 40 hours will give you a good sense of what is what, what to learn, and what not to learn. Just keep in mind that learning everything in one go is not necessary. We can start with the "just enough" concept and learn only the things we need at the beginning.

Okay, but still, I have never coded anything in Python or Scala. I think it will be extremely difficult for me.

As I mentioned, your first 1,000 lines will be absolute garbage. You will get scared and annoyed, and you will laugh at yourself, but if you come out of that phase, you are on your way to becoming a Spark developer. Just remember: when you start a small breathing exercise, you struggle to do it for a minute, but if you are consistent, then after a certain number of days even 30 minutes is not enough for you.

Let me give you an example where we read a delimited file using Spark.

csvFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"

# The file is tab-separated (note the .tsv extension), so tell Spark the delimiter
df = spark.read.csv(csvFile, sep="\t", header=True)

If the code above does not scare you, continue reading the article. (Very easy, right?)

Okay, I am convinced. Now please tell me how to start.

Good. Now, lots of people will tell you lots of things, including myself. Just follow one tutorial end to end; that will give you a decent start. Let me take the liberty of showing you how to start, step by step.

Step 1: Create a free Community Edition account on Databricks (Databricks is the company founded by the creators of Apache Spark).

Here is the link -

Step 2: Follow the Databricks documentation. It has literally everything you need to learn.

Here is the link -

Step 3: Choose one programming language to learn Spark with. If you don't know any programming language, Python is a great choice; its learning curve is minimal. You can learn basic Python from W3Schools, but don't spend too much time on it. Things will come as you progress.

Here is the link -
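How much Python is "just enough"? Roughly this much: variables, lists, functions, and dictionaries. The example names below are made up, but these four patterns cover most of the Python you will see in everyday PySpark scripts.

```python
# Variables and lists
pages = ["home", "search", "checkout"]

# A function
def shout(name):
    return name.upper()

# A list comprehension: build one list from another
upper_pages = [shout(p) for p in pages]

# A dictionary (key -> value), handy for lookups and configs
views = {"home": 120, "search": 80}
views["checkout"] = 40

print(upper_pages)
print(sum(views.values()))
```

If you can read and modify this snippet comfortably, you know enough Python to start Step 4.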

Step 4: The first video you should watch is by Sameer Farooqui. It is a very old video, but it is pure gold. I have never seen anybody explain Spark so well. Some of the older concepts may no longer apply, but you should still watch it at least once, if not twice.

Here is the link -


Step 5: Keep enrolling in the free trainings conducted by Databricks by visiting their website and LinkedIn page. Also, every now and then, go through the webinars that have already been held. Of course, at this point just watch the basic ones. They are really, really good.

Here is the link:

Step 6: Don't forget to check the official Spark page.

Here is the link:

Okay. If you manage to follow all of these steps, I guarantee you will be on your way to becoming a fantastic Data Engineer working with Apache Spark on Databricks.

That means you will be set for at least the next five years, because Databricks (Spark) has massive job prospects, and I see lots of organisations moving their on-prem solutions into the cloud on Databricks. Spark professionals are also in high demand, with moderate to high salary prospects.

Best of luck on your learning journey!

Now, let me do a little bit of self-advertising :)

I have some articles on Spark (from basic to advanced) available on my LinkedIn profile. Do let me know your feedback in the comment section.

Here are the links:



My Page on Spark:

I also have a page where I post lots of code snippets and interesting information about Databricks, PySpark, and the other technologies associated with Spark. Please follow it.

Here is the link:

Also, if you are interested in joining a course on Apache Spark / Databricks, I conduct an end-to-end Data Engineering training course on Apache Spark (on the Databricks platform with PySpark), which you can enrol in by reaching out to me directly. I will share the detailed course contents, covering everything I will teach during the three weeks of live classes.

Best of luck with your learning goals!

Please like and share if you find my article useful, and share your feedback in the comment section.

