Anybody Can Learn Spark
Deepak Rajak
Data Engineering /Advanced Analytics Technical Delivery Lead at Exusia, Inc.
Lots of people want to explore Spark, but most of them are not aware that it is very easy to learn and work with if you follow a well-curated approach. Let me start by answering, on your behalf, a few very basic questions that probably made you click on this article.
Can I also learn Spark?
Yes, anybody can learn it, whoever you are: a software testing professional, a DBA, an ETL developer, a programmer in any language or a fresh college graduate.
How much time will it take me to learn Spark?
Well, this applies to every technology, not to Spark alone. Commit 20 hours of your time to reading blogs and tutorials and watching videos, and another 20 hours to practicing (hand coding). I am sure you will be in a fantastic position if you manage to do that.
Can I call myself a Spark developer after spending just 40 hours learning it?
Yes, absolutely you can call yourself a Spark developer, because by now you have started talking about Spark and building data pipelines in Spark. At this point you may be a bad Spark developer, but that is the case with everybody. Your first 1,000 lines of code will be a disaster, the next 1,000 lines will be okay, and the rest will be history. Hardly anybody still writes bad code after writing 10,000+ lines. Are you that bad? No, right?
OK, but what about those who say you have to spend at least 6 months, or maybe more, to learn Spark?
They are also correct, but why wait until we become perfect? We start with 10-15% expertise and add to it as we progress, day by day, week by week and month by month. Technologies are moving at a rapid pace, and so are our lives, and new people are entering the field with each passing year. We can't afford to just wait and watch. We need to learn quickly and put our necks right in the middle to face the real challenges, and trust me, things will get easier once we overcome the initial inertia.
Correct, but what about the other buzzwords around Spark, like Hadoop, different programming languages (Scala / Python), SQL, streaming sources, this and that?
:) Yes, these will all be part of our learning process. Remember, we are investing 40 hours in initial learning. Those 40 hours will give you a good sense of what is what, what to learn and what not to learn. Just keep in mind that learning everything in one go is not necessary. We can start with the "just enough" concept and learn only the things that are necessary for us at the start.
OK, but still, I have never coded anything in Python / Scala. I think it will be extremely difficult for me.
As I mentioned, your first 1,000 lines will be absolute garbage. You will get scared and annoyed, and you will laugh at yourself, but if you come out of that phase then you are on your way to becoming a Spark developer. Think of starting a small breathing exercise: at first you struggle to do it for a minute, but if you are consistent, after a certain number of days even 30 minutes are not enough for you.
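Let me give you an example where we read a delimited text file (here a tab-separated one) using Spark's CSV reader.
# `spark` is the SparkSession that a Databricks notebook provides for you
csvFile = "/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"
# The file is tab-separated, so tell the CSV reader which delimiter to use
df = spark.read.csv(csvFile, sep="\t")
If you are not scared by the above code, then continue reading the article. (Very easy, right?)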
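If you want to go one step further, here is a minimal sketch of the same idea that you can try without access to the training file. It assumes a local PySpark install (for example via pip); the helper names write_sample_tsv and load_tsv, the column names and the sample rows are all made up for illustration and are not taken from the real pageviews file. The header and inferSchema options ask Spark to use the first row as column names and to guess the column types.

```python
import csv
import os
import tempfile


def write_sample_tsv(path):
    """Write a tiny tab-separated file with made-up pageview rows."""
    rows = [
        ("timestamp", "site", "requests"),  # header row (hypothetical columns)
        ("2015-03-16T00:09:55", "mobile", "1595"),
        ("2015-03-16T00:10:39", "mobile", "1544"),
        ("2015-03-16T00:19:39", "desktop", "2460"),
    ]
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)
    return path


def load_tsv(spark, path):
    """Read a tab-separated file into a Spark DataFrame.

    `spark` is a SparkSession, e.g. the one a Databricks notebook provides.
    """
    return spark.read.csv(path, sep="\t", header=True, inferSchema=True)


def main():
    # Imported here so the file-writing helper above works without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: a single-threaded local session is enough for this demo.
    spark = SparkSession.builder.master("local[1]").appName("tsv-demo").getOrCreate()
    path = write_sample_tsv(os.path.join(tempfile.mkdtemp(), "sample.tsv"))
    df = load_tsv(spark, path)
    df.printSchema()  # column names and the types Spark inferred
    df.show()         # the rows of the sample file
    spark.stop()
```

Call main() to run the whole thing locally. In a Databricks notebook you would skip main() and pass the ready-made spark session straight to load_tsv along with the path of your own file.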
OK, I am convinced. Now please tell me how to start.
Yes, that's good. Now, lots of people will tell you lots of things, myself included. Just follow one tutorial end to end; that will give you a decent start. Let me take the liberty of showing you how to start, step by step.
Step1: Create a free Community Edition account on Databricks (Databricks is the company founded by the creators of Apache Spark).
Here is the link -
Step2: Follow the Databricks documentation. It has literally everything you need to learn.
Here is the link -
Step3: Choose one programming language for learning Spark. If you don't know any programming language, Python is a great choice, because its learning curve is gentle. You can learn basic Python from W3Schools, but don't spend too much time on it. Things will come as you progress.
Here is the link -
Step4: The first video you should watch is by Sameer Farooqui. It is a very old video, but it is pure gold. I have never seen anybody explain Spark so well. Some of the old concepts may not be applicable today, but you should still watch it at least once, if not twice.
Here is the link -
Step5: Keep enrolling in the free trainings conducted by Databricks, which you can find on their website and LinkedIn page. Also, every now and then, go through the recorded webinars. Of course, at this point just watch the basic ones. They are really, really good.
Here is the link:
Step6: Don't forget to check the official Spark page.
Here is the link:
OK, if you manage to follow all of these steps one by one, I guarantee that you will be on your way to becoming a fantastic Data Engineer on Apache Spark with Databricks.
That means you should be in a good position for at least the next 5 years, because Databricks (Spark) has massive job prospects and I see lots of organisations moving their on-prem solutions into the cloud on Databricks. Spark professionals are also in high demand, with moderate to high salary prospects.
Best of luck in your learning journey !!
Now, let me do a little bit of self-advertising :)
I have written some articles on Spark (from basic to advanced), which are available on my LinkedIn profile. Do let me know your feedback in the comment section.
Here are the links:
My Page on Spark:
I also have a page where I post lots of code snippets and interesting information about Databricks, PySpark and other technologies associated with Spark. Please follow it.
Here is the link:
Also, if you are interested in joining a course on Apache Spark / Databricks, I conduct an end-to-end Data Engineering training course on Apache Spark (on the Databricks platform with PySpark), in which you can enrol by reaching out to me directly. I will share the detailed course contents, which describe everything I will cover during the 3 weeks of live classes.
Best of luck with your learning goals!!
Please like / share if you find my article useful, and share your feedback in the comment section.