How Learning Apache Spark Makes You a Better Data Scientist

Knowledge of Apache Spark is in particularly high demand among data engineers. At least, that is my experience from teaching over 5,000 students how to pass the popular Databricks Spark Certification. But data engineers are not the only ones who can benefit from knowing their way around Apache Spark!

As a data scientist myself, I know that data scientists can benefit immensely from learning Apache Spark, today and even more so in the future!

Read on to learn how you, as a data scientist, can benefit from leveraging one of today’s most powerful big-data processing frameworks!


Data Scientists with Spark Skills Earn More

For data scientists, Apache Spark skills mean higher salaries. In my research on Indeed.com, I found that Spark-related data science roles pay about 20% more than data science roles overall, at least as measured by the median US base salary estimate in job postings on Indeed.com in May 2023.

Box plots showing how data scientists with Spark skills earn 20% more than all data scientists in May 2023 according to Indeed.com estimates

The boxes in the plot represent the interquartile range of salary estimates, that is, all salaries between the 25th and 75th percentiles. The median base salary for data science roles where Spark is mentioned is 130k US dollars. For all data science jobs, it is just 109k US dollars.
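To make the box plot terms concrete, here is a minimal sketch of how quartiles, the interquartile range, and the relative difference between the two medians can be computed. The `sample_salaries` list is made up for illustration and is not the Indeed.com data; only the two medians (130k and 109k) come from the text above.

```python
import statistics

# Hypothetical salary estimates in thousands of US dollars.
# These values are invented for illustration, NOT the Indeed.com data.
sample_salaries = [85, 95, 100, 105, 109, 115, 120, 130, 150]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(sample_salaries, n=4)
iqr = q3 - q1  # the height of the box in a box plot

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")

# Relative difference between the two medians reported in the article.
spark_median = 130  # data science roles that mention Spark
all_median = 109    # all data science roles
premium_pct = (spark_median - all_median) / all_median * 100
print(f"Spark premium: {premium_pct:.1f}%")  # 19.3%, i.e. roughly 20%
```

Half of all observations fall inside the box between Q1 and Q3, which is why a narrower box indicates that salaries cluster more tightly around the median.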

But this is not the only interesting thing about this statistic: The salary range for Spark-savvy data scientists is higher and narrower. From the plot, it is easy to see that data scientists with Spark skills catapult their earnings potential into a relatively high range compared to the overall data science population.

So, Spark skills pay off!


Spark Becomes Part of State-of-the-Art AI Solutions

Large library with many books with computers in the front demonstrating how Apache Spark is used for large language model use cases today
Spark is used in the context of Large Language Models (LLM)

A step change in NLP has recently kicked off the era of large language models. So, what does Apache Spark have to do with it? It turns out: Spark is running an important part of the party!

Flipping through the agenda of this year’s Data+AI Summit by Databricks, I was surprised to see a talk about “Scaling AI Applications with Databricks, Hugging Face and Pinecone”. Here, Roie Schwaber-Cohen, a Staff Developer Advocate at Pinecone, will talk about how Spark, as part of a dedicated tech stack, enables the efficient processing of billions of vector embeddings in a distributed fashion.

Here, Spark’s distributed data processing capabilities shine, as parallelization “can save many hours of precious computation time and resources”, according to Pinecone. Spark can also leverage GPUs, which speeds up NLP use cases like semantic search and recommendation engines.
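The idea behind this kind of parallelization can be sketched in plain Python: split the documents into partitions and let each worker embed its partition independently. This is only a conceptual illustration on a single machine using threads, not Spark or Pinecone code. The `toy_embed` function is a hypothetical stand-in for a real embedding model, and `partition` and `embed_partition` are helper names invented for this sketch. In Spark, a comparable pattern would typically be expressed with an operation such as `mapPartitions`, with the partitions spread across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def toy_embed(text, dim=4):
    """Hypothetical stand-in for an embedding model: hashes text into `dim` floats."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255 for i in range(dim)]


def partition(items, num_partitions):
    """Split items into roughly equal chunks, similar in spirit to Spark partitions."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def embed_partition(texts):
    """Embed one whole partition; each worker handles its chunk independently."""
    return [(text, toy_embed(text)) for text in texts]


documents = [f"document {i}" for i in range(10)]
chunks = partition(documents, num_partitions=3)

# Each partition is processed by its own worker; results are flattened afterwards.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = [pair for chunk in pool.map(embed_partition, chunks) for pair in chunk]

print(f"Embedded {len(results)} documents")
```

Because no partition depends on another, the work scales out by adding workers, which is the property that lets a cluster process billions of embeddings.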

Spark is also used for distributed training of deep learning models. Via the Horovod framework originally developed at Uber, Spark can train TensorFlow, Keras, or PyTorch models in a distributed fashion on a cluster of GPUs. At the upcoming Data+AI Summit, Raja Lanka of Morgan Stanley and Ryan Kennedy of Databricks will shed some light on how Spark on Databricks enabled customer engagement via deep learning at scale.

The use of Apache Spark in state-of-the-art machine learning business solutions shows that it is a relevant and valuable skill for mastering what is ahead in data science.


Spark Enables Future-Ready Real-Time Data Use Cases

Fast-running watch depicting the approach of fast-moving real-time data in the data world as predicted by McKinsey
Is your data science ready for real-time data?

QuantumBlack, AI by McKinsey estimates that as early as 2025, data-driven corporations will process and deliver data in real time. Think back on your career as a data scientist so far: How many models have you trained and continuously re-trained on real-time data? Have you ever worked with real-time data in your favorite tools, e.g. Pandas?

My point is: Based on my subjective assessment of the state of data science today, very few of us data scientists have had to deal with true live data. But, according to McKinsey, it is coming our way. Already today, data storage frameworks like Delta Lake natively support timestamped data that can be easily ingested into data science use cases. While the real-time wave is originating in the data engineering context, it is not unreasonable to assume that it will also create demand for data scientists with real-time data experience.

Where does Apache Spark come into the picture? Delta Lake is a storage layer that enables parallel data processing with high integrity on data lakes. This storage layer is built in tight integration with the Structured Streaming component of Apache Spark. Structured Streaming makes it possible to run typical data analysis (and data science) operations, like windowed mean functions, on real-time data. This means that integrating Apache Spark with Delta Lake makes future-proof machine learning on real-time data possible.
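To illustrate what a windowed mean does, here is a minimal sketch in plain Python that averages timestamped values per fixed 10-minute window. It is a conceptual illustration only, not Spark or Delta Lake code: the `windowed_mean` function, the window size, and the sample events are assumptions made for this example. In PySpark, a comparable aggregation would typically group by `pyspark.sql.functions.window` and apply `avg`, with Spark updating the result as new data arrives.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def windowed_mean(events, window=timedelta(minutes=10)):
    """Average event values per fixed-size (tumbling) time window."""
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Floor each timestamp to the start of the window it falls into.
        window_start = timestamp - (timestamp - epoch) % window
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical sensor readings as (event time, value) pairs.
events = [
    (datetime(2023, 5, 1, 12, 1), 10.0),
    (datetime(2023, 5, 1, 12, 7), 20.0),
    (datetime(2023, 5, 1, 12, 12), 30.0),
]

for start, mean in windowed_mean(events).items():
    print(start.isoformat(), mean)
```

The first two readings fall into the 12:00 window and the third into the 12:10 window. The sketch works on a finished list, whereas a streaming engine has to keep this state incrementally and decide how long to wait for late events.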

How exciting!


Data Scientists at the Crest of the Next Data Wave Benefit from Spark Skills

Data scientist surfing on a blue wave in the ocean
Is this a data scientist with Spark skills riding the next data wave?

When I compare what I see in today’s corporate data landscape with McKinsey’s projections, I can’t help but think that we data scientists are not yet ready to serve the new real-time world in which data is embedded in every decision. As corporations roll out updated data strategies over the next few years, I suspect that their implementation might be at risk due to yet another war for data talent.

According to Chip Huyen, who has investigated real-time machine learning across different industries, "Real-time machine learning is largely an infrastructure problem. Solving it will require the data science/ML team and the platform team to work together."

So, where do data scientists with Spark skills fit in? Based on the information I presented above, here is my guess:

As data teams work together to enable the high-value, state-of-the-art, real-time machine learning use cases of the future, data scientists with Spark skills will be the scarce resource building the bridge between data science and data engineering.

What do you think will be the future of data scientists with Apache Spark skills? I am looking forward to our discussion in the comments.


Learn Apache Spark and pass the Databricks Certified Associate Developer for Apache Spark Certification with this course

Do you want to get ready to provide highly rewarded value as a data scientist in the next data wave? I encourage you to check out my easy-to-understand, fun, and engaging Apache Spark course which will transform you into a big data professional:

Check out SparkCertCourse.com now!


#pyspark #dataengineering #datascience #bigdata