How Learning Apache Spark Makes You a Better Data Scientist

Florian Roscheck

Sr. Data Scientist at Henkel | Teams. Data. Science. Products. | Boosting business through data science for the sustainable good of people.

Published: June 7, 2023

Knowledge about Apache Spark is in particularly high demand among data engineers. That, at least, is my experience from teaching over 5,000 students how to pass the popular Databricks Spark Certification. But data engineers are not the only ones who can benefit from knowing their way around Apache Spark!

As a data scientist myself, I know that especially today and in the future, data scientists can benefit immensely from learning Apache Spark!

Read on to learn how you, as a data scientist, can benefit from leveraging one of today’s most powerful big-data processing frameworks!

Data Scientists with Spark Skills Earn More

For data scientists, Apache Spark skills mean higher salaries. In my research on Indeed.com, I found that Spark-related data science roles pay about 20% more than data science roles overall, at least as measured by the median US base salary estimate in job postings on Indeed.com in May 2023.

Box plots showing how data scientists with spark skills earn 20% more than all data scientists in May 2023 according to Indeed.com estimates

The boxes in the plot represent the interquartile range of salary estimates, that is, the middle 50% of estimates between the 25th and 75th percentiles. The median base salary for data science roles where Spark is mentioned is 130k US Dollars. For all data science jobs, it is just 109k US Dollars.
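
To make the box-plot reading concrete, here is a minimal Python sketch of how the box edges and the relative difference between two medians can be computed. The helper names `box_stats` and `relative_premium` and the sample list are hypothetical and only serve as illustration; the two medians (130k and 109k US Dollars) are the only numbers taken from the figures above.

```python
import statistics


def box_stats(salaries):
    """Return (q1, median, q3): the lower box edge, center line, and upper box edge."""
    q1, median, q3 = statistics.quantiles(salaries, n=4)
    return q1, median, q3


def relative_premium(median_a, median_b):
    """Percentage by which median_a exceeds median_b."""
    return (median_a - median_b) / median_b * 100


# Hypothetical salary estimates in thousand US Dollars, for illustration only.
sample = [95, 105, 118, 130, 141, 150, 162]
q1, median, q3 = box_stats(sample)
print(f"Box spans {q1} to {q3}, median {median}")

# Medians reported above: 130k (Spark roles) vs. 109k (all data science roles).
print(f"Premium: {relative_premium(130, 109):.1f}%")
```

With the two reported medians, the premium works out to roughly 19.3%, which rounds to the 20% quoted above.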

But this is not the only interesting thing about this statistic: The salary range for Spark-savvy data scientists is higher and narrower. From the plot, it is easy to see that data scientists with Spark skills catapult their earnings potential into a relatively high range compared to the overall data science population.

So, Spark skills pay off!

Spark Becomes Part of State-of-the-Art AI Solutions

Large library with many books with computers in the front demonstrating how Apache Spark is used for large language model use cases today — Spark is used in the context of Large Language Models (LLM)

A step change in NLP has recently kicked off the era of large language models. So, what does Apache Spark have to do with it? It turns out: Spark is running an important part of the party!

Flipping through the agenda of this year’s Data+AI Summit by Databricks, I was surprised to see a talk about “Scaling AI Applications with Databricks, Hugging Face and Pinecone”. Here, Roie Schwaber-Cohen, a Staff Developer Advocate at Pinecone, will talk about how Spark, as part of a dedicated tech stack, enables the efficient processing of billions of vector embeddings in a distributed fashion.

Here, Spark’s distributed data processing skills shine, as parallelization “can save many hours of precious computation time and resources”, according to Pinecone. Spark also leverages GPUs, enabling rapid unlocking of NLP use cases like semantic search and recommendation engines.

Spark is also used for distributed training of deep learning models. Via the Horovod framework originally developed at Uber, Spark can train TensorFlow, Keras, or PyTorch models in a distributed fashion on a cluster of GPUs. At the upcoming Data+AI Summit, Raja Lanka of Morgan Stanley and Ryan Kennedy of Databricks will shed some light on how Spark on Databricks enabled customer engagement via deep learning at scale.

The use of Apache Spark in state-of-the-art machine learning business solutions shows that it is a relevant and valuable skill for mastering what is ahead in data science.

Spark Enables Future-Ready Real-Time Data Use Cases

Fast-running watch depicting the approach of fast-moving real-time data in the data world as predicted by McKinsey — Is your data science ready for real-time data?

QuantumBlack, AI by McKinsey estimates that as early as 2025, data-driven corporations will process and deliver data in real time. Think back on your career as a data scientist so far: How many models have you trained and continuously re-trained on real-time data? Have you ever worked with real-time data in your favorite tools, e.g. Pandas?

My point is: Based on my subjective assessment of the state of data science today, very few of us data scientists have had to deal with true live data. But, according to McKinsey, it is coming our way. Already today, data storage frameworks like Delta Lake natively support timestamped data that can be easily digested into data science use cases. While the real-time wave is originating in the data engineering context, it is not unreasonable to assume that it will trigger additional demand for real-time data experience for data scientists.

Where does Apache Spark come into the picture? Delta Lake is a storage layer that enables parallel data processing with high integrity on data lakes. This storage layer is built in tight integration with the structured streaming component of Apache Spark. Spark’s structured streaming component enables running typical data analysis (and data science) operations, like windowed mean functions, on real-time data. This means that integrating Apache Spark with Delta Lake makes future-proof machine learning on real-time data possible.
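
To illustrate what a windowed mean over timestamped data computes, here is a minimal plain-Python sketch of a tumbling (fixed-size, non-overlapping) window average. It is not Spark code: the function name `tumbling_window_mean` and the sample readings are hypothetical, and the sketch processes a small in-memory list, whereas Spark would evaluate the same kind of aggregation incrementally and in a distributed fashion. In PySpark, this grouping is typically expressed with the `window` function from `pyspark.sql.functions`.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window_mean(events, window):
    """Average values per fixed-size, non-overlapping time window.

    events: iterable of (timestamp, value) pairs, in any order.
    window: window length as a timedelta.
    Returns a dict mapping each window's start time to the mean of its values.
    """
    totals = defaultdict(lambda: [0.0, 0])  # window start -> [sum, count]
    for timestamp, value in events:
        # Align the timestamp to the start of the window it falls into.
        window_start = timestamp - (timestamp - EPOCH) % window
        totals[window_start][0] += value
        totals[window_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(totals.items())}


# Hypothetical sensor readings, for illustration only.
readings = [
    (datetime(2023, 6, 7, 12, 0, 10), 10.0),
    (datetime(2023, 6, 7, 12, 0, 50), 20.0),
    (datetime(2023, 6, 7, 12, 1, 5), 30.0),
]
for start, mean in tumbling_window_mean(readings, timedelta(minutes=1)).items():
    print(start, mean)
```

The first two readings fall into the window starting at 12:00 and are averaged together, while the third lands in the window starting at 12:01.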

How exciting!

Data Scientists at the Crest of the Next Data Wave Benefit from Spark Skills

Data scientist surfing on a blue wave in the ocean — Is this a data scientist with Spark skills riding the next data wave?

When I reflect on what I see in today’s corporate data landscape against McKinsey’s projections, I can’t help but think that as data scientists, we are not yet ready to serve the new real-time world in which data is embedded in every decision. As updated corporate data strategies unfold over the next few years, I suspect that their implementation might be at risk due to yet another war for data talent.

According to Chip Huyen, who has investigated real-time machine learning across different industries, "Real-time machine learning is largely an infrastructure problem. Solving it will require the data science/ML team and the platform team to work together."

So, where do data scientists with Spark skills fit in? Based on the information I presented above, here is my guess:

As data teams work together to enable the high-value, state-of-the-art, real-time machine learning use cases of the future, data scientists with Spark skills will be the scarce resource building the bridge between data science and data engineering.

What do you think will be the future of data scientists with Apache Spark skills? I am looking forward to our discussion in the comments.

Learn Apache Spark and pass the Databricks Certified Associate Developer for Apache Spark Certification with this course

Do you want to get ready to provide highly rewarded value as a data scientist in the next data wave? I encourage you to check out my easy-to-understand, fun, and engaging Apache Spark course which will transform you into a big data professional:

Check out SparkCertCourse.com now!

#pyspark #dataengineering #datascience #bigdata
