What's New in the Second Decade of Spark?
My company, Databricks, got its start by taking Spark from an academic research effort to an enterprise-grade product. Spark solves the problem of data analytics at scale: it lets you work with huge datasets across data engineering, data science, and machine learning tasks. With Spark, these workloads are no longer limited to a single machine; they can run on large clusters of machines to achieve impressive performance at scale.
Spark is more than 10 years old and has gone through some major revisions over the years. Spark usage grew rapidly as Hadoop users adopted it to accelerate their analytics. Now we see those workloads moving from on-prem systems to the cloud, and Spark is stronger than ever. Last year Spark 3.0 was announced, and last month, 3.2. What's new in Spark 3? Some features are visible to users, and some are 'under the hood'. I'd like to briefly share my top three, covering dataframes, SQL, and ease of use.
Python is the most popular language for data scientists, and with Spark 3.0, Python became the most widely used language on Spark. Data scientists are comfortable working with dataframes for processing structured data, and on smaller datasets it is common to use tools such as R, or pandas in Python. With Spark 3.2, the pandas API has been added, and Python users can now scale out their pandas applications on Spark by changing just one line of code.
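To make that concrete, here is a minimal sketch of the one-line change, assuming a simple pandas workload and a hypothetical CSV file (the pandas API on Spark ships as the pyspark.pandas module in Spark 3.2):

    # Before: plain pandas, limited to a single machine
    # import pandas as pd

    # After: the pandas API on Spark; the rest of the code is unchanged
    import pyspark.pandas as pd

    # "events.csv" is a hypothetical file, used only for illustration
    df = pd.read_csv("events.csv")
    totals = df.groupby("user_id")["amount"].sum()
    print(totals.head())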
Spark SQL was the single component with the most changes and performance improvements in the Spark 3.0 release. While many data systems support some form of SQL, these platforms often use custom syntax, especially the Hadoop-based SQL tools, and those differences made it hard to migrate code from one platform to another. In Spark 3.2, ANSI SQL mode is now generally available, making it easier to bring over code written for other environments. See more here: https://databricks.com/blog/2021/10/19/introducing-apache-spark-3-2.html
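As a quick, hedged illustration of what ANSI mode changes in practice (assuming a PySpark session; the relevant setting is spark.sql.ansi.enabled):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ansi-demo").getOrCreate()

    # Enable ANSI SQL mode (off by default)
    spark.conf.set("spark.sql.ansi.enabled", "true")

    # Under ANSI mode, invalid operations fail loudly instead of silently
    # returning NULL; for example, casting a non-numeric string to INT
    # raises an error rather than producing a null value:
    spark.sql("SELECT CAST('abc' AS INT)").show()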
Finally, in Spark 3, performance can improve with no changes at all to your code. To achieve scalable computing, Spark is a distributed system: data is divided among many machines that share the computational workload. But what if the data is lumpy, and the division strategy leaves some machines with large pieces while others get small ones? In the past, tuning a Spark query might involve analyzing the skew in the data and manually repartitioning it across the machines. With Adaptive Query Execution (AQE), this is a thing of the past. At a high level, AQE dynamically assesses the data a Spark query is processing and, as needed, splits up the data on overloaded machines while combining small partitions on others, producing a more balanced load. The resulting improvement in execution time does not require the manual tuning effort seen with earlier versions of Spark. In addition, the query optimizer can change the execution plan based on runtime statistics, switching join strategies or dynamically optimizing skew joins. See more here: https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
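For reference, AQE and its skew handling are controlled by a few configuration flags. The sketch below assumes a PySpark session and simply shows the relevant settings, which are already on by default in Spark 3.2:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

    # Adaptive Query Execution re-plans queries at runtime using actual statistics
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Coalesce many small shuffle partitions into fewer, larger ones
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split oversized partitions in skewed joins so no single machine is overloaded
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")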
Stay tuned for more improvements. Spark is a large and active open source project with hundreds of developers from hundreds of organizations, and in its second decade it will keep getting better! To see more please visit spark.apache.org.