Want to learn more about #pyspark #memory profiling? Have questions about PySpark #UDFs, profiling a UDF's hot loops, or profiling a UDF's memory usage? This #AMA is a follow-up to the popular post How to Profile PySpark https://lnkd.in/gQ2MzXG5 #apachespark
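A minimal sketch of how the UDF memory profiler can be switched on, assuming Spark 3.4+ with the memory-profiler and pyarrow packages available on the cluster. The app name, the plus_one UDF, and the sample data are illustrative, not from the post; the linked article remains the authoritative walkthrough.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# Assumption: the memory profiler config must be set before the context starts.
spark = (
    SparkSession.builder
    .appName("udf-memory-profiling")
    .config("spark.python.profile.memory", "true")
    .getOrCreate()
)

# A toy pandas UDF to profile (hypothetical example).
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(100).select(plus_one("id").alias("id_plus_one")).collect()

# Print the per-UDF, line-by-line memory usage collected from the executors.
spark.sparkContext.show_profiles()
```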
About us
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Key Features
- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R.
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

The most widely-used engine for scalable computing: thousands of companies, including 80% of the Fortune 500, use Apache Spark™, with over 2,000 contributors to the open source project from industry and academia.

Ecosystem: Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
- Website
- https://spark.apache.org/
- Industry
- Technology, Information and Internet
- Company size
- 51-200 employees
- Headquarters
- Berkeley, CA
- Type
- Nonprofit
- Specialties
- Apache Spark, Big Data, Machine Learning, SQL Analytics, Batch, and Streaming
Locations
- Primary
- Berkeley, CA, US
Updates
-
Spark lets you accelerate count-distinct operations with the HyperLogLog algorithm. Computing the number of distinct items in a column can be an expensive operation because many rows of data must be scanned. The HyperLogLog algorithm stores a compact summary of a column's distinct values in a sketch, which allows for a quick approximate distinct count. HyperLogLog isn't precise: it sacrifices some accuracy for much more speed. HyperLogLog sketches can be unioned with other sketches, so they're useful for incrementally updated pipelines in production data settings. PySpark 3.5+ supports HLL sketch functions out of the box.
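A minimal sketch of that workflow, assuming PySpark 3.5+ where hll_sketch_agg, hll_union_agg, and hll_sketch_estimate are available; the two batch DataFrames and the user column are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hll-demo").getOrCreate()

batch1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "user"])
batch2 = spark.createDataFrame([(4, "c"), (5, "d")], ["id", "user"])

# Aggregate each batch into a single HLL sketch (a compact binary column).
sketch1 = batch1.agg(F.hll_sketch_agg("user").alias("sketch"))
sketch2 = batch2.agg(F.hll_sketch_agg("user").alias("sketch"))

# Union the sketches, then estimate the approximate distinct count.
merged = sketch1.unionAll(sketch2).agg(F.hll_union_agg("sketch").alias("sketch"))
merged.select(F.hll_sketch_estimate("sketch").alias("approx_distinct_users")).show()
```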
-
It's best to avoid Spark User Defined Functions (UDFs) when you can simply use Spark native functions. The following example shows a poorly defined UDF that blows up with NULL input. You always want to handle the NULL case in your UDF. The second snippet shows how to define a UDF that gracefully handles the NULL case, which is an improvement. However, the third snippet shows that you don't need a UDF in this case. It's easy to append to a string column using native Spark functions. Native Spark functions gracefully handle the NULL case and can be optimized. UDFs are a black box from the optimizer's perspective, so they're harder to optimize.
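A minimal sketch of the three approaches described above; the place column, the " is fun!" suffix, and the function names are illustrative stand-ins for the snippets in the original post.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("cali",), ("colombia",), (None,)], ["place"])

# 1. Fragile UDF: raises a TypeError on NULL input (None + str fails).
@F.udf(StringType())
def bad_append(s):
    return s + " is fun!"

# 2. NULL-safe UDF: better, but still a black box to the optimizer.
@F.udf(StringType())
def safe_append(s):
    return s + " is fun!" if s is not None else None

# 3. Native functions: NULL-safe by default and optimizable by Catalyst.
df.withColumn("greeting", F.concat(F.col("place"), F.lit(" is fun!"))).show()
```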
-
Spark makes it simple to snake_case all of the columns in a DataFrame. Just create a function that downcases the string and replaces spaces with underscores, then apply it to every column name. You could also run withColumn() in a loop, but it's more performant to run a single select(). PySpark allows for some really nice abstractions.
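A minimal sketch of that approach; the to_snake_case helper and the sample column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "Alice")], ["Some ID", "First Name"])

def to_snake_case(name: str) -> str:
    # Downcase and replace spaces with underscores.
    return name.lower().replace(" ", "_")

# A single select() renames every column in one pass.
snaked = df.select([df[c].alias(to_snake_case(c)) for c in df.columns])
snaked.printSchema()  # some_id, first_name
```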
-
You can easily single-space a string with PySpark using the built-in trim and regexp_replace functions. The following example shows a column with irregular spacing and a single_space function that cleans up the whitespace. It's nice to wrap code in functions so it's more readable and easier to unit test. PySpark allows for clean code!
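A minimal sketch of such a single_space helper, assuming the intent is to collapse runs of whitespace into one space and trim the ends; the sample data is made up.

```python
from pyspark.sql import SparkSession, Column
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def single_space(col: Column) -> Column:
    # Collapse any run of whitespace to a single space, then trim the ends.
    return F.trim(F.regexp_replace(col, r"\s+", " "))

df = spark.createDataFrame([("  hi   there    friend ",), (None,)], ["words"])
df.withColumn("words_single_spaced", single_space(F.col("words"))).show(truncate=False)
```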
-
It's straightforward to run PySpark in Jupyter notebooks. Start by creating and activating a virtual environment with PySpark installed. Then create a SparkSession and you can run any Spark code. For example, you can easily create a Parquet table and read the table contents. It's a good idea to set the log level to ERROR to limit the WARN-level terminal output. You can even use the JupyterLab Desktop application to run your Spark code in a standalone desktop app!
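A minimal sketch of that notebook workflow, assuming the Jupyter kernel runs inside a virtual environment where pyspark has been pip-installed; the table path and sample rows are illustrative.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session inside the notebook.
spark = SparkSession.builder.appName("jupyter-pyspark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # hide the WARN-level chatter

# Write a small Parquet table and read its contents back.
df = spark.createDataFrame([(1, "ham"), (2, "spam")], ["id", "food"])
df.write.mode("overwrite").parquet("/tmp/demo_parquet")
spark.read.parquet("/tmp/demo_parquet").show()
```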