Course: Apache PySpark by Example
Partitions, transformations, lazy evaluations, and actions
- [Instructor] Earlier, we talked about how Spark is a distributed system. This means that if we want the workers to work in parallel, Spark needs to break the data into chunks, or partitions. A partition is a collection of rows from your data frame that sits on one machine in your cluster. So a data frame's partitioning is how the data is physically distributed across the cluster of machines during execution. Because you're working with a high-level API when using data frames, you don't normally get involved with manipulating the partitions manually. Just so you know, if you only have one partition, Spark can't parallelize jobs even if you have a cluster of machines available. In the same way, if you have several partitions but only one worker, Spark can't parallelize jobs, as there's only one resource that can do the computation. Data frames are a core data structure in Spark, and are immutable. Immutable is just a fancy way of saying they can't be changed once they've been…