Spark Transformations

A transformation is an operation that converts your RDD data from one form to another in Spark. Because RDDs in Spark are immutable, applying a transformation to an RDD gives you back a new RDD containing the transformed data.

  • Transformations are the core of how you express your business logic in Spark: each one is an intermediate operation that converts one RDD into another, and together they build up the DAG (directed acyclic graph) of the job.
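To make this concrete, here is a minimal sketch (assuming an existing SparkContext named sc, for example from spark-shell; the variable names are only illustrative) showing that a transformation returns a new RDD and is evaluated lazily:

```scala
// Assumes an existing SparkContext `sc` (e.g. from spark-shell).
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// map() is a transformation: it returns a *new* RDD and leaves the input untouched.
val doubled = numbers.map(_ * 2)

// Nothing has run yet -- transformations are lazy and only extend the DAG.
// An action such as collect() triggers the actual computation.
doubled.collect()   // Array(2, 4, 6, 8, 10)
numbers.collect()   // Array(1, 2, 3, 4, 5) -- the original RDD is unchanged
```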

There are two types of transformations in Spark:

  • Narrow Transformations
  • Wide Transformations

Narrow Transformations:

These transformations map each input partition to exactly one output partition: each partition of the parent RDD is used by at most one partition of the child RDD, so every child partition depends on a single parent partition.

  • This kind of transformation is generally fast.
  • It does not require shuffling data over the cluster network, so there is no data movement between nodes.
  • Operations such as map() and filter() are narrow transformations (see the sketch below).
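Here is a small sketch of two narrow transformations, again assuming an existing SparkContext sc; the partition count and data are made up for illustration:

```scala
// Assumes an existing SparkContext `sc`.
val nums = sc.parallelize(1 to 10, numSlices = 4)   // 4 input partitions

// map() and filter() are narrow: each output partition is computed from exactly
// one parent partition, so no shuffle is required.
val squares = nums.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

evens.getNumPartitions   // still 4 -- the partitioning is preserved
evens.collect()          // Array(4, 16, 36, 64, 100)
```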

Wide Transformations:

In these transformations, a single input partition can contribute to many output partitions: a partition of the parent RDD may be used by multiple partitions of the child RDD, so a child partition can depend on several parent partitions.

  • Slower than narrow transformations; performance can be significantly affected because data may need to be shuffled between nodes to create the new partitions.
  • May require shuffling data over the cluster network.
  • Functions such as groupByKey(), aggregateByKey(), join(), and repartition() are examples of wide transformations (see the sketch below).
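For comparison, here is a sketch of wide transformations on the same assumed SparkContext sc; the key/value data is invented for the example, and reduceByKey() is shown as an additional shuffle-based alternative to groupByKey():

```scala
// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), numSlices = 2)

// groupByKey() is wide: all values for a key must end up in the same partition,
// so data is shuffled across the cluster network.
val grouped = pairs.groupByKey()

// reduceByKey() also shuffles, but combines values on the map side first,
// so it typically moves less data than groupByKey().
val sums = pairs.reduceByKey(_ + _)

sums.collect()   // e.g. Array(("a", 4), ("b", 6))
```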

#apachespark #spark #bigdata

