Spark DataFrame: collect() vs select()

情话喂你 2020-12-13 06:33

Calling collect() on an RDD returns the entire dataset to the driver, which can cause out-of-memory errors, so it should be avoided.

Will collect()

6 answers
  •  伪装坚强ぢ
    2020-12-13 07:19

Short answer in bold:

    • collect is mainly to serialize
      (losing parallelism while preserving all other data characteristics of the DataFrame).
      For example, with a PrintWriter pw you can't write df.foreach( r => pw.write(r) ) directly; you must collect before foreach: df.collect.foreach( r => pw.write(r) ).
      PS: the "loss of parallelism" is not a total loss, because after serialization the data can be distributed to executors again.

    • select is mainly to select columns, similar to projection in relational algebra
      (only similar in the framework's context, because Spark's select does not deduplicate data).
      So it is also a complement of filter in the framework's context.


    Commenting on the other answers: I like Jeff's classification of Spark operations into transformations (such as select) and actions (such as collect). It is also good to remember that transformations (including select) are lazily evaluated.
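
    The laziness can be seen directly: a select builds only a logical plan, and nothing is computed until an action such as collect runs. A minimal sketch (names are illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession

    object LazyDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("lazy").getOrCreate()
        import spark.implicits._
        val df = Seq(1, 2, 3).toDF("n")

        // Transformation: nothing is executed yet, only a logical plan is built.
        val doubled = df.select(($"n" * 2).as("n2"))

        // Action: triggers execution of the whole plan and returns rows to the driver.
        val total = doubled.collect().map(_.getInt(0)).sum
        println(total)
        spark.stop()
      }
    }
    ```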
