Spark DataFrame: collect() vs select()

情话喂你 2020-12-13 06:33

Calling collect() on an RDD returns the entire dataset to the driver, which can cause an out-of-memory error, so it should be avoided.

Will collect()

6 answers
  •  青春惊慌失措
    2020-12-13 07:13

    Calling select results in lazy evaluation. For example:

    val df1 = df.select("col1")
    val df2 = df1.filter("col1 == 3")
    

    Both statements above only build a lazy plan. It is executed when you call an action on that DataFrame, such as show or collect.

    val rows = df2.collect() // action: returns Array[Row], materialized on the driver
    

    Use .explain at the end of your transformation chain to inspect its plan. More detailed information is available in Transformations and Actions.
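
    To make the distinction concrete, here is a minimal sketch that puts the pieces together. It assumes Spark SQL is on the classpath and runs in local mode; the object name, the app name, and the sample data are made up for illustration. It also shows take(n), which is an action like collect but brings back at most n rows, as one possible way to limit what reaches the driver.

    ```scala
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}

    object CollectVsSelect {
      def main(args: Array[String]): Unit = {
        // Local session for illustration only; app name and master are assumptions.
        val spark = SparkSession.builder()
          .appName("collect-vs-select")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical sample data with a single column named col1.
        val df: DataFrame = Seq(1, 2, 3, 3).toDF("col1")

        // Transformations: these only build a plan, no job runs here.
        val df1 = df.select("col1")
        val df2 = df1.filter("col1 == 3")

        // Prints the physical plan without executing the query.
        df2.explain()

        // Actions: these trigger execution of the plan.
        val rows: Array[Row] = df2.collect()   // every matching row is sent to the driver
        val firstTwo: Array[Row] = df2.take(2) // at most 2 rows are sent to the driver

        println(s"collected ${rows.length} rows, took ${firstTwo.length} rows")

        spark.stop()
      }
    }
    ```

    On a large dataset, the select and filter lines cost almost nothing on their own; the memory risk on the driver comes from the collect call, and it grows with the number of rows that survive the filter.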
