Calling collect() on an RDD returns the entire dataset to the driver, which can cause an out-of-memory error, so it is best avoided on large data.
Will collect()
Calling select is lazily evaluated. For example:
val df1 = df.select("col1")
val df2 = df1.filter("col1 == 3")
Both statements above only build a lazy execution plan, which runs when you call an action on that DataFrame, such as show, collect, etc.
val rows = df2.collect() // action: runs the plan and returns an Array[Row] to the driver, not a DataFrame
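As a minimal sketch of the same pipeline end to end, assuming a local SparkSession and a small made-up DataFrame with one integer column `col1` (the data, the object name `LazyCollectSketch`, the helper `matchingCounts`, and the `local[*]` master are illustrative, not from the question). It also shows `take(n)` as a bounded alternative when only a few rows are needed on the driver.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object LazyCollectSketch {
  // Hypothetical helper: returns (rows fetched by take(1), rows fetched by collect()).
  def matchingCounts(spark: SparkSession): (Int, Int) = {
    import spark.implicits._

    // Assumed sample data; any DataFrame with a "col1" column would do.
    val df = Seq(1, 3, 3, 5).toDF("col1")

    // Transformations: these only describe the computation, no job runs yet.
    val df1 = df.select("col1")
    val df2 = df1.filter("col1 == 3")

    // Actions: each of these triggers execution of the plan.
    val firstRows: Array[Row] = df2.take(1)   // at most 1 row reaches the driver
    val allRows: Array[Row] = df2.collect()   // every matching row reaches the driver

    (firstRows.length, allRows.length)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-collect-sketch")
      .master("local[*]")
      .getOrCreate()

    val (taken, collected) = matchingCounts(spark)
    println(s"take(1) returned $taken row(s), collect() returned $collected row(s)")

    spark.stop()
  }
}
```

If the result is too large for the driver, writing it out (for example with `df2.write`) or aggregating it first keeps the data on the executors instead.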
Call .explain() at the end of your chain of transformations to print its execution plan.
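A short sketch of inspecting the plan, again assuming a local SparkSession and made-up sample data (the object name `ExplainSketch` and the helper `physicalPlan` are illustrative). `explain()` prints the physical plan, and `explain(true)` also prints the logical plans; neither one runs the query.

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  // Hypothetical helper: builds the pipeline and returns its physical plan as text.
  def physicalPlan(spark: SparkSession): String = {
    import spark.implicits._

    // Assumed sample data with a single "col1" column.
    val df2 = Seq(1, 3, 3, 5).toDF("col1")
      .select("col1")
      .filter("col1 == 3")

    df2.explain()      // prints the physical plan to stdout
    df2.explain(true)  // prints parsed, analyzed, and optimized logical plans plus the physical plan

    df2.queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-sketch")
      .master("local[*]")
      .getOrCreate()

    physicalPlan(spark)
    spark.stop()
  }
}
```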
For more detail, see Transformations and Actions.