Calling collect() on an RDD returns the entire dataset to the driver, which can cause an out-of-memory error, so we should avoid it.
Will collect()
Short answer, in bold:
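collect mainly serves to serialize the data: it brings every row to the driver as a local array (a loss of parallelism that preserves all other data characteristics of the DataFrame). For example, with a PrintWriter pw you can't call df.foreach( r => pw.write(r) ) directly, because that closure runs on the executors and a PrintWriter is not serializable; you must call collect before foreach, as in df.collect.foreach(etc).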
PS: the "loss of parallelism" is not a "total loss" because after serialization it can be distributed again to executors.
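The PrintWriter case above can be sketched as follows. This is only a minimal illustration, assuming a local SparkSession and a tiny made-up DataFrame; the object name CollectThenWrite, the column names, and the path out.txt are hypothetical.

```scala
import java.io.PrintWriter
import org.apache.spark.sql.SparkSession

object CollectThenWrite {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]").appName("collect-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical toy data standing in for the real DataFrame.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    val pw = new PrintWriter("out.txt") // hypothetical output path
    try {
      // Passing a closure that uses pw to df.foreach would ship it to the
      // executors, and PrintWriter is not serializable.
      // collect() returns an Array[Row] on the driver, so the foreach below
      // is an ordinary local loop.
      df.collect().foreach(r => pw.println(r.mkString(",")))
    } finally {
      pw.close()
    }
    spark.stop()
  }
}
```

This only makes sense when the result fits in the driver's memory; for larger results, Spark's own writers (for example df.write.csv) avoid the collect altogether.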
select mainly serves to select columns, similar to projection in relational algebra (only similar in the framework's context, because Spark's select does not deduplicate data).
So, in the framework's context, it is also a complement of filter: select picks columns, while filter picks rows.
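To make the difference between select, a true relational projection, and filter concrete, here is a small sketch. It assumes a local SparkSession and a made-up three-row DataFrame; the object name SelectSketch, the helper demo, and the column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SelectSketch {
  // Returns (rows after select, rows after select + distinct, columns after filter).
  def demo(spark: SparkSession): (Long, Long, String) = {
    import spark.implicits._

    // Hypothetical toy data: the label "a" appears twice.
    val df = Seq((1, "a"), (2, "a"), (3, "b")).toDF("id", "label")

    val projected    = df.select("label")   // column chosen, duplicates kept
    val deduplicated = projected.distinct() // what a set-based projection would give
    val filtered     = df.filter($"id" > 1) // rows chosen, all columns kept

    (projected.count(), deduplicated.count(), filtered.columns.mkString(","))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]").appName("select-sketch").getOrCreate()
    println(demo(spark)) // prints (3,2,id,label)
    spark.stop()
  }
}
```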
Commenting on the explanations in other answers: I like Jeff's classification of Spark operations into transformations (such as select) and actions (such as collect). It is also good to remember that transformations (including select) are lazily evaluated.
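The lazy evaluation mentioned above can be observed with a rough sketch like the one below. It assumes a local SparkSession and counts UDF invocations with a LongAccumulator; the object name LazySketch, the helper demo, and the toy data are hypothetical, and the exact count after collect is deliberately not asserted, since Spark does not guarantee how many times a UDF is invoked.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object LazySketch {
  // Returns the UDF call count observed (before collect, after collect).
  def demo(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // Hypothetical toy data.
    val df = Seq("a", "b", "c").toDF("label")

    // Counts how often the UDF body actually runs.
    val calls = spark.sparkContext.longAccumulator("udfCalls")
    val upper = udf { (s: String) => calls.add(1); s.toUpperCase }

    // Transformation: only builds a plan, so the UDF has not run yet.
    val selected = df.select(upper($"label").as("upper"))
    val before = calls.sum // expected to still be 0 here

    // Action: evaluates the plan and returns Array[Row] to the driver.
    selected.collect()
    val after = calls.sum // greater than 0 once the UDF has run

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]").appName("lazy-sketch").getOrCreate()
    println(demo(spark))
    spark.stop()
  }
}
```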