Using Spark 1.6.1, I need to fetch the distinct values of a column and then perform a specific transformation for each of them. The column contains more than 50 million records and can grow larger.
I understand that calling distinct.collect() brings all of the distinct values back to the driver program. Currently I am performing this task as shown below. Is there a better approach?
    import sqlContext.implicits._

    preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

    preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
      val applicationId = x.getAs[String](ApplicationId)
      val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
      // DO SOME TASK PER applicationId
    })

    preProcessedData.unpersist()
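
For reference, here is a rough sketch of one variation on the loop above. It is not a confirmed improvement. It streams the distinct keys to the driver one partition at a time with `RDD.toLocalIterator` instead of materializing them all at once with `collect()`. The wrapper function `processPerApplication` and its parameter `applicationIdCol` are hypothetical names introduced only for illustration, and the sketch assumes the column holds non-null strings. It only limits driver memory use for the key list; it still launches one filter per distinct value.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: applicationIdCol is the name of the ID column
// (the ApplicationId value in the snippet above).
def processPerApplication(preProcessedData: DataFrame, applicationIdCol: String): Unit = {
  preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)
  try {
    // toLocalIterator pulls one partition of distinct keys to the driver
    // at a time, rather than the whole array that collect() would return.
    val applicationIds = preProcessedData
      .select(applicationIdCol)
      .distinct()
      .rdd
      .map(_.getString(0)) // assumption: the column is a non-null String
      .toLocalIterator

    applicationIds.foreach { applicationId =>
      val selectedApplicationData =
        preProcessedData.filter(preProcessedData(applicationIdCol) === applicationId)
      // DO SOME TASK PER applicationId
    }
  } finally {
    // Release the cached blocks even if a per-ID task throws.
    preProcessedData.unpersist()
  }
}
```

Whether this helps depends on how many distinct values there are. If the per-ID task can be expressed as an aggregation, a `groupBy` on the column would probably avoid the driver-side loop altogether.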