Fetching distinct values on a column using Spark DataFrame

Submitted anonymously (unverified) on 2019-12-03 08:36:05

Question:

Using Spark version 1.6.1, I need to fetch the distinct values of a column and then perform some specific transformation on each of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() brings the results back to the driver program. Currently I am performing this task as below; is there a better approach?

```scala
import sqlContext.implicits._

preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
  val applicationId = x.getAs[String](ApplicationId)
  val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
  // DO SOME TASK PER applicationId
})

preProcessedData.unpersist()
```

Answer 1:

Well, to obtain all the different values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.

For example:

```scala
val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all the distinct values. If you show() this, you should see only {1, 3}.
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. Here it is a simple function (the parameter type must be
// annotated so Scala can infer the UDF's signature), but it can be as
// complicated as you need.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame.
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
```
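For intuition, the distinct-then-transform pipeline above can be mimicked on plain Scala collections, with no Spark cluster required. This is only an illustrative sketch: the `Vector` of tuples stands in for the example DataFrame, and ordinary `distinct`/`map` stand in for the DataFrame `distinct` and the UDF.

```scala
object DistinctTransformDemo {
  def main(args: Array[String]): Unit = {
    // The (age, salary) pairs from the example DataFrame.
    val rows = Vector((1, 2), (3, 4), (1, 6))

    // Equivalent of df.select(df("age")).distinct: keep each age once.
    val distinctAges = rows.map(_._1).distinct

    // Equivalent of applying the UDF (value: Int) => value / 10
    // to every distinct age (integer division).
    val transformed = distinctAges.map(_ / 10)

    println(distinctAges) // Vector(1, 3)
    println(transformed)  // Vector(0, 0)
  }
}
```

Note that, unlike this local sketch, the Spark version keeps the distinct computation distributed across the cluster until you explicitly collect or show the result.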

