Get the distinct elements of an ArrayType column in a Spark DataFrame

Posted by 空扰寡人 on 2019-12-14 01:10:38

Question


I have a DataFrame with 3 columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]

What is the best way to do this in Scala?


Answer 1:


You can use collect_set to find the distinct values of each column after applying explode to unnest the array elements in each cell. Suppose your DataFrame is called df:

import org.apache.spark.sql.functions._

val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"), 
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
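One caveat: explode drops rows whose array is empty, so with the sample data above (where row 1 has feat2 = []), chaining two explode calls would discard row 1 entirely and lose its feat1 values; the empty entry visible in the output suggests the data actually contained an empty string rather than an empty array. A sketch that avoids this, assuming the same df, aggregates each column independently and recombines the two one-row results:

```scala
import org.apache.spark.sql.functions._

// Explode each column separately so an empty array in one column
// cannot drop a row's values from the other column's distinct set.
val distinctFeat1 = df.select(explode(col("feat1")).alias("f"))
                      .agg(collect_set("f").alias("distinct_feat1"))
val distinctFeat2 = df.select(explode(col("feat2")).alias("f"))
                      .agg(collect_set("f").alias("distinct_feat2"))

// Both intermediates have exactly one row, so a cross join is cheap here.
val distinct_df = distinctFeat1.crossJoin(distinctFeat2)
```

This also avoids the per-row cross product that exploding two array columns in the same chain produces (harmless for collect_set, but wasted work on wide arrays).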



Answer 2:


The method provided by Psidom works great. Here is a Python (PySpark) function that does the same thing, given a DataFrame and a list of fields:

def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])

And then:

data = array_unique_values(df, my_fields)
data.take(1)
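As a sanity check of the set-union logic itself (independent of Spark), the same distinct-per-field computation can be sketched in plain Python. The rows list below is an assumed stand-in for the DataFrame, with dicts holding list-valued fields:

```python
# Plain-Python sketch of what explode + collect_set computes per field.
# Illustrative only; does not use Spark.
rows = [
    {"id": 1, "feat1": ["feat1_1", "feat1_2", "feat1_3"], "feat2": []},
    {"id": 2, "feat1": ["feat1_2"], "feat2": ["feat2_1", "feat2_2"]},
    {"id": 3, "feat1": ["feat1_4"], "feat2": ["feat2_3"]},
]

def distinct_values(rows, fields):
    # For each field, union the array elements across all rows
    # (explode), then deduplicate (collect_set). Sorted for a
    # deterministic result; collect_set gives no ordering guarantee.
    return {f: sorted({v for r in rows for v in r[f]}) for f in fields}

result = distinct_values(rows, ["feat1", "feat2"])
print(result["feat1"])  # ['feat1_1', 'feat1_2', 'feat1_3', 'feat1_4']
print(result["feat2"])  # ['feat2_1', 'feat2_2', 'feat2_3']
```

Note that the empty feat2 list in row 1 simply contributes nothing here, whereas in the explode-based Spark version it would drop the row.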


Source: https://stackoverflow.com/questions/37801889/get-the-distinct-elements-of-an-arraytype-column-in-a-spark-dataframe
