Get the distinct elements of an ArrayType column in a Spark DataFrame

Posted by 空扰寡人 on 2019-12-14 01:10:38

Question


I have a DataFrame with 3 columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1,distinct_feat2
-----------------------------  
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]

What is the best way to do this in Scala?


Answer 1:


You can use collect_set to find the distinct values of each column after applying explode to unnest the array elements in each cell. Suppose your DataFrame is called df:

import org.apache.spark.sql.functions._

val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"), 
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
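One caveat: explode drops rows whose array is empty, so with the sample data above (where row 1 has feat2 = []), chaining two explode calls would discard row 1 entirely and lose its feat1 values; the empty entry visible in the output suggests the data actually contained an empty string rather than an empty array. A sketch that avoids this, assuming the same df, aggregates each column independently and recombines the two one-row results:

```scala
import org.apache.spark.sql.functions._

// Explode each column separately so an empty array in one column
// cannot drop a row's values from the other column's distinct set.
val distinctFeat1 = df.select(explode(col("feat1")).alias("f"))
                      .agg(collect_set("f").alias("distinct_feat1"))
val distinctFeat2 = df.select(explode(col("feat2")).alias("f"))
                      .agg(collect_set("f").alias("distinct_feat2"))

// Both intermediates have exactly one row, so a cross join is cheap here.
val distinct_df = distinctFeat1.crossJoin(distinctFeat2)
```

This also avoids the per-row cross product that exploding two array columns in the same chain produces (harmless for collect_set, but wasted work on wide arrays).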



Answer 2:


The method provided by Psidom works great. Here is a Python (PySpark) function that does the same thing, given a DataFrame and a list of fields:

def array_unique_values(df, fields):
    from pyspark.sql.functions import col, collect_set, explode
    from functools import reduce
    data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
    return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])

And then:

data = array_unique_values(df, my_fields)
data.take(1)
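As a sanity check of the set-union logic itself (independent of Spark), the same distinct-per-field computation can be sketched in plain Python. The rows list below is an assumed stand-in for the DataFrame, with dicts holding list-valued fields:

```python
# Plain-Python sketch of what explode + collect_set computes per field.
# Illustrative only; does not use Spark.
rows = [
    {"id": 1, "feat1": ["feat1_1", "feat1_2", "feat1_3"], "feat2": []},
    {"id": 2, "feat1": ["feat1_2"], "feat2": ["feat2_1", "feat2_2"]},
    {"id": 3, "feat1": ["feat1_4"], "feat2": ["feat2_3"]},
]

def distinct_values(rows, fields):
    # For each field, union the array elements across all rows
    # (explode), then deduplicate (collect_set). Sorted for a
    # deterministic result; collect_set gives no ordering guarantee.
    return {f: sorted({v for r in rows for v in r[f]}) for f in fields}

result = distinct_values(rows, ["feat1", "feat2"])
print(result["feat1"])  # ['feat1_1', 'feat1_2', 'feat1_3', 'feat1_4']
print(result["feat2"])  # ['feat2_1', 'feat2_2', 'feat2_3']
```

Note that the empty feat2 list in row 1 simply contributes nothing here, whereas in the explode-based Spark version it would drop the row.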


Source: https://stackoverflow.com/questions/37801889/get-the-distinct-elements-of-an-arraytype-column-in-a-spark-dataframe
