pyspark: count distinct over a window

匿名 (未验证) 提交于 2019-12-03 08:57:35

问题:

I just tried doing a countDistinct over a window and got this error:

AnalysisException: u'Distinct window functions are not supported: count(distinct color#1926) 

Is there a way to do a distinct count over a window in pyspark?

Here's some example code:

from pyspark.sql import functions as F  #function to calculate number of seconds from number of days days = lambda i: i * 86400  df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"),                     (13, "2017-03-15T12:27:18+00:00", "red"),                     (25, "2017-03-18T11:27:18+00:00", "red")],                     ["dollars", "timestampGMT", "color"])  df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))  #create window by casting timestamp to long (number of seconds) w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))  df = df.withColumn('distinct_color_count_over_the_last_week', F.countDistinct("color").over(w))  df.show() 

This is the output I'd like to see:

+-------+--------------------+------+---------------------------------------+ |dollars|        timestampGMT| color|distinct_color_count_over_the_last_week| +-------+--------------------+------+---------------------------------------+ |     17|2017-03-10 15:27:...|orange|                                      1| |     13|2017-03-15 12:27:...|   red|                                      2| |     25|2017-03-18 11:27:...|   red|                                      1| +-------+--------------------+------+---------------------------------------+ 

回答1:

I figured out that I can use a combination of the collect_set and size functions to mimic the functionality of countDistinct over a window:

from pyspark.sql import functions as F  #function to calculate number of seconds from number of days days = lambda i: i * 86400  #create some test data df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"),                     (13, "2017-03-15T12:27:18+00:00", "red"),                     (25, "2017-03-18T11:27:18+00:00", "red")],                     ["dollars", "timestampGMT", "color"])  #convert string timestamp to timestamp type              df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))  #create window by casting timestamp to long (number of seconds) w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))  #use collect_set and size functions to perform countDistinct over a window df = df.withColumn('distinct_color_count_over_the_last_week', F.size(F.collect_set("color").over(w)))  df.show() 

This results in the distinct count of color over the previous week of records:

+-------+--------------------+------+---------------------------------------+ |dollars|        timestampGMT| color|distinct_color_count_over_the_last_week| +-------+--------------------+------+---------------------------------------+ |     17|2017-03-10 15:27:...|orange|                                      1| |     13|2017-03-15 12:27:...|   red|                                      2| |     25|2017-03-18 11:27:...|   red|                                      1| +-------+--------------------+------+---------------------------------------+ 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!