Multiple criteria for aggregation on a PySpark DataFrame

Posted by 笑着哭i on 2019-12-30 08:10:11

Question


I have a PySpark DataFrame that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku, and then calculate the min and max dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date':'max'}) \
    .limit(10) \
    .show()

the behavior is the same as in Pandas: I only get the sku and max(date) columns. As far as I can tell, this is plain Python behaviour rather than Spark: a dict literal cannot hold duplicate keys, so the second 'date' entry silently overwrites the first before .agg() ever sees it:
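>>> {'date': 'min', 'date': 'max'}
{'date': 'max'}

In Pandas I would normally pass a list of aggregations per column to get the results I want, so I tried the equivalent here: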

df_testing.groupBy('sku') \
    .agg({'date': ['min','max']}) \
    .limit(10) \
    .show()

However, in PySpark this does not work and raises a java.util.ArrayList cannot be cast to java.lang.String error. Could anyone please point me to the correct syntax?

Thanks.


Answer 1:


You cannot express two aggregations on the same column with a dict. Pass Column expressions from pyspark.sql.functions instead:

>>> from pyspark.sql import functions as F
>>>
>>> # Pass one Column expression per aggregation; both can target the same column
>>> df_testing.groupBy('sku').agg(F.min('date'), F.max('date')).show()
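
As a small extension (the alias names min_date and max_date are my own choice, not part of the original answer), you can name the output columns with alias() and unpack a list of expressions, which is handy when the aggregations are built dynamically:

>>> # Build the aggregation expressions once, then unpack them into agg()
>>> exprs = [F.min('date').alias('min_date'),
...          F.max('date').alias('max_date')]
>>> df_testing.groupBy('sku').agg(*exprs).show()

Without alias(), Spark names the result columns min(date) and max(date), which can be awkward to reference downstream.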


Source: https://stackoverflow.com/questions/40274508/multiple-criteria-for-aggregation-on-pyspark-dataframe
