Store aggregate value of a PySpark dataframe column into a variable


故里飘歌 2021-01-13 09:37

I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. So when I try to get a distinct count of event_date, the resu

6 Answers

南方客 (OP)

2021-01-13 10:09

I'm pretty sure df.select([max('event_date')]) returns a DataFrame because select() always returns a DataFrame, even when the only expression in it is an aggregate. In that case the result has a single row and a single column holding the max value. It does not matter whether several rows share the same max event_date; the aggregate still produces one value.

df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.

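If you want code to get the max event_date and store it as a variable, try the following: max_date = df.select(max('event_date')).collect()[0][0]. Here max is pyspark.sql.functions.max, not Python's built-in max. collect() returns a list of Row objects, so the [0][0] picks the first row and then its only column; without it max_date would be a list rather than the timestamp itself.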
