Store aggregate value of a PySpark dataframe column into a variable

故里飘歌 2021-01-13 09:37

I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. So when I try to get a distinct count of event_date, the resu

6 answers
  •  南方客 (original poster)
    2021-01-13 10:09

    df.select([max('event_date')]) returns a DataFrame because select is a transformation, and transformations always return a DataFrame. With an aggregate such as max, that DataFrame has exactly one row and one column holding the maximum, no matter how many rows in the source share that max event_date. To get the value itself you need an action such as collect() or first(), which brings the row back to the driver.

    df.select('event_date').distinct().count() returns an integer because count() is an action: it tells you how many distinct values there are in that column. It does NOT tell you which value is the largest.

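    If you want to get the max event_date and store it in a variable, note that collect() returns a list of Row objects, so index into the result to get the value itself: max_date = df.select(max('event_date')).collect()[0][0]. Here max is pyspark.sql.functions.max (from pyspark.sql.functions import max), not Python's built-in max. The distinct() call is unnecessary because the aggregate already yields a single row.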
