I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. So when I try to get a distinct count of event_date, the resu
df.select([max('event_date')]) returns a DataFrame because select() always returns a DataFrame, whatever expressions you pass to it. With an aggregate such as max, that DataFrame holds a single row with a single column containing the largest event_date, even if several rows in your data happen to share that same max event_date.
df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.
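If you want code to get the max event_date and store it as a variable, try the following: max_date = df.select(max('event_date')).collect()[0][0]. collect() returns a list of Row objects, so the [0][0] pulls the value out of the first column of the first row. The .distinct() call is not needed, since the aggregate already produces a single row. Note that max here has to be pyspark.sql.functions.max, not Python's built-in max.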