Calculate the mode of a PySpark DataFrame column?

再見小時候 · 2021-01-05 15:52

Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or

4 Answers
  •  被撕碎了的回忆
    2021-01-05 16:33

    >>> df=newdata.groupBy('columnName').count()
    >>> mode = df.orderBy(df['count'].desc()).collect()[0][0]
    
See my result:
    
    >>> newdata.groupBy('var210').count().show()
    +------+-----+
    |var210|count|
    +------+-----+
    |  3av_|   64|
    |  7A3j|  509|
    |  g5HH| 1489|
    |  oT7d|  109|
    |  DM_V|  149|
    |  uKAI|44883|
    +------+-----+
    
    # store the above result in df
    >>> df=newdata.groupBy('var210').count()
    >>> df.orderBy(df['count'].desc()).collect()
    [Row(var210='uKAI', count=44883),
    Row(var210='g5HH', count=1489),
    Row(var210='7A3j', count=509),
    Row(var210='DM_V', count=149),
    Row(var210='oT7d', count=109),
    Row(var210='3av_', count=64)]
    
    # get the first value using collect()
    >>> mode = df.orderBy(df['count'].desc()).collect()[0][0]
    >>> mode
    'uKAI'
    

    The `groupBy()` call counts the occurrences of each category in the column. `df` is the resulting DataFrame with two columns, `var210` and `count`. Calling `orderBy()` on the `count` column in descending order puts the most frequent value in the first row. `collect()[0][0]` then retrieves the first field of that first row, which is the mode.
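    Since the question asks for the mode of *every* column, the same pattern can be wrapped in a small loop. This is a sketch, not from the answer above: the helper names `column_mode` and `all_modes` are hypothetical, and it assumes an existing SparkSession and DataFrame.

    ```python
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def column_mode(df: DataFrame, col: str):
        """Most frequent value in `col` (ties broken arbitrarily)."""
        counts = df.groupBy(col).count()
        # first() fetches only one row instead of collecting the whole result
        return counts.orderBy(F.desc("count")).first()[0]

    def all_modes(df: DataFrame) -> dict:
        """Mode of every column; runs one groupBy job per column."""
        return {c: column_mode(df, c) for c in df.columns}
    ```

    Using `.first()` rather than `.collect()[0]` avoids pulling the entire grouped result back to the driver. Note that on Spark 3.4+ there is also a built-in `pyspark.sql.functions.mode` aggregate that can do this in a single `agg()` call.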
