Calculate the mode of a PySpark DataFrame column?

Asked by 再見小時候 on 2021-01-05 15:52

Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or …
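A minimal sketch of the kind of thing I'm after, assuming an existing SparkSession; the sample data and column names are just for illustration:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    # Illustrative data; any DataFrame would be handled the same way.
    df = spark.createDataFrame(
        [(1, "a"), (1, "b"), (2, "b"), (1, "b")], ["x", "y"])

    # For each column: count value frequencies, sort by count, take the top value.
    modes = {c: df.groupBy(c).count().orderBy(F.desc("count")).first()[0]
             for c in df.columns}
    print(modes)  # {'x': 1, 'y': 'b'}; ties are broken arbitrarily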

4 Answers
  •  梦谈多话
     2021-01-05 16:38

    You can calculate a column's mode with the Java Dataset API as follows; the same group-and-join approach translates directly to PySpark (see the sketch after the code):

        // Imports needed by the snippet below:
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.functions;

        // Inside a switch over the statistic to compute:
        case MODE:
            // Count how often each value occurs in the column.
            Dataset<Row> cnts = ds.groupBy(column).count();
            // Keep the row(s) whose count equals the maximum count.
            Dataset<Row> dsMode = cnts.join(
                    cnts.agg(functions.max("count").alias("max_")),
                    functions.col("count").equalTo(functions.col("max_")));
            // limit(1) picks one value arbitrarily if several tie for the top count.
            Dataset<Row> mode = dsMode.limit(1).select(column);
            replaceValue = mode.first().get(0);
            ds = replaceWithValue(ds, column, replaceValue);
            break;

        // Replace nulls in the column with the computed mode.
        private static Dataset<Row> replaceWithValue(Dataset<Row> ds, String column,
                Object replaceValue) {
            return ds.withColumn(column,
                    functions.coalesce(functions.col(column), functions.lit(replaceValue)));
        }
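
    Since the question asks about PySpark, the same max-count join can be sketched there as well (a sketch under assumptions: the DataFrame `df` and the column name "x" are illustrative, not from the code above):

        import pyspark.sql.functions as F

        cnts = df.groupBy("x").count()
        # Join the counts against their maximum to keep the most frequent value(s).
        mode_value = (cnts.join(cnts.agg(F.max("count").alias("max_")),
                                F.col("count") == F.col("max_"))
                          .limit(1)
                          .select("x")
                          .first()[0])

        # Mirror replaceWithValue: fill nulls in the column with the mode.
        df = df.withColumn("x", F.coalesce(F.col("x"), F.lit(mode_value)))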
    
