Calculate the mode of a PySpark DataFrame column?

再見小時候 2021-01-05 15:52

Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or

4 answers

梦谈多话 (OP)

2021-01-05 16:38

You can calculate the mode of a column in Java as follows. The snippet is one branch of a switch that fills missing values in the column with that mode:

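            case MODE:
                // Needs org.apache.spark.sql.Dataset, Row and functions.
                // Count each non-null value; nulls are skipped so they cannot win.
                Dataset<Row> cnts = ds.filter(functions.col(column).isNotNull())
                        .groupBy(column).count();
                // Keep only the value(s) whose count equals the maximum count.
                Dataset<Row> dsMode = cnts.join(
                        cnts.agg(functions.max("count").alias("max_")),
                        functions.col("count").equalTo(functions.col("max_")));
                // On ties, limit(1) keeps an arbitrary one of the modal values.
                Dataset<Row> mode = dsMode.limit(1).select(column);
                replaceValue = mode.first().get(0);
                ds = replaceWithValue(ds, column, replaceValue);
                break;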

// Replaces nulls in the given column with replaceValue; other values are unchanged.
private static Dataset<Row> replaceWithValue(Dataset<Row> ds, String column, Object replaceValue) {
    return ds.withColumn(column,
            functions.coalesce(functions.col(column), functions.lit(replaceValue)));
}
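If it helps to check the logic without a Spark session, here is a minimal sketch of the same three steps (count per value, take the value with the highest count, fill nulls with it) in plain Java over a list. The class ModeSketch and its methods mode and fillNulls are hypothetical names made up for this illustration, not part of Spark. It also assumes that nulls should not count as a candidate mode and that ties go to the value seen first, which the Spark code does not guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; not a Spark API.
class ModeSketch {

    // Returns the most frequent non-null value, or null if there is none.
    // Assumption: ties go to the value that appears first in the input.
    static <T> T mode(List<T> values) {
        // Step 1: local equivalent of groupBy(column).count(), skipping nulls.
        Map<T, Long> counts = new LinkedHashMap<>();
        for (T v : values) {
            if (v != null) {
                counts.merge(v, 1L, Long::sum);
            }
        }
        // Step 2: keep the key with the largest count.
        T best = null;
        long bestCount = 0;
        for (Map.Entry<T, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Step 3: local equivalent of coalesce(col(column), lit(replacement)).
    static <T> List<T> fillNulls(List<T> values, T replacement) {
        List<T> out = new ArrayList<>(values.size());
        for (T v : values) {
            out.add(v != null ? v : replacement);
        }
        return out;
    }
}
```

To cover every column of the DataFrame, as the question asks, the Spark version would presumably be run once per column name in a loop.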
