Ultimately what I want is the mode of a column, for all the columns in the DataFrame. For other summary statistics, I see a couple of options: use DataFrame aggregation, or
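You can calculate the mode of a column in Java as follows. The snippet is one case of a larger switch statement: it finds the most frequent value in the column and then uses it to replace nulls.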
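case MODE:
    // Count how often each distinct value occurs. Nulls form their own group,
    // so if null is the most frequent value the mode is null.
    Dataset<Row> cnts = ds.groupBy(column).count();
    // Keep only the rows whose count equals the maximum count.
    Dataset<Row> dsMode = cnts.join(
        cnts.agg(functions.max("count").alias("max_")),
        functions.col("count").equalTo(functions.col("max_")));
    // If several values tie for the maximum, limit(1) picks one arbitrarily.
    Dataset<Row> mode = dsMode.limit(1).select(column);
    // replaceValue is an Object declared outside this case.
    replaceValue = mode.first().get(0);
    ds = replaceWithValue(ds, column, replaceValue);
    break;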
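The replaceWithValue helper fills the nulls in a column with the given value:
private static Dataset<Row> replaceWithValue(Dataset<Row> ds, String column, Object replaceValue) {
    // coalesce returns the first non-null argument, so existing values are kept.
    return ds.withColumn(column,
        functions.coalesce(functions.col(column), functions.lit(replaceValue)));
}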
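If it helps to see the logic without Spark, here is a minimal sketch of the same count-then-take-the-max idea on a plain Java list. It is only an illustration, not the Spark code: the class name ModeSketch and both method signatures are made up for this example, and unlike the groupBy above it skips nulls while counting, which is an assumption about what you want when the goal is to fill nulls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Spark logic, using an in-memory list as the "column".
public class ModeSketch {

    // Returns the most frequent non-null value; ties go to the value seen first.
    // Returns null if the column has no non-null values.
    public static Object mode(List<Object> column) {
        Map<Object, Long> counts = new LinkedHashMap<>();
        for (Object v : column) {
            if (v != null) {
                counts.merge(v, 1L, Long::sum); // same role as groupBy(column).count()
            }
        }
        Object best = null;
        long bestCount = 0;
        for (Map.Entry<Object, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { // same role as matching count to max(count)
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Same role as coalesce(col, lit(replaceValue)): keep non-null values, fill nulls.
    public static List<Object> replaceWithValue(List<Object> column, Object replaceValue) {
        List<Object> out = new ArrayList<>();
        for (Object v : column) {
            out.add(v != null ? v : replaceValue);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> column = Arrays.asList("a", null, "b", "a", null);
        Object m = mode(column);
        System.out.println(m);                           // prints a
        System.out.println(replaceWithValue(column, m)); // prints [a, a, b, a, a]
    }
}
```

To cover every column, as the question asks, you would repeat the same two steps per column name. In Spark that means one groupBy job per column, so it may be slow on wide tables.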