Question:
I'm trying to find a way to calculate the mean of each row of a Spark DataFrame in Scala, ignoring NAs. In R there is the very convenient function rowMeans, where you can tell it to ignore NAs:
rowMeans(df, na.rm = TRUE)
I can't find a corresponding function for Spark DataFrames, and I wonder if anyone has a suggestion or knows whether this is possible. Replacing the NAs with 0 won't do, since that would affect the denominator.
I found a similar question here; however, my DataFrame will have hundreds of columns.
Any help and shared insights are appreciated, cheers!
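For reference, the semantics that rowMeans(df, na.rm = TRUE) provides in R, sketched as a minimal plain-Scala model (not Spark code) with Option standing in for NA; the helper name is illustrative:

```scala
// Each row is a sequence of nullable numeric cells; NA is modeled as None.
def rowMeanNaRm(row: Seq[Option[Double]]): Option[Double] = {
  val present = row.flatten // keep only the non-NA cells
  if (present.isEmpty) None // an all-NA row has no mean
  else Some(present.sum / present.size) // divide by the count of present cells
}

val rows = Seq(
  Seq(Some(1.0), None, Some(3.0)),     // NA ignored: mean of 1.0 and 3.0
  Seq(Some(5.0), Some(6.0), Some(7.0)) // no NAs: ordinary mean
)
val means = rows.map(rowMeanNaRm)
```

The key point is that the denominator is the number of non-NA cells in the row, not the number of columns, which is why substituting 0 for NA gives the wrong answer.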
Answer 1:
Aggregate functions like mean ignore nulls by default. Even with mixed numeric and string columns, this drops the strings and nulls and averages only the numeric values (note that it computes one mean per column, not per row):
import org.apache.spark.sql.functions.{col, mean}
df.select(df.columns.map(c => mean(col(c))): _*).show
Answer 2:
You can do this by first identifying which fields are numeric, and then selecting their mean for each row...
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types._
import spark.implicits._ // for toDF; `spark` is the SparkSession, as in spark-shell

val df = List(("a", 1, 2, 3.0), ("b", 5, 6, 7.0)).toDF("s1", "i1", "i2", "i3")

// grab the numeric fields by data type
val numericTypes: Set[DataType] = Set(ShortType, IntegerType, LongType, FloatType, DoubleType)
val numericFields = df.schema.fields.filter(f => numericTypes.contains(f.dataType)).map(_.name)

// row mean = sum of the numeric columns / number of numeric columns
val rowMeans = df.select(numericFields.map(col).reduce(_ + _) / lit(numericFields.length) as "row_mean")
rowMeans.show
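One caveat: in Spark, `+` between columns follows SQL semantics and returns null whenever either operand is null, so rows containing NAs get a null row mean rather than an NA-ignoring one. A common workaround is to coalesce each cell to 0 in the numerator and divide by the count of non-null cells. The arithmetic, sketched in plain Scala with Option standing in for a nullable column value (the Spark expressions named in the comments are the intended counterparts, not code taken from the answer above):

```scala
// A row with one NA, modeled as Option.
val row: Seq[Option[Double]] = Seq(Some(1.0), None, Some(3.0))

// Numerator: treat missing cells as 0
// (what coalesce(col(f), lit(0)) would do per column in Spark).
val numerator = row.map(_.getOrElse(0.0)).sum

// Denominator: count only the non-null cells
// (what summing when(col(f).isNotNull, 1).otherwise(0) would do in Spark).
val denominator = row.count(_.isDefined)

val rowMean = numerator / denominator
```

With this split, the NA contributes nothing to the sum and is excluded from the count, matching R's na.rm = TRUE behavior.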
Source: https://stackoverflow.com/questions/43179729/calculate-row-mean-ignoring-nas-in-spark-scala