Calculate row mean, ignoring NAs in Spark Scala

Submitted by 狂风中的少年 on 2019-12-11 03:24:15

Question


I'm trying to find a way to calculate the mean of each row in a Spark DataFrame in Scala while ignoring NAs. In R, there is a very convenient function called rowMeans that can be told to ignore NAs:

rowMeans(df, na.rm = TRUE)

I'm unable to find a corresponding function for Spark DataFrames, and I wonder if anyone has a suggestion or input on whether this is possible. Replacing the NAs with 0 won't do, since that would affect the denominator.

I found a similar question here; however, my dataframe will have hundreds of columns.

Any help and shared insights are appreciated, cheers!


Answer 1:


Functions like mean usually ignore nulls by default. Even if there are some mixed columns with numeric and string types, this one drops the strings and nulls and computes the mean over the numeric values only:

import org.apache.spark.sql.functions.{col, mean}

// mean of every column; mean/avg skips nulls in both the sum and the count
df.select(df.columns.map(c => mean(col(c))): _*).show
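
For instance, a quick check with a hypothetical toy DataFrame containing a null (the name withNulls is mine, and a spark-shell session is assumed so that toDF is available) shows that the null row is excluded from both the sum and the count:

val withNulls = List(("a", Some(1.0)), ("b", None), ("c", Some(3.0))).toDF("id", "x")

withNulls.select(mean(col("x"))).show
// result is 2.0 = (1.0 + 3.0) / 2, not (1.0 + 3.0) / 3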



Answer 2:


You can do this by first identifying which fields are numeric, and then computing their mean for each row:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // for toDF; `spark` is the SparkSession

val df = List(("a", 1, 2, 3.0), ("b", 5, 6, 7.0)).toDF("s1", "i1", "i2", "i3")

// grab the names of the numeric fields
val numericFields = df.schema.fields.filter { f =>
  f.dataType == IntegerType || f.dataType == LongType ||
  f.dataType == FloatType || f.dataType == DoubleType || f.dataType == ShortType
}.map(_.name)

// compute the row mean: sum the numeric columns, divide by how many there are
// (caveat: a null in any field makes the whole sum, and hence the mean, null)
val rowMeans = df.select(numericFields.map(col).reduce(_ + _) / lit(numericFields.length) as "row_mean")

rowMeans.show
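
Since the question asks to ignore NAs, here is a minimal sketch of a null-safe variant, building on the numericFields computed above (the names rowSum, rowCount, and rowMeansNaRm are mine): count the non-null fields in each row and use that count as the denominator.

import org.apache.spark.sql.functions._

// numerator: treat nulls as 0 so they don't turn the whole sum null
val rowSum = numericFields
  .map(f => coalesce(col(f).cast("double"), lit(0.0)))
  .reduce(_ + _)

// denominator: count only the non-null fields in each row
val rowCount = numericFields
  .map(f => when(col(f).isNotNull, 1).otherwise(0))
  .reduce(_ + _)

// mean over the non-null values; null when every numeric field in the row is null
val rowMeansNaRm = df.select(when(rowCount > 0, rowSum / rowCount) as "row_mean")

rowMeansNaRm.show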


Source: https://stackoverflow.com/questions/43179729/calculate-row-mean-ignoring-nas-in-spark-scala
