Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame

Submitted by 柔情痞子 on 2019-12-07 07:03:34

Question


I have a dataframe with two columns, "date" and "value". How do I add two new columns, "value_mean" and "value_sd", where "value_mean" is the average of "value" over the last 10 days (including the current day, as specified in "date") and "value_sd" is the standard deviation of "value" over the same 10 days?


Answer 1:


Spark SQL provides various DataFrame aggregate functions such as avg, mean, sum, etc.

You just have to apply them to a DataFrame column using a Spark SQL Column:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

Create a private method for the standard deviation:

// Population standard deviation: sqrt(E[X^2] - E[X]^2)
private def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))
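
(For reference, this computes the population standard deviation; newer Spark versions also ship built-in stddev_pop and stddev_samp functions in org.apache.spark.sql.functions.)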

Now you can create SQL Columns for the average and standard deviation:

val value_sd: org.apache.spark.sql.Column = stddev(df.col("value")).as("value_sd")
val value_mean: org.apache.spark.sql.Column = avg(df.col("value")).as("value_mean")

Filter your dataframe for the last 10 days, or however you like:

val filterDF = df.filter("") // put your filter condition here
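
As a minimal sketch, assuming "date" is a DateType column and "last 10 days" means the 10 calendar days ending today, the condition could look like this:

import org.apache.spark.sql.functions.{col, current_date, date_sub}

// date_sub(current_date(), 9) is 9 days back, so the window spans
// 10 days inclusive of today
val filterDF = df.filter(col("date") >= date_sub(current_date(), 9))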

Now you can apply the aggregate functions to your filterDF:

filterDF.agg(value_sd, value_mean).show
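
Note that agg collapses the filtered dataframe into a single summary row. If the goal is to attach a rolling 10-day mean and standard deviation to every row, as the question asks, a window function can compute them per row. Here is a sketch, assuming "date" is a DateType column and using Spark's built-in stddev (sample standard deviation):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, stddev, unix_timestamp}

// Order rows by date expressed in epoch seconds so rangeBetween can
// describe a window of "the previous 9 days plus the current day".
// Without partitionBy, all rows flow through a single partition,
// which is fine for small data but does not scale.
val secondsPerDay = 86400L
val w = Window
  .orderBy(unix_timestamp(col("date")))
  .rangeBetween(-9 * secondsPerDay, 0)

val result = df
  .withColumn("value_mean", avg(col("value")).over(w))
  .withColumn("value_sd", stddev(col("value")).over(w))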


Source: https://stackoverflow.com/questions/35348519/spark-scala-how-do-i-iterate-rows-in-dataframe-and-add-calculated-values-as-n
