Question
I'm using Spark 2.2, and I want to normalize each value in a fixed-size array.
Input:
{"values": [1, 2, 3, 4]}
Output:
{"values": [0.25, 0.5, 0.75, 1]}
For now, I'm using a UDF:
import org.apache.spark.sql.functions.udf

// Divide each element by the array's maximum
val f = udf { (l: Seq[Double]) =>
  val max = l.max
  l.map(_ / max)
}
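For reference, a minimal sketch of how this UDF is applied (the SparkSession setup and the sample row are illustrative, mirroring the input above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample DataFrame matching the input shown above
val df = Seq(Seq(1.0, 2.0, 3.0, 4.0)).toDF("values")

// Replace the column with its normalized version
df.withColumn("values", f($"values")).show(false)
// [0.25, 0.5, 0.75, 1.0]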
Is there a way to avoid the UDF (and the associated performance penalty)?
Answer 1:
Let's say that the number of elements in each array is n:
val n: Int
Then:
import org.apache.spark.sql.functions._

df
  .withColumn("max", greatest((0 until n).map(i => col("values")(i)): _*))
  .withColumn("values", array((0 until n).map(i => col("values")(i) / col("max")): _*))
Answer 2:
I've come up with an optimized version of my UDF, which performs in-place updates.
import scala.collection.mutable

// n is the fixed array size; the buffer is updated in place to avoid
// allocating a new collection
val optimizedNormalizeUdf = udf { (l: mutable.WrappedArray[Double]) =>
  val max = l.max
  (0 until n).foreach(i => l.update(i, l(i) / max))
  l
}
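A hedged usage sketch, applied the same way as the naive UDF (the df and the values column are the assumptions from the question's example):

// n must be in scope (the fixed array size, 4 in the running example)
val result = df.withColumn("values", optimizedNormalizeUdf($"values"))
result.show(false) // [0.25, 0.5, 0.75, 1.0]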
I've written a benchmark to check the performance of the solution proposed by user8838736. Here are the results:
[info] Benchmark Mode Cnt Score Error Units
[info] NormalizeBenchmark.builtin avgt 10 140,293 ± 10,805 ms/op
[info] NormalizeBenchmark.udf_naive avgt 10 104,708 ± 7,421 ms/op
[info] NormalizeBenchmark.udf_optimized avgt 10 99,492 ± 7,829 ms/op
Conclusion: the UDF is the most performant solution in this case.
PS: For those who are interested, the source code of the benchmark is here: https://github.com/YannMoisan/spark-jmh
Source: https://stackoverflow.com/questions/46513246/how-to-normalize-an-array-column-in-a-dataframe