Question
I'm using Spark 2.2, and I want to normalize each value in a fixed-size array.
Input:
{"values": [1, 2, 3, 4]}
Output:
{"values": [0.25, 0.5, 0.75, 1]}
For now, I'm using a UDF:
import org.apache.spark.sql.functions.udf

// Divide each element by the array's maximum
val f = udf { (l: Seq[Double]) =>
  val max = l.max
  l.map(_ / max)
}
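For reference, a minimal sketch of how this UDF is applied (the SparkSession setup and the sample row are illustrative, mirroring the input above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample DataFrame matching the input shown above
val df = Seq(Seq(1.0, 2.0, 3.0, 4.0)).toDF("values")

// Replace the column with its normalized version
df.withColumn("values", f($"values")).show(false)
// [0.25, 0.5, 0.75, 1.0]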
Is there a way to avoid the UDF (and the associated performance penalty)?
Answer 1:
Let's say that the number of elements in each array is n:
val n: Int
Then:
import org.apache.spark.sql.functions._

df
  .withColumn("max", greatest((0 until n).map(i => col("values")(i)): _*))
  .withColumn("values", array((0 until n).map(i => col("values")(i) / col("max")): _*))
Answer 2:
I've come up with an optimized version of my UDF, which performs in-place updates.
import scala.collection.mutable

// n is the fixed array size; the buffer is updated in place to avoid
// allocating a new collection
val optimizedNormalizeUdf = udf { (l: mutable.WrappedArray[Double]) =>
  val max = l.max
  (0 until n).foreach(i => l.update(i, l(i) / max))
  l
}
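A hedged usage sketch, applied the same way as the naive UDF (the df and the values column are the assumptions from the question's example):

// n must be in scope (the fixed array size, 4 in the running example)
val result = df.withColumn("values", optimizedNormalizeUdf($"values"))
result.show(false) // [0.25, 0.5, 0.75, 1.0]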
I've written a benchmark to check the performance of the solution proposed by user8838736. Here are the results:
[info] Benchmark Mode Cnt Score Error Units
[info] NormalizeBenchmark.builtin avgt 10 140,293 ± 10,805 ms/op
[info] NormalizeBenchmark.udf_naive avgt 10 104,708 ± 7,421 ms/op
[info] NormalizeBenchmark.udf_optimized avgt 10 99,492 ± 7,829 ms/op
Conclusion: the UDF is the most performant solution in this case.
PS: For those who are interested, the source code of the benchmark is here: https://github.com/YannMoisan/spark-jmh
Source: https://stackoverflow.com/questions/46513246/how-to-normalize-an-array-column-in-a-dataframe