Spark DataFrames when udf functions do not accept large enough input variables

zero323

User defined functions are defined for up to 22 parameters, but the udf helper is only defined for at most 10 arguments. To handle functions with a larger number of parameters you can use org.apache.spark.sql.UDFRegistration.

For example

val dummy = ((
  x0: Int, x1: Int, x2: Int, x3: Int, x4: Int, x5: Int, x6: Int, x7: Int, 
  x8: Int, x9: Int, x10: Int, x11: Int, x12: Int, x13: Int, x14: Int, 
  x15: Int, x16: Int, x17: Int, x18: Int, x19: Int, x20: Int, x21: Int) => 1)

can be registered:

import org.apache.spark.sql.expressions.UserDefinedFunction

val dummyUdf: UserDefinedFunction = spark.udf.register("dummy", dummy)

and used directly:

import org.apache.spark.sql.functions.lit

val df = spark.range(1)
val exprs = (0 to 21).map(_ => lit(1))

df.select(dummyUdf(exprs: _*))

or called by name via callUDF:

import org.apache.spark.sql.functions.callUDF

df.select(
  callUDF("dummy", exprs:  _*).alias("dummy")
)

or with a SQL expression:

df.selectExpr(s"""dummy(${Seq.fill(22)(1).mkString(",")})""")

You can also create a UserDefinedFunction object directly:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types.IntegerType
import spark.implicits._  // for toDF on local Seqs

Seq(1).toDF.select(UserDefinedFunction(dummy, IntegerType, None)(exprs: _*))

In practice, a function with 22 arguments is not very useful, and unless you want to use Scala reflection to generate these signatures, they are a maintenance nightmare.

I would consider either using collections (array, map) or a struct as the input, or dividing this into multiple modules. For example:

import org.apache.spark.sql.functions.{array, lit, udf}

val aLongArray = array((0 to 256).map(_ => lit(1)): _*)

val udfWitharray = udf((xs: Seq[Int]) => 1)

Seq(1).toDF.select(udfWitharray(aLongArray).alias("dummy"))
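
The same idea works with a struct as the input: the struct column arrives in the Scala function as a Row, so the fields can have different types. A minimal sketch, assuming made-up column names a, b and c:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Pack heterogeneous columns into a single struct and unpack the Row inside the UDF
val udfWithStruct = udf((r: Row) => r.getInt(0) + r.getString(1).length + r.getDouble(2).toInt)

val structDf = Seq((1, "x", 2.0)).toDF("a", "b", "c")

structDf.select(udfWithStruct(struct(col("a"), col("b"), col("c"))).alias("dummy"))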

Just to expand on zero323's answer, it is possible to get the .withColumn() function to work with a UDF that has more than 10 parameters. You just need to spark.udf.register() the function and then use an expr as the argument that adds the column (instead of a udf).

For example, something like this should work:

def mergeFunction(...) // with 14 input variables

spark.udf.register("mergeFunction", mergeFunction) // make it available in expressions

df.groupBy("id").agg(
  collect_list(df(...)) as ...,
  ... // too many of these (something like 14 of them)
).withColumn("features_labels",
  expr("mergeFunction(col1, col2, col3, col4, ...)")) // pass in the 14 column names
 .select("id", "features_labels")

The underlying expression parser seems to handle more than 10 parameters, so I don't think you have to resort to passing around arrays to call the function. Also, if the parameters happen to be different data types, arrays would not work very well.
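
To make this concrete, here is a runnable sketch of the register-and-expr approach; the column names (c1 .. c12) and the merge logic are hypothetical stand-ins, with 12 columns of mixed types:

import org.apache.spark.sql.functions.expr

// Stand-in merge logic; the real function would combine the 12 inputs however needed
val merge = (
    c1: Long, c2: Long, c3: Long, c4: Long, c5: Long, c6: Long,
    c7: Long, c8: Long, c9: Long, c10: Long, c11: Double, c12: String) =>
  s"$c1:$c11:$c12"

spark.udf.register("mergeFunction", merge)

// Build a wide DataFrame with 12 columns of mixed types (10 longs, a double, a string)
val wide = spark.range(3).selectExpr(
  ((1 to 10).map(i => s"id AS c$i") ++
    Seq("CAST(id AS DOUBLE) AS c11", "CAST(id AS STRING) AS c12")): _*)

wide
  .withColumn("features_labels",
    expr(s"mergeFunction(${(1 to 12).map("c" + _).mkString(", ")})"))
  .select("features_labels")
  .show()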
