Define spark udf by reflection on a String

末鹿安然 提交于 2019-12-04 11:12:09

As Giovanny observed, the problem lies in the class loaders being different (you can investigate this more by calling .getClass.getClassLoader on whatever object). Then, when the workers try to deserialize your reflected function, all hell breaks loose.

Here is a solution that does not involve any class loader hackery. The idea is to move the reflection step to the workers. We'll end up having to redo the reflection step, but only once per worker. I think this is pretty optimal - even if you did the reflection only once on the master node, you would have to do a fair bit of work per worker to get them to recognize the function.

val f = udf {
  new Function1[String,Int] with Serializable {
    import scala.reflect.runtime.universe._
    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    lazy val toolbox = currentMirror.mkToolBox()
    lazy val func = {
      println("reflected function") // triggered at every worker
      toolbox.eval(toolbox.parse("(s:String) => 5")).asInstanceOf[String => Int]
    }

    def apply(s: String): Int = func(s)
  }
}

Then, calling sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show works just fine.

Feel free to comment out the println - it is just an easy way of counting how many times the reflection happened. In spark-shell --master 'local' that's only once, but in spark-shell --master 'local[2]' it's twice.

How it works

The UDF gets evaluated immediately, but it never gets used until it reaches the worker nodes, so the lazy values toolbox and func only get evaluated on the workers. Furthermore, since they are lazy, they only ever get evaluated once per worker.

I had the same error, and it doesn't show the ClassNotFoundException because the JavaDeserializationStream class is catching the exception, depending on your environment it is failing because it coudn't find the class you're trying to execute from your RDD/DataSet but it doesn't show the ClassNotFoundError . To fix this issue I had to generate a jar with all the classes on my project (including the function and dependencies ) and include the jar inside the spark environment

This for an standalone cluster

conf.setJars ( Array ("/fullpath/yourgeneratedjar.jar", "/fullpath/otherdependencies.jar") )

and this for a yarn cluster

conf.set("spark.yarn.jars", "/fullpath/yourgeneratedjar.jar,/fullpath/otherdependencies.jar")
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!