Register UDF to SqlContext from Scala to use in PySpark

蓝咒 提交于 2019-11-29 07:25:39
zero323

As far as I know PySpark doesn't provide any equivalent of the callUDF function and because of that it is not possible to access registered UDF directly.

The simplest solution here is to use raw SQL expression:

mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))

## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")

## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")

This approach is rather limited so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find and example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?

The following worked for me (basically a summary of multiple places including the link provided by zero323):

In scala:

package com.example
import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = {
    udf((x: Int) => x + 1)
  }
}

in python (assume sc is the spark context. If you are using spark 2.0 you can get it from the spark session):

from py4j.java_gateway import java_import
from pyspark.sql.column import Column

jvm = sc._gateway.jvm
java_import(jvm, "com.example")
def udf_f(col):
    return Column(jvm.com.example.udfObj.createUDF().apply(col))

And of course make sure the jar created in scala is added using --jars and --driver-class-path

So what happens here:

We create a function inside a serializable object which returns the udf in scala (I am not 100% sure Serializable is required, it was required for me for more complex UDF so it could be because it needed to pass java objects).

In python we use access the internal jvm (this is a private member so it could be changed in the future but I see no way around it) and import our package using java_import. We access the createUDF function and call it. This creates an object which has the apply method (functions in scala are actually java objects with the apply method). The input to the apply method is a column. The result of applying the column is a new column so we need to wrap it with the Column method to make it available to withColumn.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!