Question
I'm currently working with Spark 2.1 and have a main script that calls a helper module that contains all my transformation methods. In other words:
main.py
helper.py
At the top of my helper.py file, I have several custom UDFs defined in the following manner:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def reformat(s):
    return reformat_logic(s)

reformat_udf = udf(reformat, StringType())
Before I broke all the UDFs out into the helper file, I was able to connect to my Hive metastore through my SparkSession object using spark.sql('sql statement'). However, after I moved the UDFs to the helper file and imported that file at the top of my main script, the SparkSession object could no longer connect to Hive and fell back to the default Derby database. I also get errors when trying to query my Hive tables, such as Hive support is required to insert into the following tables...
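For concreteness, a minimal sketch of my setup (the table name is made up):

# main.py
from helper import reformat_udf

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql('SELECT * FROM my_hive_table')  # fails: the session is on Derby, not Hive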
I've been able to work around the issue by moving my UDFs into a completely separate file and running the import statements for that module only inside the functions that need them (I'm not sure this is good practice, but it works). Does anyone understand why Spark and UDFs behave this way, and does anyone know a good way to share UDFs across applications?
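Roughly, the workaround looks like this (the file, function, and column names are made up):

# udfs.py -- a separate module holding only the UDF definitions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def reformat(s):
    return s.strip().lower()  # stand-in for the real reformat_logic

reformat_udf = udf(reformat, StringType())

# helper.py -- import the UDF module lazily, inside the function that uses it
def apply_transformations(df):
    from udfs import reformat_udf  # deferred until a SparkSession already exists
    return df.withColumn('clean_col', reformat_udf(df['raw_col']))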
Answer 1:
Prior to Spark 2.2.0, UserDefinedFunction eagerly creates a UserDefinedPythonFunction object, which represents the Python UDF on the JVM. This process requires access to a SparkContext and SparkSession. If there are no active instances when UserDefinedFunction.__init__ is called, Spark will automatically initialize the contexts for you.
When you call SparkSession.Builder.getOrCreate after importing the UserDefinedFunction object, it returns the existing SparkSession instance, and only some configuration changes can be applied (enableHiveSupport is not among them).
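You can see this directly: the catalog implementation is fixed when the session is first created, so the later enableHiveSupport request is silently dropped. A sketch of what happens on Spark < 2.2:

from helper import reformat_udf  # eagerly creates a plain SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Prints 'in-memory' rather than 'hive' -- the existing session was reused
print(spark.conf.get('spark.sql.catalogImplementation'))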
To address this problem, initialize the SparkSession before you import the UDF:
from pyspark.sql.session import SparkSession

# Create the Hive-enabled session first, then import the module with the UDFs
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
from helper import reformat_udf
This behavior is described in SPARK-19163 and fixed in Spark 2.2.0. Other API improvements in that release include decorator syntax (SPARK-19160) and improved docstring handling (SPARK-19161).
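For reference, the decorator syntax lets the UDF in helper.py be declared in one step on 2.2.0+ (assuming reformat_logic as defined in the question):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def reformat(s):
    return reformat_logic(s)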
Source: https://stackoverflow.com/questions/43795915/pyspark-2-1-importing-module-with-udfs-breaks-hive-connectivity