Spark UDF with dictionary argument fails

Question

I have a column (myCol) in a Spark DataFrame that contains the values 1 and 2, and I want to create a new column with a description of each value, e.g. 1 -> 'A', 2 -> 'B', etc.

I know that this can be done with a join, but I tried the following because it seems more elegant:

dictionary= { 1:'A' , 2:'B' }

add_descriptions = udf(lambda x , dictionary: dictionary[x] if x in dictionary.keys() else None)

df.withColumn("description",add_descriptions(df.myCol,dictionary))

It fails with the following error:

lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.HashMap]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
        at py4j.Gateway.invoke(Gateway.java:274)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)

Is it not possible to have a user-defined function with dictionaries as arguments?

Answer 1:

It is possible, but the dictionary has to reach the UDF through a closure rather than as a call argument:

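from pyspark.sql.functions import udf

dictionary = {1: 'A', 2: 'B'}

def add_descriptions(in_dict):
    # Capture the dict in a closure; the UDF itself only receives the column value
    def f(x):
        return in_dict.get(x)  # None (null in Spark) for keys missing from the dict
    return udf(f)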

df.withColumn(
    "description",
    add_descriptions(dictionary)(df.myCol)
)
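The original call failed because a Python dict was passed where Spark expects Column arguments, which is what the `col([class java.util.HashMap])` message reports. The factory above avoids this by capturing the dict in a closure. The closure part can be tried in plain Python without Spark; this is only a minimal sketch, and `make_lookup` is a hypothetical name used for illustration:

```python
def make_lookup(mapping):
    """Return a one-argument function that looks its argument up in `mapping`."""
    def lookup(x):
        # dict.get returns None for missing keys instead of raising KeyError
        return mapping.get(x)
    return lookup

lookup = make_lookup({1: 'A', 2: 'B'})
results = [lookup(k) for k in (1, 2, 3)]
print(results)  # ['A', 'B', None]
```

Wrapping the returned function in udf, as the answer does, gives Spark a function of a single column, with the dict shipped along as part of the closure.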

If you would rather pass the mapping directly as an argument, keep in mind that UDFs only accept columns as arguments, so you would need a map column in place of the dict.
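One way to build such a map column is with create_map, which takes alternating key and value columns. The following is a rough sketch rather than a definitive implementation: `flatten_items` and `with_description` are hypothetical helper names, and the column name myCol is taken from the question.

```python
from itertools import chain

def flatten_items(mapping):
    """Flatten {k1: v1, k2: v2} into [k1, v1, k2, v2], the order create_map expects."""
    return list(chain.from_iterable(mapping.items()))

def with_description(df, mapping, key_col="myCol"):
    """Add a 'description' column by looking key_col up in a literal map column."""
    # Imported inside the function so flatten_items can be tried without Spark.
    from pyspark.sql.functions import col, create_map, lit
    map_col = create_map(*[lit(x) for x in flatten_items(mapping)])
    # Keys absent from the map are expected to come back as null.
    return df.withColumn("description", map_col[col(key_col)])

flat = flatten_items({1: 'A', 2: 'B'})
print(flat)  # [1, 'A', 2, 'B']
```

Assuming an existing DataFrame df with a myCol column, it would be called as with_description(df, {1: 'A', 2: 'B'}). No Python UDF is involved here, so the lookup runs as a native Spark expression.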

Answer 2:

If you are using Spark >= 2.4.0, you can also use the built-in map_from_arrays function to create a map on the fly and then get the desired value with getItem, as shown below:

from pyspark.sql.functions import lit, map_from_arrays, array
df = spark.createDataFrame([[1],[2],[3]]).toDF("key")

# Named mapping rather than dict to avoid shadowing the Python built-in
mapping = {1: 'A', 2: 'B'}

# keys() and values() iterate in the same order, so the two arrays line up
map_keys = array([lit(k) for k in mapping.keys()])
map_values = array([lit(v) for v in mapping.values()])
map_func = map_from_arrays(map_keys, map_values)

df = df.withColumn("description", map_func.getItem(df.key))

Output:

+---+-----------+
|key|description|
+---+-----------+
|  1|          A|
|  2|          B|
|  3|       null|
+---+-----------+

Source: https://stackoverflow.com/questions/57037487/spark-udf-with-dictionary-argument-fails

Tags

python

apache-spark

pyspark