Question
I have a column (myCol) in a Spark DataFrame that holds the values 1 and 2, and I want to create a new column with a description of each value, e.g. 1 -> 'A', 2 -> 'B', etc.
I know that this can be done with a join, but I tried the following because it seems more elegant:
dictionary= { 1:'A' , 2:'B' }
add_descriptions = udf(lambda x , dictionary: dictionary[x] if x in dictionary.keys() else None)
df.withColumn("description",add_descriptions(df.myCol,dictionary))
It fails with the following error:
lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.HashMap]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Is it not possible to have a user-defined function that takes a dictionary as an argument?
Answer 1:
It is possible; you just have to do it a bit differently:
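from pyspark.sql.functions import udf

dictionary = {1: 'A', 2: 'B'}

def add_descriptions(in_dict):
    # The dictionary is captured by the closure, so the UDF itself only takes a column.
    def f(x):
        return in_dict.get(x)  # None for keys missing from the dictionary
    return udf(f)

df.withColumn(
    "description",
    add_descriptions(dictionary)(df.myCol)
)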
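The reason this works is that the dictionary never crosses into Spark as an argument: it is bound inside the inner function before udf wraps it. That part can be checked without Spark at all. The following is a minimal sketch in plain Python; the name make_lookup is hypothetical and stands in for add_descriptions without the udf wrapper.

```python
def make_lookup(in_dict):
    # in_dict is captured by the closure; callers of f only pass the key.
    def f(x):
        return in_dict.get(x)  # dict.get returns None when the key is missing
    return f


lookup = make_lookup({1: "A", 2: "B"})

# The returned function takes a single value, just like a one-column UDF.
results = [lookup(k) for k in (1, 2, 3)]
print(results)
```

Because the returned function takes only the key, wrapping it in udf gives a UDF whose single argument is a column.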
UDFs only accept columns as arguments, so if you want to pass the dict to the UDF directly, you need to turn it into a map column first.
Answer 2:
If you are using Spark >= 2.4.0, you can also use the built-in map_from_arrays function to create a map on the fly and then look up the desired value with getItem (or the equivalent indexing syntax), as shown below:
from pyspark.sql.functions import lit, map_from_arrays, array

df = spark.createDataFrame([[1],[2],[3]]).toDF("key")
mapping = {1: 'A', 2: 'B'}
map_keys = array([lit(k) for k in mapping.keys()])
map_values = array([lit(v) for v in mapping.values()])
map_func = map_from_arrays(map_keys, map_values)
# Indexing is equivalent to getItem; passing a Column to getItem is deprecated as of Spark 3.0.
df = df.withColumn("description", map_func[df.key])
Output:
+---+-----------+
|key|description|
+---+-----------+
|  1|          A|
|  2|          B|
|  3|       null|
+---+-----------+
Source: https://stackoverflow.com/questions/57037487/spark-udf-with-dictionary-argument-fails