Calculate UDF once

Posted by 匆匆过客 on 2021-02-08 10:00:12

Question


I want to have a UUID column in a pyspark dataframe that is calculated only once, so that I can select the column in a different dataframe and have the UUIDs be the same. However, the UDF for the UUID column is recalculated when I select the column.

Here's what I'm trying to do:

>>> import uuid
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> uuid_udf = udf(lambda: str(uuid.uuid4()), StringType())
>>> a = spark.createDataFrame([[1, 2]], ['col1', 'col2'])
>>> a = a.withColumn('id', uuid_udf())
>>> a.collect()
[Row(col1=1, col2=2, id='5ac8f818-e2d8-4c50-bae2-0ced7d72ef4f')]
>>> b = a.select('id')
>>> b.collect()
[Row(id='12ec9913-21e1-47bd-9c59-6ddbe2365247')]  # Wanted this to be the same ID as above

Possible workaround: rand()

A possible workaround might be to use pyspark.sql.functions.rand() as my source of randomness. However, there are two problems:

1) I'd like to have letters, not just numbers, in the UUID, so that it doesn't need to be quite as long

2) Though it technically works, it produces ugly UUIDs:

>>> from pyspark.sql.functions import rand, round
>>> a = a.withColumn('id', round(rand() * 10e16))
>>> a.collect()
[Row(col1=1, col2=2, id=7.34745165108606e+16)]
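
To illustrate point 1, here is a minimal plain-Python sketch (not Spark code) of how the same number gets shorter once letters are allowed. The `to_base36` helper and the `ALPHABET` constant are hypothetical names introduced for this illustration, and the value is simply the id printed above written out as an integer.

```python
import string

# 36 symbols: the digits 0-9 followed by the letters a-z
ALPHABET = string.digits + string.ascii_lowercase


def to_base36(n: int) -> str:
    """Encode a non-negative integer using digits and lowercase letters."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


value = 73474516510860600        # the id from the output above, as an integer
as_decimal = str(value)          # digits only
as_hex = format(value, "x")      # digits plus a-f
as_base36 = to_base36(value)     # digits plus a-z

print(len(as_decimal), as_decimal)
print(len(as_hex), as_hex)
print(len(as_base36), as_base36)
```

On the Spark side, pyspark.sql.functions.hex applied to the random value cast to a long might give a comparable hexadecimal encoding, though that would still be recomputed on every action just like the UDF.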

Answer 1:


Use Spark's built-in uuid function instead:

from pyspark.sql.functions import expr

a = a.withColumn('id', expr("uuid()"))
b = a.select('id')

b.collect()
[Row(id='da301bea-4927-4b6b-a1cf-518dea8705c4')]

a.collect()
[Row(col1=1, col2=2, id='da301bea-4927-4b6b-a1cf-518dea8705c4')]



Answer 2:


Your UUID keeps changing because Spark evaluates DataFrames lazily: each action (collect, show, ...) recomputes the DataFrame from scratch, so the UDF runs again and returns a new value.
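
The recomputation can be mimicked in plain Python. The sketch below is only an analogy, not Spark code: `LazyColumn` is a hypothetical toy class whose `collect` re-runs the generating function unless the values were materialized first. Unlike this toy `cache`, Spark's own caching is itself lazy and only fills on the first action.

```python
import uuid


class LazyColumn:
    """Toy stand-in for a lazily evaluated DataFrame column (not Spark code)."""

    def __init__(self, fn, n_rows):
        self._fn = fn
        self._n_rows = n_rows
        self._cached = None

    def cache(self):
        # Materialize the values once; later collects reuse them.
        # (Eager here for simplicity; Spark fills its cache on the first action.)
        self._cached = [self._fn() for _ in range(self._n_rows)]
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        # No cache: re-run the function, like Spark re-running the plan.
        return [self._fn() for _ in range(self._n_rows)]


def make_id():
    return str(uuid.uuid4())


lazy = LazyColumn(make_id, 1)
first, second = lazy.collect(), lazy.collect()   # two "actions", two different IDs

cached = LazyColumn(make_id, 1).cache()
third, fourth = cached.collect(), cached.collect()  # same ID both times

print(first != second, third == fourth)
```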

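To stabilize the result, call persist() or cache() on the DataFrame so that later actions reuse the stored rows instead of re-running the UDF. cache() is persist() with the default storage level; if the DataFrame is too large for that, pass an explicit storage level to persist().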

# df already has a UDF-generated "uuid" column
df.persist()

df.show()
+---+--------------------+
| id|                uuid|
+---+--------------------+
|  0|e3db115b-6b6a-424...|
+---+--------------------+


b = df.select("uuid")

b.show()
+--------------------+
|                uuid|
+--------------------+
|e3db115b-6b6a-424...|
+--------------------+
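
One caveat: caching is best-effort, and cached partitions that get evicted or lost are recomputed, which would run a random UDF again. If the rows contain values that identify them, one alternative (not part of this answer) is to derive the ID deterministically, so that recomputation yields the same string. The sketch below uses the standard library's uuid.uuid5; the `stable_id` helper, the `NAMESPACE` constant, and the "example.invalid" name are assumptions made up for illustration. Rows with identical inputs would receive identical IDs.

```python
import uuid

# Hypothetical namespace for this illustration; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def stable_id(*values) -> str:
    """Return a name-based (version 5) UUID derived from the given values."""
    # Join with a separator unlikely to appear in the data so that
    # ("ab", "c") and ("a", "bc") do not collapse to the same key.
    key = "\x1f".join(str(v) for v in values)
    return str(uuid.uuid5(NAMESPACE, key))


print(stable_id(1, 2))
print(stable_id(1, 2) == stable_id(1, 2))  # same inputs, same ID

# In PySpark this function could presumably be wrapped as
# udf(stable_id, StringType()) and applied to the identifying columns.
```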


Source: https://stackoverflow.com/questions/59843216/calculate-udf-once
