Apply custom function to cells of selected columns of a data frame in PySpark

Submitted by 和自甴很熟 on 2021-02-07 03:32:39

Question


Let's say I have a data frame which looks like this:

+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+

I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:

def example(string1, string2):
    # Count the whitespace-separated tokens the two strings have in common
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))

    return intersection_count
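
For reference, applied to plain Python strings the function returns the number of shared tokens; for both sample rows 'address' is the only common token, so each yields 1:

>>> example('address 1.1', 'address 1.2')
1
>>> example('address 2.1', 'address 2.2')
1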

I want to store the result in a new column, so that my final data frame would look like:

+---+-----------+-----------+------+
| id|   address1|   address2|result|
+---+-----------+-----------+------+
|  1|address 1.1|address 1.2|     1|
|  2|address 2.1|address 2.2|     1|
+---+-----------+-----------+------+

I tried to call it the same way I once applied a built-in function to a whole column, but I got an error:

>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable

What am I doing wrong, and how can I apply a custom function to the strings in selected columns?


Answer 1:


You have to wrap the function in a udf (user-defined function) in Spark. Calling example(df.address1, df.address2) directly passes Column objects into the function instead of row values; string1.lower is then resolved as a nested-column reference rather than a string method, and calling it raises TypeError: 'Column' object is not callable. A udf tells Spark to run the Python function row by row on the actual string values:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Wrap the plain Python function; the second argument declares the return type
example_udf = udf(example, LongType())
df = df.withColumn('result', example_udf(df.address1, df.address2))
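
A minimal end-to-end sketch for context (the SparkSession setup and the spark variable name are assumptions for illustration, not part of the original answer):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Assumed setup: a local SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'address 1.1', 'address 1.2'),
     (2, 'address 2.1', 'address 2.2')],
    ['id', 'address1', 'address2'],
)

def example(string1, string2):
    # Count the whitespace-separated tokens the two strings have in common
    return len(set(string1.lower().split(' ')) & set(string2.lower().split(' ')))

example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2)).show()
# +---+-----------+-----------+------+
# | id|   address1|   address2|result|
# +---+-----------+-----------+------+
# |  1|address 1.1|address 1.2|     1|
# |  2|address 2.1|address 2.2|     1|
# +---+-----------+-----------+------+

Note that a plain Python udf ships each row to a Python worker, so for large data a pandas_udf is usually faster; the udf above is the direct answer to the question as asked.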


Source: https://stackoverflow.com/questions/45367824/apply-custom-function-to-cells-of-selected-columns-of-a-data-frame-in-pyspark
