pySpark withColumn with a function

社会主义新天地 submitted on 2020-12-13 18:49:53

Question


I have a DataFrame with 2 columns, account_id and email_address. Now I want to add one more column, updated_email_address, by calling a function on email_address to compute the updated value. Here is my code:

def update_email(email):
  print("== email to be updated: " + email)
  today = datetime.date.today()
  updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
  return updated

df.withColumn('updated_email_address', update_email(df.email_address))

but the result showed 'updated_email_address' column as null:

+---------------+--------------+---------------------+
|account_id     |email_address |updated_email_address|
+---------------+--------------+---------------------+
|123456gd7tuhha |abc@test.com  |null                 |
|djasevneuagsj1 |cde@test.com  |null                 |
+---------------+--------------+---------------------+

Inside the function update_email it printed out:

Column<b'(email_address + == email to be updated: )'>

It also showed the df's column data types as:

dfData:pyspark.sql.dataframe.DataFrame
account_id:string
email_address:string
updated_email_address:double

Why is the updated_email_address column of type double?
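For reference, the per-row string logic the question is after can be written as a plain Python function (a sketch: `update_email_py` is a hypothetical name, and Spark's `substring(email, -8, 8)` is replaced by ordinary slicing, which takes the last 8 characters):

```python
import datetime

def update_email_py(email):
    # Last 8 characters of the address, then today's zero-padded
    # month and day, then the "_updated" suffix
    today = datetime.date.today()
    return email[-8:] + today.strftime('%m') + today.strftime('%d') + "_updated"

print(update_email_py("abc@test.com"))  # e.g. "test.com0413_updated" on April 13
```

This is what each row's value should look like; the answers below cover how to apply such logic to a Column.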


Answer 1:


You're calling a plain Python function on a Column. Inside it, substring builds a Column expression and + is treated as arithmetic addition, which Spark resolves to double — hence the double column type and the null results. You have to create a udf from update_email and then use it:

update_email_udf = udf(update_email)

However, I'd suggest not using a UDF for such a transformation; you can do it using only Spark built-in functions (UDFs are known for poor performance):

from pyspark.sql.functions import col, concat, current_date, date_format, lit, substring

df.withColumn('updated_email_address',
              concat(substring(col("email_address"), -8, 8),
                     date_format(current_date(), "MMdd"),  # month then day, as in the original function
                     lit("_updated"))
             ).show()

You can find here all Spark SQL built-in functions.
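One detail worth checking: Spark's date_format uses Java-style pattern letters, where MM is the zero-padded month and dd the zero-padded day — the same digits as strftime's %m and %d in the question's Python function, so make sure the order matches the month-then-day concatenation the question builds. A quick plain-Python check of what that fragment produces:

```python
import datetime

# date_format(current_date(), "MMdd") in Spark corresponds to
# strftime('%m%d') in Python: four digits, month followed by day
mmdd = datetime.date.today().strftime('%m%d')
print(mmdd)
```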




Answer 2:


Well, thanks to you I got to relearn something I had forgotten from my Spark class.

You can't call your custom functions directly with withColumn; you need to use a user-defined function (UDF).

Here is a quick example of how I got a custom function to work with your DataFrame (StringType is the return type of the function):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def update_email(email):
    return email + "aaaa"

my_udf = udf(lambda x: update_email(x), StringType())

df.withColumn('updated_email_address', my_udf(df.email_address)).show()


Source: https://stackoverflow.com/questions/59317300/pyspark-withcolumn-with-a-function
