Question
I want to update a column value in the row where userid=22650984. How can I do this in PySpark? Thank you for helping.
>>>xxDF.select('userid','registration_time').filter('userid="22650984"').show(truncate=False)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 0.1 in stage 57.0 (TID 874, shopee-hadoop-slave89, executor 9): TaskKilled (killed intentionally)
18/04/08 10:57:00 WARN TaskSetManager: Lost task 11.1 in stage 57.0 (TID 875, shopee-hadoop-slave97, executor 16): TaskKilled (killed intentionally)
+--------+----------------------------+
|userid |registration_time |
+--------+----------------------------+
|22650984|270972-04-26 13:14:46.345152|
+--------+----------------------------+
Answer 1:
You can use withColumn to achieve what you are looking to do:
new_df = xxDf.filter(xxDf.userid == "22650984").withColumn("field_to_update", <update_expression>)
The update_expression contains your update logic: it could be a UDF, a derived column, and so on. Note that filter uses ==, not =, and withColumn takes the column name as a string.
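To make the behavior of this answer concrete without a Spark cluster, here is a plain-Python sketch of what the filter-then-withColumn pipeline does; the column names follow the question, and the replacement timestamp is a made-up example value:

```python
# Rows shaped like the question's DataFrame (userid, registration_time).
rows = [
    {"userid": "22650984", "registration_time": "270972-04-26 13:14:46.345152"},
    {"userid": "11111111", "registration_time": "2018-01-01 00:00:00"},
]

# filter(userid == "22650984") keeps only the matching rows ...
matched = [r for r in rows if r["userid"] == "22650984"]

# ... and withColumn then overwrites the chosen column on those rows.
# "2018-04-08 00:00:00" stands in for <update_expression>.
updated = [{**r, "registration_time": "2018-04-08 00:00:00"} for r in matched]

print(updated)
```

Note that the non-matching row is gone from the result: this approach returns only the filtered rows, which is the limitation Answer 2 addresses.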
Answer 2:
If you want to modify a subset of your DataFrame and keep the rest unchanged, the best option is pyspark.sql.functions.when(), since using filter() or pyspark.sql.functions.where() would remove all rows where the condition is not met.
from pyspark.sql.functions import col, when

valueWhenTrue = None  # for example

df = df.withColumn(
    "existingColumnToUpdate",
    when(
        col("userid") == 22650984,
        valueWhenTrue
    ).otherwise(col("existingColumnToUpdate"))
)
when() evaluates its first argument as a boolean condition. If the condition is True, it returns the second argument. You can chain together multiple when statements, as shown in this post and also this post, or use otherwise() to specify what to do when the condition is False.
In this example, I am updating an existing column, "existingColumnToUpdate". When the userid is equal to the specified value, the column is updated with valueWhenTrue. Otherwise, the value in the column is kept unchanged.
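The per-row semantics of when(...).otherwise(...) can be illustrated without a Spark session; this is a plain-Python sketch where the names (userid, existingColumnToUpdate, valueWhenTrue) follow the answer and the values are arbitrary placeholders:

```python
# Sample rows standing in for the DataFrame.
rows = [
    {"userid": 22650984, "existingColumnToUpdate": "old"},
    {"userid": 11111111, "existingColumnToUpdate": "keep"},
]

valueWhenTrue = "new"  # placeholder update value

# when(col("userid") == 22650984, valueWhenTrue).otherwise(col(...))
# evaluates row by row: matching rows get the new value, every other
# row keeps its existing column value, and no rows are dropped.
updated = [
    {
        **r,
        "existingColumnToUpdate": valueWhenTrue
        if r["userid"] == 22650984
        else r["existingColumnToUpdate"],
    }
    for r in rows
]

print(updated)
```

Unlike the filter approach in Answer 1, both rows survive; only the matching row's column value changes.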
Source: https://stackoverflow.com/questions/49714413/how-to-modify-one-column-value-in-one-row-used-by-pyspark