Collecting the result of PySpark Dataframe filter into a variable

Submitted on 2019-12-23 02:49:12

Question


I am using a PySpark DataFrame. My dataset contains three attributes: id, name and address. I am trying to delete a row based on its name value. What I've been trying to do is get the unique id of the row I want to delete:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()

The output I am getting is the following: [Row(id='382')]

I am wondering how I can use this id to delete the row. Also, how can I replace a certain value in the DataFrame with another? For example, replacing all values == "Bruce" with "John".


Answer 1:


From the docs for pyspark.sql.DataFrame.collect(), the function:

Returns all the records as a list of Row.

The fields in a pyspark.sql.Row can be accessed like dictionary values.

So for your example:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
#[Row(id='382')]

You can access the id field by doing:

id_vals = [r['id'] for r in ID]
#['382']

But looking up one value at a time is generally a poor fit for Spark DataFrames. You should think about your end goal and see if there's a better way to achieve it.


EDIT

Based on your comments, it seems you want to replace the values in the name column with another value. One way to do this is with pyspark.sql.functions.when().

This function takes a boolean column expression as the first argument. I am using f.col("name") == "Bruce". The second argument is what should be returned if the boolean expression is True. For this example, I am using f.lit(replacement_value).

For example:

import pyspark.sql.functions as f
replacement_value = "Wayne"
df = df.withColumn(
    "name",
    f.when(f.col("name") == "Bruce", f.lit(replacement_value)).otherwise(f.col("name"))
)


Source: https://stackoverflow.com/questions/49428928/collecting-the-result-of-pyspark-dataframe-filter-into-a-variable
