Collecting the result of PySpark Dataframe filter into a variable

Submitted on 2019-12-23 02:49:12

Question


I am using a PySpark DataFrame. My dataset contains three attributes: id, name and address. I am trying to delete a row based on its name value. What I've been trying to do is get the unique id of the row I want to delete:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()

The output I am getting is the following: [Row(id='382')]

I am wondering how I can use this id to delete the row. Also, how can I replace a certain value in the DataFrame with another? For example, replacing all values == "Bruce" with "John".


Answer 1:


From the docs for pyspark.sql.DataFrame.collect(), the function:

Returns all the records as a list of Row.

The fields in a pyspark.sql.Row can be accessed like dictionary values.

So for your example:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
#[Row(id='382')]

You can access the id field by doing:

id_vals = [r['id'] for r in ID]
#['382']

But looking up one value at a time is generally a poor fit for Spark DataFrames. You should think about your end goal and see if there's a better way to achieve it.


EDIT

Based on your comments, it seems you want to replace the values in the name column with another value. One way to do this is with pyspark.sql.functions.when().

This function takes a boolean column expression as the first argument. I am using f.col("name") == "Bruce". The second argument is what should be returned if the boolean expression is True. For this example, I am using f.lit(replacement_value).

For example:

import pyspark.sql.functions as f
replacement_value = "Wayne"
df = df.withColumn(
    "name",
    f.when(f.col("name") == "Bruce", f.lit(replacement_value)).otherwise(f.col("name"))
)


Source: https://stackoverflow.com/questions/49428928/collecting-the-result-of-pyspark-dataframe-filter-into-a-variable
