Question
I have a dataframe with the following columns and corresponding values:
+------+------+---+---+---+---+
|Src_ip|dst_ip|V1 |V2 |V3 |top|
+------+------+---+---+---+---+
|A     |B     |xx |yy |zz |V1 |
+------+------+---+---+---+---+
Now I want to add a column, let's say top_value, which takes the value of the column named by the string in the top column (here "V1"):
+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A     |B     |xx |yy |zz |V1 |xx       |
+------+------+---+---+---+---+---------+
So basically: look up the value of the column whose name is stored in the "top" column, and put it in a new column named "top_value".
I have tried creating UDFs as well as using the string as an alias, but have been unable to get it to work. Can anyone please help?
Answer 1:
You can collect the V1, V2 and V3 columns as a struct and pass it to a udf function together with the top column, then extract the matching value.
Scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// look up the struct field whose name is held in the top column
def findValueUdf = udf((strct: Row, top: String) => strct.getAs[String](top))

df.withColumn("top_value", findValueUdf(struct("V1", "V2", "V3"), col("top")))
which should give you
+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A |B |xx |yy |zz |V1 |xx |
+------+------+---+---+---+---+---------+
PySpark
The equivalent code in PySpark would be:
from pyspark.sql import functions as f
from pyspark.sql import types as t

# look up the struct field whose name is held in the top column
def findValueUdf(strct, top):
    return strct[top]

FVUdf = f.udf(findValueUdf, t.StringType())

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top")))
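To try it end to end, here is a minimal sketch that builds the one-row example from the question (the DataFrame construction and the SparkSession are my own scaffolding for illustration, not part of the original answer):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one-row sample matching the question's data (illustrative only)
df = spark.createDataFrame(
    [("A", "B", "xx", "yy", "zz", "V1")],
    ["Src_ip", "dst_ip", "V1", "V2", "V3", "top"],
)

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top"))).show()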
Moreover, you can define the column names in a list to be used in the struct function so that you don't have to hard-code them, as sketched below.
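A minimal sketch of that idea, reusing the FVUdf defined above (the list name value_cols is my own choice):
# hypothetical list of candidate columns; extend as needed
value_cols = ["V1", "V2", "V3"]

df.withColumn("top_value", FVUdf(f.struct(*value_cols), f.col("top")))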
I hope the answer is helpful.
Source: https://stackoverflow.com/questions/50577347/pyspark-dataframes-extract-a-column-based-on-the-value-of-another-column