Pyspark dataframes: Extract a column based on the value of another column

Submitted by 落花浮王杯 on 2020-01-03 03:16:10

Question


I have a dataframe with the following columns and corresponding values (forgive my formatting, but I don't know how to put it in table format):

Src_ip    dst_ip    V1    V2    V3    top
"A"       "B"       xx    yy    zz    "V1"

Now I want to add a column, let's say top_value, which takes the value of the column whose name is stored in top (here "V1"):

Src_ip    dst_ip    V1    V2    V3    top     top_value
"A"       "B"       xx    yy    zz    "V1"    xx

So basically, look up the column named by the value in the column "top" and put that column's value into a new column named "top_value".

I have tried creating UDFs, as well as using the string as an alias, but have been unable to do it. Can anyone please help?


Answer 1:


You can collect the V1, V2 and V3 columns into a struct and pass it to a udf function along with the top column, then extract the value as follows:

Scala

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Look up the struct field whose name matches the value of the top column
def findValueUdf = udf((strct: Row, top: String) => strct.getAs[String](top))

df.withColumn("top_value", findValueUdf(struct("V1", "V2", "V3"), col("top")))

which should give you

+------+------+---+---+---+---+---------+
|Src_ip|dst_ip|V1 |V2 |V3 |top|top_value|
+------+------+---+---+---+---+---------+
|A     |B     |xx |yy |zz |V1 |xx       |
+------+------+---+---+---+---+---------+

PySpark

The equivalent code in PySpark would be:

from pyspark.sql import functions as f
from pyspark.sql import types as t

# Look up the struct field whose name matches the value of the top column
def findValue(strct, top):
    return strct[top]

FVUdf = f.udf(findValue, t.StringType())

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top")))
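For instance, assuming a SparkSession named spark (the session setup and createDataFrame call below are just illustrative scaffolding for the sample row in the question), you can verify the result:

from pyspark.sql import SparkSession

# Hypothetical session; reuses FVUdf and f from the snippet above
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "B", "xx", "yy", "zz", "V1")],
    ["Src_ip", "dst_ip", "V1", "V2", "V3", "top"],
)

df.withColumn("top_value", FVUdf(f.struct("V1", "V2", "V3"), f.col("top"))).show()
# +------+------+---+---+---+---+---------+
# |Src_ip|dst_ip| V1| V2| V3|top|top_value|
# +------+------+---+---+---+---+---------+
# |     A|     B| xx| yy| zz| V1|       xx|
# +------+------+---+---+---+---+---------+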

Moreover, you can define the column names in a list and pass that to the struct function, so that you don't have to hard-code them, as sketched below.
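For example, a minimal sketch of that, assuming the value columns are V1 through V3:

# Keep the candidate column names in one place instead of hard-coding them
value_cols = ["V1", "V2", "V3"]

df.withColumn("top_value", FVUdf(f.struct(*value_cols), f.col("top")))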

I hope the answer is helpful.
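As an aside, if you would rather avoid a udf altogether, a rough sketch of the same lookup using only built-in functions (when plus coalesce, reusing the hypothetical value_cols list from above) could look like:

# Build when(top == "V1", V1), when(top == "V2", V2), ... ;
# each when() yields null unless its condition matches, so
# coalesce picks out the single matching column's value
top_value = f.coalesce(*[f.when(f.col("top") == c, f.col(c)) for c in value_cols])

df.withColumn("top_value", top_value)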



Source: https://stackoverflow.com/questions/50577347/pyspark-dataframes-extract-a-column-based-on-the-value-of-another-column
