PySpark - How to use a row value from one column to access another column that has the same name as the row value

猫巷女王i  2020-12-17 05:12

I have a PySpark df:

+---+---+---+---+---+---+---+---+
| id| a1| b1| c1| d1| e1| f1|ref|
+---+---+---+---+---+---+---+---+
|  0|  1| 23|  4|  8|  9|  5| b1|
|  1|  2| 43|  8| 10| 20| 43| e1|
+---+---+---+---+---+---+---+---+

For each row I want to add a new column out that contains the value of the column whose name is stored in ref (for the first row ref is b1, so out should be 23). How can I do this?

2 Answers
  •  北海茫月
    2020-12-17 05:46

    Independent of the Spark version, you can convert to an RDD, map, and convert back to a DataFrame:

    df = spark.createDataFrame(
        [(0, 1, 23, 4, 8, 9, 5, "b1"), (1, 2, 43, 8, 10, 20, 43, "e1")], 
        ("id", "a1", "b1", "c1", "d1", "e1", "f1", "ref")
    )
    
    df.rdd.map(lambda row: row + (row[row.ref], )).toDF(df.columns + ["out"])
    
    +---+---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|out|
    +---+---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1| 23|
    |  1|  2| 43|  8| 10| 20| 43| e1| 20|
    +---+---+---+---+---+---+---+---+---+
    

    You could also preserve the schema:

    from pyspark.sql.types import LongType, StructField
    
    spark.createDataFrame(
        df.rdd.map(lambda row: row + (row[row.ref], )), 
        df.schema.add(StructField("out", LongType())))
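
    To sanity-check that the schema is carried over, you can assign the result and inspect it (a minimal sketch of mine; the variable name result is not from the original code):

    result = spark.createDataFrame(
        df.rdd.map(lambda row: row + (row[row.ref], )),
        df.schema.add(StructField("out", LongType()))
    )

    # The appended field should show up as: |-- out: long (nullable = true)
    result.printSchema()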
    

    With DataFrames you can compose complex Columns. In Spark 1.6:

    from pyspark.sql.functions import array, col, udf
    from pyspark.sql.types import LongType, MapType, StringType
    
    data_cols = [x for x in df.columns if x not in {"id", "ref"}]
    
    # Literal map from column name to index
    name_to_index = udf(
        lambda: {x: i for i, x in enumerate(data_cols)},
        MapType(StringType(), LongType())
    )()
    
    # Array of data
    data_array = array(*[col(c) for c in data_cols])
    df.withColumn("out", data_array[name_to_index[col("ref")]])
    
    +---+---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|out|
    +---+---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1| 23|
    |  1|  2| 43|  8| 10| 20| 43| e1| 20|
    +---+---+---+---+---+---+---+---+---+
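
    If it helps to see what those intermediate objects contain, you can select them next to ref before doing the lookup (my own inspection snippet, not part of the answer):

    df.select(
        col("ref"),
        name_to_index.alias("name_to_index"),  # literal {column name: index} map
        data_array.alias("data_array")         # values of the data columns, in order
    ).show(truncate=False)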
    

    In Spark 2.x you can skip the intermediate objects:

    from pyspark.sql.functions import create_map, lit, col
    from itertools import chain
    
    # Map from column name to column value
    name_to_value = create_map(*chain.from_iterable(
        (lit(c), col(c)) for c in data_cols
    ))
    
    df.withColumn("out", name_to_value[col("ref")])
    
    +---+---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|out|
    +---+---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1| 23|
    |  1|  2| 43|  8| 10| 20| 43| e1| 20|
    +---+---+---+---+---+---+---+---+---+
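
    The same create_map trick generalizes into a small helper if you need this lookup in more than one place (a sketch; the function name lookup_column is mine, not from the answer):

    from pyspark.sql.functions import create_map, lit, col
    from itertools import chain

    def lookup_column(columns, key="ref"):
        # Column that, for each row, picks the value of the column named in `key`
        mapping = create_map(*chain.from_iterable(
            (lit(c), col(c)) for c in columns
        ))
        return mapping[col(key)]

    df.withColumn("out", lookup_column(data_cols))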
    

    Finally, you can use when:

    from pyspark.sql.functions import col, lit, when
    from functools import reduce
    
    out = reduce(
        lambda acc, x: when(col("ref") == x, col(x)).otherwise(acc),
        data_cols,
        lit(None)
    )

    df.withColumn("out", out)
    
    +---+---+---+---+---+---+---+---+---+
    | id| a1| b1| c1| d1| e1| f1|ref|out|
    +---+---+---+---+---+---+---+---+---+
    |  0|  1| 23|  4|  8|  9|  5| b1| 23|
    |  1|  2| 43|  8| 10| 20| 43| e1| 20|
    +---+---+---+---+---+---+---+---+---+
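
    An equivalent way to express the same cascade, if you prefer it to reduce, is coalesce over a list of when expressions (my variant, not part of the original answer; it assumes ref always names one of data_cols):

    from pyspark.sql.functions import coalesce, col, when

    df.withColumn(
        "out",
        coalesce(*[when(col("ref") == c, col(c)) for c in data_cols])
    )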
    
