pyspark dynamic column computation

长发绾君心 2021-01-27 18:20

Below is my Spark data frame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My expected output is as below:

a b c
1 3 4
2 0 2
4 1 -1
2 2 3
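
(The rule appears to be a running computation: the first row keeps its c value of 4, and every later row takes the previous row's c, subtracts a, and adds b, so row 2 is 4 - 2 + 0 = 2 and row 3 is 2 - 4 + 1 = -1. Applied to the last row this gives -1 - 2 + 2 = -1 rather than 3, which is what the answer below produces; the 3 above may be an arithmetic slip in the question.)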


        
2 Answers
  •  感动是毒
    2021-01-27 19:14

    Hope this might help!

    import pyspark.sql.functions as f
    from pyspark.sql.window import Window
    
    # Assumes a pyspark shell where sc (the SparkContext) is predefined.
    df = sc.parallelize([
        [1, 3],
        [2, 0],
        [4, 1],
        [2, 2]
    ]).toDF(('a', 'b'))
    
    # Tag each row with an increasing id so the original row order can be
    # recovered when ordering the window.
    df1 = df.withColumn("row_id", f.monotonically_increasing_id())
    w = Window.partitionBy().orderBy(f.col("row_id"))
    
    # The first row gets the seed value 4; every later row contributes (b - a).
    df1 = df1.withColumn("c_temp", f.when(f.col("row_id") == 0, f.lit(4)).otherwise(f.col("b") - f.col("a")))
    
    # A running sum over the ordered window gives c[i] = c[i-1] - a[i] + b[i].
    df1 = df1.withColumn("c", f.sum(f.col("c_temp")).over(w)).drop("c_temp", "row_id")
    df1.show()
    

    Output is:

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  3|  4|
    |  2|  0|  2|
    |  4|  1| -1|
    |  2|  2| -1|
    +---+---+---+
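
    As a minimal alternative sketch (assuming Spark 2.x+ with a SparkSession available as spark): monotonically_increasing_id() is guaranteed to be increasing and unique but not consecutive, so rather than testing row_id == 0 for the first row, this version marks it with row_number() over the same ordering.

    import pyspark.sql.functions as f
    from pyspark.sql.window import Window
    
    df = spark.createDataFrame(
        [(1, 3), (2, 0), (4, 1), (2, 2)], ["a", "b"]
    )
    
    # Keep the incoming row order; the ids are increasing but not
    # necessarily consecutive, so use them only for ordering.
    df1 = df.withColumn("row_id", f.monotonically_increasing_id())
    w = Window.orderBy("row_id")
    
    # row_number() reliably labels the first row as 1, so the seed does
    # not depend on row_id being exactly 0.
    df1 = df1.withColumn("rn", f.row_number().over(w))
    df1 = df1.withColumn(
        "c_step",
        f.when(f.col("rn") == 1, f.lit(4)).otherwise(f.col("b") - f.col("a"))
    )
    
    # Running sum over the ordered window: c[i] = c[i-1] - a[i] + b[i].
    df1 = df1.withColumn("c", f.sum("c_step").over(w)).drop("c_step", "rn", "row_id")
    df1.show()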
    
