pyspark dynamic column computation

长发绾君心 2021-01-27 18:20

Below is my Spark data frame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My output should be as below:

a b c
1 3 4
2 0 2
4 1 -1
2 2 3
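
(Implied rule, judging from the first answer below: keep the first row's c as a seed, then compute each later row as c = previous c - a + b; e.g. row 2: 4 - 2 + 0 = 2, row 3: 2 - 4 + 1 = -1.)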
2 Answers
  • 2021-01-27 19:14

    Hope this might help! The idea: seed a temporary column with 4 on the first row and with b - a on every other row, then take a running sum in row order, which is exactly the recurrence c = previous c - a + b.

    import pyspark.sql.functions as f
    from pyspark.sql.window import Window
    
    df = sc.parallelize([
        [1, 3],
        [2, 0],
        [4, 1],
        [2, 2]
    ]).toDF(('a', 'b'))
    
    # Attach a stable ordering key so the window has something to order by.
    df1 = df.withColumn("row_id", f.monotonically_increasing_id())
    
    # Empty partitionBy() puts all rows in one partition; fine for small data.
    w = Window.partitionBy().orderBy(f.col("row_id"))
    
    # First row: the seed value 4. Every other row: the per-row increment b - a.
    df1 = df1.withColumn("c_temp", f.when(f.col("row_id") == 0, f.lit(4)).otherwise(f.col("b") - f.col("a")))
    
    # A running sum of the increments reproduces c = previous c - a + b.
    df1 = df1.withColumn("c", f.sum(f.col("c_temp")).over(w)).drop("c_temp", "row_id")
    df1.show()
    
    

    Output is:

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  3|  4|
    |  2|  0|  2|
    |  4|  1| -1|
    |  2|  2| -1|
    +---+---+---+
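
    If you are on a newer PySpark where the SparkSession is the entry point (no `sc` in scope), here is a minimal sketch of the same approach, assuming an active `spark` session:

    import pyspark.sql.functions as f
    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 3), (2, 0), (4, 1), (2, 2)], ["a", "b"])
    
    # For small local data the ids follow insertion order, with 0 on the first row.
    df1 = df.withColumn("row_id", f.monotonically_increasing_id())
    w = Window.partitionBy().orderBy("row_id")
    
    df1 = (df1
           .withColumn("c_temp", f.when(f.col("row_id") == 0, f.lit(4))
                                  .otherwise(f.col("b") - f.col("a")))
           .withColumn("c", f.sum("c_temp").over(w))
           .drop("c_temp", "row_id"))
    df1.show()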
    
  • 2021-01-27 19:17
    You can reference the previous row with lag over an ordered window; here the new column is lag(a) - b + c:

    from pyspark.sql.functions import lag, udf
    from pyspark.sql.types import IntegerType
    from pyspark.sql.window import Window
    
    numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]
    df = sc.parallelize(numbers).toDF(['a', 'b', 'c'])
    df.show()
    
    # Single-partition window ordered by 'a'; lag("a") reads the previous row's a.
    w = Window().partitionBy().orderBy('a')
    
    # A udf would work too, but plain column arithmetic is enough here.
    calculate = udf(lambda a, b, c: a - b + c, IntegerType())
    
    df = df.withColumn('result', lag("a").over(w) - df.b + df.c)
    df.show()
    
    
    
    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  2|  3|
    |  2|  3|  4|
    |  3|  4|  5|
    |  5|  6|  7|
    +---+---+---+
    
    +---+---+---+------+
    |  a|  b|  c|result|
    +---+---+---+------+
    |  1|  2|  3|  null|
    |  2|  3|  4|     2|
    |  3|  4|  5|     3|
    |  5|  6|  7|     4|
    +---+---+---+------+
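
    The first row's result is null because lag has no previous row to read there. A minimal follow-up sketch, assuming you want to seed that row with 0 instead (coalesce and lit are standard pyspark.sql.functions):

    from pyspark.sql.functions import coalesce, lag, lit
    
    # Replace the null produced by lag on the first row with a literal 0.
    df = df.withColumn('result', coalesce(lag("a").over(w), lit(0)) - df.b + df.c)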
    