PySpark dynamic column computation

长发绾君心 2021-01-27 18:20

Below is my Spark DataFrame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My expected output is below:

a b c
1 3 4
2 0 2
4 1 -1
2 2 3
2 Answers
  •  长发绾君心
    2021-01-27 19:17

    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]
    df = sc.parallelize(numbers).toDF(['a', 'b', 'c'])
    df.show()

    # result = previous row's a (via lag) - current b + current c,
    # computed over the whole frame ordered by column a
    w = Window.partitionBy().orderBy('a')
    df = df.withColumn('result', lag('a').over(w) - df.b + df.c)
    df.show()
    
    
    
    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  2|  3|
    |  2|  3|  4|
    |  3|  4|  5|
    |  5|  6|  7|
    +---+---+---+
    
    +---+---+---+------+
    |  a|  b|  c|result|
    +---+---+---+------+
    |  1|  2|  3|  null|
    |  2|  3|  4|     2|
    |  3|  4|  5|     3|
    |  5|  6|  7|     4|
    +---+---+---+------+
    
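    The lag arithmetic above can be verified without a Spark runtime. A minimal pure-Python sketch of the same computation (`lag_result` is a hypothetical helper; `None` stands in for the null that Spark shows for the first row, which has no lag value):

    ```python
    def lag_result(rows):
        # rows: list of (a, b, c) tuples, already ordered by column a.
        # result[i] = a[i-1] - b[i] + c[i]; the first row has no
        # predecessor, so its result is None (Spark prints null).
        out = []
        prev_a = None
        for a, b, c in rows:
            out.append(None if prev_a is None else prev_a - b + c)
            prev_a = a
        return out

    rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (5, 6, 7)]
    print(lag_result(rows))  # [None, 2, 3, 4]
    ```

    If the null in the first row is unwanted, PySpark's `lag` accepts a default value as its third argument, e.g. `lag('a', 1, 0)`.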
