PySpark dynamic column computation

长发绾君心 2021-01-27 18:20

Below is my Spark DataFrame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My expected output is below:

a b c
1 3 4
2 0 2
4 1 -1
2 2 3
2 Answers
  •  长发绾君心
    2021-01-27 19:17

    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    numbers = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [5, 6, 7]]
    df = sc.parallelize(numbers).toDF(['a', 'b', 'c'])
    df.show()

    # result = previous row's a (via lag) - current b + current c,
    # computed over the whole frame ordered by column a
    w = Window.partitionBy().orderBy('a')
    df = df.withColumn('result', lag('a').over(w) - df.b + df.c)
    df.show()
    
    
    
    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  2|  3|
    |  2|  3|  4|
    |  3|  4|  5|
    |  5|  6|  7|
    +---+---+---+
    
    +---+---+---+------+
    |  a|  b|  c|result|
    +---+---+---+------+
    |  1|  2|  3|  null|
    |  2|  3|  4|     2|
    |  3|  4|  5|     3|
    |  5|  6|  7|     4|
    +---+---+---+------+
    
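    The lag arithmetic above can be verified without a Spark runtime. A minimal pure-Python sketch of the same computation (`lag_result` is a hypothetical helper; `None` stands in for the null that Spark shows for the first row, which has no lag value):

    ```python
    def lag_result(rows):
        # rows: list of (a, b, c) tuples, already ordered by column a.
        # result[i] = a[i-1] - b[i] + c[i]; the first row has no
        # predecessor, so its result is None (Spark prints null).
        out = []
        prev_a = None
        for a, b, c in rows:
            out.append(None if prev_a is None else prev_a - b + c)
            prev_a = a
        return out

    rows = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (5, 6, 7)]
    print(lag_result(rows))  # [None, 2, 3, 4]
    ```

    If the null in the first row is unwanted, PySpark's `lag` accepts a default value as its third argument, e.g. `lag('a', 1, 0)`.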
