pyspark dynamic column computation

长发绾君心 2021-01-27 18:20

Below is my Spark data frame:

a b c
1 3 4
2 0 0
4 1 0
2 2 0

My expected output is as below:

a b c
1 3 4
2 0 2
4 1 -1
2 2 3
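
(The rule appears to be a running computation: the first row keeps its c value of 4, and every later row takes the previous row's c, subtracts a, and adds b, so row 2 is 4 - 2 + 0 = 2 and row 3 is 2 - 4 + 1 = -1. Applied to the last row this gives -1 - 2 + 2 = -1 rather than 3, which is what the answer below produces; the 3 above may be an arithmetic slip in the question.)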


        
2 Answers
  •  感动是毒
    2021-01-27 19:14

    Hope this might help!

    import pyspark.sql.functions as f
    from pyspark.sql.window import Window
    
    # Assumes a pyspark shell where sc (the SparkContext) is predefined.
    df = sc.parallelize([
        [1, 3],
        [2, 0],
        [4, 1],
        [2, 2]
    ]).toDF(('a', 'b'))
    
    # Tag each row with an increasing id so the original row order can be
    # recovered when ordering the window.
    df1 = df.withColumn("row_id", f.monotonically_increasing_id())
    w = Window.partitionBy().orderBy(f.col("row_id"))
    
    # The first row gets the seed value 4; every later row contributes (b - a).
    df1 = df1.withColumn("c_temp", f.when(f.col("row_id") == 0, f.lit(4)).otherwise(f.col("b") - f.col("a")))
    
    # A running sum over the ordered window gives c[i] = c[i-1] - a[i] + b[i].
    df1 = df1.withColumn("c", f.sum(f.col("c_temp")).over(w)).drop("c_temp", "row_id")
    df1.show()
    

    Output is:

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  3|  4|
    |  2|  0|  2|
    |  4|  1| -1|
    |  2|  2| -1|
    +---+---+---+
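
    As a minimal alternative sketch (assuming Spark 2.x+ with a SparkSession available as spark): monotonically_increasing_id() is guaranteed to be increasing and unique but not consecutive, so rather than testing row_id == 0 for the first row, this version marks it with row_number() over the same ordering.

    import pyspark.sql.functions as f
    from pyspark.sql.window import Window
    
    df = spark.createDataFrame(
        [(1, 3), (2, 0), (4, 1), (2, 2)], ["a", "b"]
    )
    
    # Keep the incoming row order; the ids are increasing but not
    # necessarily consecutive, so use them only for ordering.
    df1 = df.withColumn("row_id", f.monotonically_increasing_id())
    w = Window.orderBy("row_id")
    
    # row_number() reliably labels the first row as 1, so the seed does
    # not depend on row_id being exactly 0.
    df1 = df1.withColumn("rn", f.row_number().over(w))
    df1 = df1.withColumn(
        "c_step",
        f.when(f.col("rn") == 1, f.lit(4)).otherwise(f.col("b") - f.col("a"))
    )
    
    # Running sum over the ordered window: c[i] = c[i-1] - a[i] + b[i].
    df1 = df1.withColumn("c", f.sum("c_step").over(w)).drop("c_step", "rn", "row_id")
    df1.show()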
    
