How to divide the value of current row with the following one?

Submitted by 放荡痞女 on 2019-12-07 13:46:15

Question


In Spark SQL version 1.6, using DataFrames, is there a way to calculate, for a specific column, the result of dividing each row's value by the next row's value?

For example, if I have a table with one column, like so

Age
100
50
20
4

I'd like the following output

Fraction
2
2.5
5

The last row is dropped because it has no "next row" to be divided by.

Right now I am doing it by ranking the table and joining it with itself, where one row's rank equals the other's rank plus one. A sketch of that approach is below.
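For illustration only, here is a minimal sketch of that rank-and-self-join approach; the DataFrame name df, the aliases, and the use of row_number are assumptions for the example, not code from the question.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank every row by Age in the order the table is listed (descending here).
// Note: a window without partitionBy moves all rows into a single partition.
val w = Window.orderBy(col("Age").desc)
val ranked = df.withColumn("rank", row_number().over(w))

// Join each row (rank r) with the row that follows it (rank r + 1) and divide.
val result = ranked.as("cur")
  .join(ranked.as("next"), col("cur.rank") + 1 === col("next.rank"))
  .select((col("cur.Age") / col("next.Age")).as("Fraction"))

The inner join drops the last row automatically, since there is no rank + 1 row to match it with.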

Is there a better way to do this? Can this be done with a Window function?


Answer 1:


A Window function does only part of the trick. The remaining part can be done by defining a udf function:

def div = udf((age: Double, lag: Double) => lag/age)

First we need to find the lag using a Window function, and then pass that lag and the age to the udf function to compute the division.

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val dataframe = Seq(
  ("A",100),
  ("A",50),
  ("A",20),
  ("A",4)
).toDF("person", "Age")

val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over(windowSpec))

And finally call the udf function:

newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show

Final output would be

+------+---+
|person|div|
+------+---+
|     A|2.0|
|     A|2.5|
|     A|5.0|
+------+---+

Edited: As @Jacek has suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and to use the / operator, so we don't even need to call the udf function:

newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
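A closely related variant, sketched here only as an illustration and not part of the original answer: since lag looks at the previous row, the same result can also be written with lead, which looks at the following row directly, reusing the dataframe and windowSpec defined above.

// lead(Age, 1) returns the Age of the next row in the window's order,
// so each row is divided by the one that follows it.
val withNext = dataframe.withColumn("next", lead(dataframe("Age"), 1) over(windowSpec))
withNext.na.drop.withColumn("div", withNext("Age") / withNext("next")).drop("Age", "next").show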


Source: https://stackoverflow.com/questions/44392754/how-to-divide-the-value-of-current-row-with-the-following-one
