SparkR window function


Spark 2.0.0+

SparkR provides DSL wrappers for window functions: over, window.partitionBy / partitionBy, window.orderBy / orderBy, and rowsBetween / rangeBetween.
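For illustration, a minimal sketch of the 2.0.0+ DSL, assuming an active SparkSession, a SparkDataFrame sdf with columns x, y and z (as in the 1.6 example below), and that the window constructor shipped under the name windowPartitionBy:

w <- orderBy(windowPartitionBy("y"), "x")           # partition by y, order by x within each partition
head(select(sdf, sdf$x, sdf$y, sdf$z, over(lag(sdf$z), w)))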

Spark <= 1.6

Unfortunately, this is not possible in 1.6.0. While some window functions, including lag, have been implemented, SparkR does not yet support window definitions, which renders these functions unusable.

As long as SPARK-11395 is not resolved the only option is to use raw SQL:

library(magrittr)   # provides the %>% pipe used below
set.seed(1)

hc <- sparkRHive.init(sc)   # HiveContext, Spark <= 1.6 API
sdf <- createDataFrame(hc, data.frame(x=1:12, y=1:3, z=rnorm(12)))
registerTempTable(sdf, "sdf")

sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") %>% 
  head()

##    x y          z        _c3
## 1  1 1 -0.6264538         NA
## 2  4 1  1.5952808 -0.6264538
## 3  7 1  0.4874291  1.5952808
## 4 10 1 -0.3053884  0.4874291
## 5  2 2  0.1836433         NA
## 6  5 2  0.3295078  0.1836433
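The auto-generated column name (_c3 above) can be replaced with an explicit alias in the same raw-SQL query; this minor variation should give the same result with a friendlier column name:

sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) AS z_lag FROM sdf") %>% 
  head()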

Assuming that the corresponding PR is merged without significant changes, the window definition and example query should look as follows:

w <- Window.partitionBy("y") %>% orderBy("x")
select(sdf, over(lag(sdf$z), w))
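(In the SparkR API that ultimately shipped with 2.0.0, the window constructor appears to be named windowPartitionBy rather than Window.partitionBy; see the sketch in the Spark 2.0.0+ section above.)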