SparkR: dplyr-style split-apply-combine on DataFrame

你。 submitted on 2019-12-21 06:05:28

Question


Under the previous RDD paradigm, I could specify a key and then map an operation to RDD elements corresponding to each key. I don't see a clear way to do this with DataFrame in SparkR as of 1.5.1. What I would like to do is something like a dplyr operation:

new.df <- old.df %>%
  group_by("column1") %>%
  do(myfunc(.))

I currently have a large SparkR DataFrame of the form:

            timestamp  value  id
2015-09-01 05:00:00.0  1.132  24
2015-09-01 05:10:00.0  null   24
2015-09-01 05:20:00.0  1.129  24
2015-09-01 05:00:00.0  1.131  47
2015-09-01 05:10:00.0  1.132  47
2015-09-01 05:10:00.0  null   47

I've sorted by id and timestamp.

I want to group by id, but I don't want to aggregate. Instead I want to do a set of transformations and computations on each group -- for example, interpolating to fill NAs (which are generated when I collect the DataFrame and then convert value to numeric). I've tested using agg, but while my computations do appear to run, the results aren't returned, because I'm not returning a single value in myfunc:

library(zoo)

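myfunc <- function(df) {
  # Pull the group's rows into a local R data.frame
  df.loc <- collect(df)
  # value arrives as character; the "null" entries become NA here
  df.loc$value <- as.numeric(df.loc$value)
  # Linearly interpolate interior NAs; leading/trailing NAs are kept
  df.loc$newparam <- na.approx(df.loc$value, na.rm = FALSE)
  return(df.loc)
  # I also tested return(createDataFrame(sqlContext, df.loc)) here
}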

df <- read.df(...)  # some stuff

grp <- group_by(df, "id")

test <- agg(grp, "myfunc")

15/11/11 18:45:33 INFO scheduler.DAGScheduler: Job 2 finished: dfToCols at NativeMethodAccessorImpl.java:-2, took 0.463131 s
   id
1  24
2  47

Note that the operations in myfunc all work correctly when I filter the DataFrame down to a single id and run it. Based on the time it takes to run (about 50 seconds per task) and the fact that no exceptions are thrown, I believe myfunc is indeed being run on all of the ids -- but I need the output!
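
To make the intended output concrete, here is a minimal local sketch of the split-apply-combine result in base R. It is not a SparkR solution and does not run distributed. It assumes the sample rows above have already been collected into a plain data.frame with value converted to numeric. The helper name interpolate_group is hypothetical, and stats::approx stands in for zoo::na.approx only to keep the example free of dependencies (interpolation is by row position within each group, and trailing NAs stay NA).

```r
# Local stand-in for the collected DataFrame, mirroring the sample rows above
old.df <- data.frame(
  timestamp = c("2015-09-01 05:00:00", "2015-09-01 05:10:00", "2015-09-01 05:20:00",
                "2015-09-01 05:00:00", "2015-09-01 05:10:00", "2015-09-01 05:10:00"),
  value = c(1.132, NA, 1.129, 1.131, 1.132, NA),
  id = c(24, 24, 24, 47, 47, 47),
  stringsAsFactors = FALSE
)

# Hypothetical per-group function: returns every row of the group plus a new
# column, rather than collapsing the group to a single aggregated value
interpolate_group <- function(g) {
  idx <- seq_len(nrow(g))
  ok <- !is.na(g$value)
  g$newparam <- if (sum(ok) >= 2) {
    # Linear interpolation over row positions; points outside the observed
    # range are left as NA (approx's default rule = 1)
    approx(x = idx[ok], y = g$value[ok], xout = idx)$y
  } else {
    # Too few observations to interpolate: keep the values unchanged
    g$value
  }
  g
}

pieces <- split(old.df, old.df$id)        # split by key
results <- lapply(pieces, interpolate_group)  # apply to each group
new.df <- do.call(rbind, results)         # combine back into one data.frame
rownames(new.df) <- NULL
print(new.df)
```

The shape of new.df (all six rows, with one extra column) is what the grouped operation on the SparkR DataFrame would ideally return.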

Any input would be appreciated.

Source: https://stackoverflow.com/questions/33657974/sparkr-dplyr-style-split-apply-combine-on-dataframe
