ddply with fixed number of rows

Submitted by 邮差的信 on 2019-12-19 11:06:39

Question


I want to split my data by number of rows. That is, I want to pass a fixed number of rows to my function at a time, and when I reach the end of the data frame, the last chunk should be passed as is, whether it has the full number of rows or fewer. Something like this:

ddply(df, .(8 rows), .fun=somefunction)

Answer 1:


If you want to use plyr you can add a category column:

library(plyr)

df <- data.frame(x=rnorm(100), y=rnorm(100))

somefunction <- function(df) {
    data.frame(mean(df$x), mean(df$y))
}

df$category <- rep(letters[1:10], each=10)

ddply(df, .(category), somefunction)

But, the apply family might be a better option in this case:

somefunction <- function(n, x, y) {
    # Parenthesize n+9: `n:n+9` parses as `(n:n)+9`, i.e. the single index n+9
    data.frame(mean(x[n:(n+9)]), mean(y[n:(n+9)]))
}

lapply(seq(1, nrow(df), by=10), somefunction, x=df$x, y=df$y)
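
Both snippets above assume the row count is an exact multiple of the chunk size. As a minimal base R sketch of the case the question asks about (a shorter last chunk), one option is to build a chunk index and use split. The names apply_by_chunk and chunk_means are hypothetical helpers made up for this illustration, not part of plyr or the original answer:

```r
# Hypothetical helper, not from the original answers: split a data frame into
# chunks of at most `size` rows and apply `fun` to each chunk.
apply_by_chunk <- function(df, size, fun) {
  # Chunk index per row: 1,1,...,2,2,... ; the last chunk may be shorter.
  chunk_id <- ceiling(seq_len(nrow(df)) / size)
  pieces <- split(df, chunk_id)
  # Bind the per-chunk results back into one data frame.
  do.call(rbind, lapply(pieces, fun))
}

set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Example per-chunk function (an assumption for illustration): column means
# plus the number of rows the chunk actually contained.
chunk_means <- function(d) {
  data.frame(mean_x = mean(d$x), mean_y = mean(d$y), rows = nrow(d))
}

result <- apply_by_chunk(df, 8, chunk_means)
result$rows  # twelve chunks of 8 rows, then a final chunk of 4
```

With 100 rows and a chunk size of 8, this yields 13 result rows, and the last one is computed from the remaining 4 rows only.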



Answer 2:


If speed and brevity are of interest, then for completeness (and using a chunk size of 4 rather than 8 to keep the example short):

require(data.table)
set.seed(0)
DT = data.table(a=rnorm(10))
DT
                 a
 [1,]  1.262954285
 [2,] -0.326233361
 [3,]  1.329799263
 [4,]  1.272429321
 [5,]  0.414641434
 [6,] -1.539950042
 [7,] -0.928567035
 [8,] -0.294720447
 [9,] -0.005767173
[10,]  2.404653389

DT[,list(sum=sum(a),groupsize=.N),by=list(chunk=(0:(nrow(DT)-1))%/%4)]
     chunk       sum groupsize
[1,]     0  3.538950         4
[2,]     1 -2.348596         4
[3,]     2  2.398886         2

Admittedly, that's quite a long statement. It names the columns and also returns the group size, to show that the last chunk really does include just 2 rows, as required.

Once you are comfortable that it's doing the right thing, it can be shortened to this:

DT[,sum(a),by=list(chunk=(0:(nrow(DT)-1))%/%4)]
     chunk        V1
[1,]     0  3.538950
[2,]     1 -2.348596
[3,]     2  2.398886

Notice that you can do on-the-fly aggregations like that; the grouping doesn't need to be added to the data as a column first. If you have a lot of different aggregations in a production script, or just want to interact with the data from the command line, then small productivity differences like this can sometimes help, depending on your workflow.

NB: I picked sum but that could be replaced with somefunction(.SD) or (more likely) just list(exp1,exp2,...) where each exp is any R expression that sees column names as variable names.
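
As a hedged sketch of that note, the same chunked grouping can carry several expressions at once, or hand each chunk to a function through .SD. The column b, the variables chunk_size and chunk_id, the output column names, and the helper summarise_chunk are all assumptions made up for this illustration:

```r
library(data.table)

set.seed(0)
DT <- data.table(a = rnorm(10), b = rnorm(10))

chunk_size <- 4
# Zero-based chunk index per row: 0,0,0,0,1,1,1,1,2,2
chunk_id <- (seq_len(nrow(DT)) - 1) %/% chunk_size

# Several expressions in one list(); each sees column names as variables.
res <- DT[, list(mean_a = mean(a), max_b = max(b), n = .N),
          by = list(chunk = chunk_id)]

# Alternatively, pass the whole chunk (.SD) to a function that returns a list.
summarise_chunk <- function(d) list(mean_a = mean(d$a), rows = nrow(d))
res2 <- DT[, summarise_chunk(.SD), by = list(chunk = chunk_id)]

res$n      # 4 4 2
res2$rows  # 4 4 2
```

Both forms give one row per chunk, and the final chunk is built from the 2 leftover rows.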




Answer 3:


You can define the 8-row ID within the call to ddply.

Not particularly elegant, but using ddply (and head for the example function):

df <- data.frame(x = rnorm(100), y = rnorm(100))
ddply(df, .(row_id = rep(seq(ceiling(nrow(df) / 8)), each = 8)[1:nrow(df)]),
             head, n = 1)
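
To see what that grouping expression produces, here is a small base R check. The variable names n, size, row_id and alt_id are chosen for this illustration, and the integer-division variant is an equivalent added here, not part of the original answer:

```r
n <- 100   # number of rows, as in the example data frame
size <- 8  # chunk size

# Same expression as in the ddply call: repeat each chunk number `size`
# times, then truncate to the number of rows.
row_id <- rep(seq(ceiling(n / size)), each = size)[1:n]

# Equivalent index built with integer division.
alt_id <- (seq_len(n) - 1) %/% size + 1

all(row_id == alt_id)  # TRUE
chunk_sizes <- as.vector(table(row_id))
chunk_sizes            # twelve 8s followed by a 4
```

The truncation with [1:n] is what lets the last group hold only the 4 remaining rows instead of a full 8.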


Source: https://stackoverflow.com/questions/10837258/ddply-with-fixed-number-of-rows
