I have a large data.frame (20000+ entries) in this format:
id D1 D2
1 0.40 0.21
1 0.00 0.00
1 0.53 0.20
2 0.17 0.17
2 0.25 0.25
2 0.55 0.43
Where each id may be duplicated 3-20 times. I would like to merge the duplicated rows into new columns, so my new data.frame looks like:
id D1 D2 D3 D4 D5 D6
1 0.40 0.21 0.00 0.00 0.53 0.20
2 0.17 0.17 0.25 0.25 0.55 0.43
I've manipulated data.frames before with plyr, but I'm not sure how to approach this problem. Any help would be appreciated. Thanks.
The best option would be to just use melt and dcast from "reshape2". But before we jump to that option, let's see what else we have available to us:
You mention that the number of rows per "id" varies (3-20 each), so the data are unbalanced. That makes it somewhat difficult to put into a tidy rectangular data.frame.
Here are a few examples.
Balanced data: Three rows per "id"
mydf <- structure(list(id = c(1, 1, 1, 2, 2, 2),
                       D1 = c(0.4, 0, 0.53, 0.17, 0.25, 0.55),
                       D2 = c(0.21, 0, 0.2, 0.17, 0.25, 0.43)),
                  .Names = c("id", "D1", "D2"), row.names = c(NA, 6L),
                  class = "data.frame")
mydf
# id D1 D2
# 1 1 0.40 0.21
# 2 1 0.00 0.00
# 3 1 0.53 0.20
# 4 2 0.17 0.17
# 5 2 0.25 0.25
# 6 2 0.55 0.43
With such data, you can just use aggregate:
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# id D1.1 D1.2 D1.3 D2.1 D2.2 D2.3
# 1 1 0.40 0.00 0.53 0.21 0.00 0.20
# 2 2 0.17 0.25 0.55 0.17 0.25 0.43
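As a quick sanity check before reaching for aggregate, you can count the rows per "id" with table; the aggregate approach above only works when every id appears the same number of times. (This check is a convenience I'm adding here, not part of the reshape itself; it assumes the balanced mydf defined above.)

```r
## Count rows per id; `aggregate(. ~ id, ...)` with `as.vector`
## only works when these counts are all equal.
table(mydf$id)
#
# 1 2
# 3 3
length(unique(as.vector(table(mydf$id)))) == 1  # TRUE means balanced
```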
Unbalanced data: Some workarounds
If you add a fourth row of values for "id = 2", aggregate won't work anymore:
mydf[7, ] <- c(2, .44, .33)
do.call(data.frame, aggregate(. ~ id, mydf, as.vector))
# Error in data.frame(`0` = c(0.4, 0, 0.53), `1` = c(0.17, 0.25, 0.55, 0.44 :
# arguments imply differing number of rows: 3, 4
It might be best to just have a list of the resulting vectors:
lapply(split(mydf[-1], mydf[[1]]), function(x) unlist(x, use.names=FALSE))
# $`1`
# [1] 0.40 0.00 0.53 0.21 0.00 0.20
#
# $`2`
# [1] 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
#
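As a side note, if you keep the list form, base R's lengths() (R >= 3.2.0; use sapply(out, length) on older versions) tells you how many values each id contributed:

```r
## Assumes `mydf` already has the extra row for id = 2 (added above)
out <- lapply(split(mydf[-1], mydf[[1]]),
              function(x) unlist(x, use.names = FALSE))
lengths(out)
# 1 2
# 6 8
```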
Or, if you insist on a rectangular data.frame, explore one of the several tools for rbinding unbalanced data, for example rbind.fill from "plyr":
library(plyr)
rbind.fill(lapply(split(mydf[-1], mydf[[1]]),
                  function(x) data.frame(t(unlist(x, use.names=FALSE)))))
# X1 X2 X3 X4 X5 X6 X7 X8
# 1 0.40 0.00 0.53 0.21 0.00 0.20 NA NA
# 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
Unbalanced data: A more direct approach
Alternatively, you can use melt and dcast from "reshape2" as follows:
library(reshape2)
x <- melt(mydf, id.vars = "id")
## ^^ That's not enough information for `dcast`
## We need a "time" variable too, so use `ave`
## to create one according to the number of
## values per ID.
x$time <- ave(x$id, x$id, FUN = seq_along)
## ^^ I would probably actually stop at this point.
## Long data with proper ID and "time" values
## tend to be easier to work with and many
## other functions in R work more nicely with
## this long data format.
dcast(x, id ~ time, value.var = "value")
# id 1 2 3 4 5 6 7 8
# 1 1 0.40 0.00 0.53 0.21 0.00 0.20 NA NA
# 2 2 0.17 0.25 0.55 0.44 0.17 0.25 0.43 0.33
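For completeness, here's a sketch of the same melt/ave/dcast pipeline written with "tidyr" and "dplyr" instead of "reshape2" (assumes tidyr >= 1.0.0 for pivot_longer/pivot_wider). The arrange step mimics melt's column-major stacking (all of D1, then all of D2), so the result matches the dcast output above:

```r
library(dplyr)
library(tidyr)

mydf %>%
  pivot_longer(-id, names_to = "variable", values_to = "value") %>%
  arrange(id, variable) %>%            # mimic melt's stacking order
  group_by(id) %>%
  mutate(time = seq_along(value)) %>%  # same role as the `ave` call above
  ungroup() %>%
  pivot_wider(id_cols = id, names_from = time, values_from = value)
```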
Source: https://stackoverflow.com/questions/17894228/transpose-duplicated-rows-to-column-in-r