I have a table like this,
> head(dt2)
Weight Height Fitted interval limit value
1 65.6 174.0 71.91200 pred lwr 53.73165
2 80.7 193.5 91
Let's say you were starting with data that looked like this:
mydf
# Weight Height Fitted interval limit value
# 1 42 153.4 51.0792 conf lwr 49.15463
# 2 42 153.4 51.0792 pred lwr 32.82122
# 3 42 153.4 51.0792 conf upr 53.00376
# 4 42 153.4 51.0792 pred upr 69.33717
# 5 42 153.4 51.0792 conf lwr 60.00000
# 6 42 153.4 51.0792 pred lwr 90.00000
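Notice that rows 5 and 6 repeat the values of the identifier columns (columns 1 to 5) from rows 1 and 2; only "value" differs. This is essentially what "tidyr" is telling you: rows 1 and 5 share the same identifiers, as do rows 2 and 6, so spread cannot tell which value belongs in the resulting cell.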
tidyr::spread(mydf, limit, value)
# Error: Duplicate identifiers for rows (1, 5), (2, 6)
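If you want to see the offending rows yourself before reshaping, something like the following base R sketch should do it. It rebuilds the sample data inline so it runs on its own; id_cols and is_dup are just illustrative names, not anything "tidyr" provides.

```r
# Inline copy of the sample data so the snippet is self-contained
mydf <- data.frame(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), 3),
  limit    = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Columns that spread() would treat as identifiers (everything except "value")
id_cols <- c("Weight", "Height", "Fitted", "interval", "limit")

# Flag every row whose identifier combination occurs more than once,
# scanning from both ends so the first occurrence is flagged as well
is_dup <- duplicated(mydf[id_cols]) | duplicated(mydf[id_cols], fromLast = TRUE)

dups <- mydf[is_dup, ]
print(dups)
```

The flagged rows are the same pairs named in the error message, which makes it easier to decide how they should be aggregated.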
As suggested by @Jaap, the solution is to first "summarise" the data. Since "tidyr" is only for reshaping data (unlike "reshape2", which both aggregated and reshaped), you need to perform the aggregation with "dplyr" before you change the form of the data. Here, I've done that by using summarise to take the mean of the "value" column.
If you stopped the execution at the summarise step, you would find that the original 6-row dataset had "shrunk" to 4 rows, one per unique combination of identifiers. Now, spread works as expected.
library(dplyr)
library(tidyr)
mydf %>%
  group_by(Weight, Height, Fitted, interval, limit) %>%
  summarise(value = mean(value)) %>%
  spread(limit, value)
# Source: local data frame [2 x 6]
#
# Weight Height Fitted interval lwr upr
# (dbl) (dbl) (dbl) (chr) (dbl) (dbl)
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
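To check the intermediate result described above, you can stop the pipeline right after summarise. This is only a sketch: it rebuilds the sample data inline, stores the result in an illustrative object called step, and assumes dplyr 1.0.0 or later for the .groups argument (drop that argument on older versions).

```r
library(dplyr)

# Inline copy of the sample data so the snippet is self-contained
mydf <- data.frame(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), 3),
  limit    = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# Aggregate only; do not reshape yet
step <- mydf %>%
  group_by(Weight, Height, Fitted, interval, limit) %>%
  summarise(value = mean(value), .groups = "drop")

nrow(mydf)  # 6 rows before aggregation
nrow(step)  # 4 rows after: one per identifier combination
```

Because every identifier combination now appears exactly once, spread has a single value for each cell.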
This matches the expected output from dcast with fun.aggregate = mean.
reshape2::dcast(mydf, Weight + Height + Fitted + interval ~ limit, value.var = "value", fun.aggregate = mean)
# Weight Height Fitted interval lwr upr
# 1 42 153.4 51.0792 conf 54.57731 53.00376
# 2 42 153.4 51.0792 pred 61.41061 69.33717
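As a side note, newer "tidyr" releases offer pivot_wider, which can aggregate while reshaping, much like dcast does. The sketch below assumes tidyr 1.1.0 or later (where, as far as I know, values_fn accepts a single function) and rebuilds the sample data inline; wide is just an illustrative name.

```r
library(tidyr)

# Inline copy of the sample data so the snippet is self-contained
mydf <- data.frame(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), 3),
  limit    = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)

# values_fn is applied whenever several values land in the same cell,
# so the duplicated identifiers are averaged instead of raising an error
wide <- pivot_wider(mydf,
                    names_from  = limit,
                    values_from = value,
                    values_fn   = mean)
print(wide)
```

This is not what the answer above uses, but it gives the same two-row result without a separate summarise step.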
Sample data:
mydf <- structure(list(Weight = c(42, 42, 42, 42, 42, 42), Height = c(153.4,
153.4, 153.4, 153.4, 153.4, 153.4), Fitted = c(51.0792, 51.0792,
51.0792, 51.0792, 51.0792, 51.0792), interval = c("conf", "pred",
"conf", "pred", "conf", "pred"), limit = structure(c(1L, 1L,
2L, 2L, 1L, 1L), .Label = c("lwr", "upr"), class = "factor"),
value = c(49.15463, 32.82122, 53.00376, 69.33717, 60,
90)), .Names = c("Weight", "Height", "Fitted", "interval",
"limit", "value"), row.names = c(NA, 6L), class = "data.frame")
Here are data.table alternatives to dplyr. Use mydf from Ananda's answer.
library(data.table)
library(magrittr)
library(tidyr)
DT <- data.table(mydf)
First, you can use by to compute the mean for each limit.
DT[, .(lwr = mean(value[limit == "lwr"]),
       upr = mean(value[limit == "upr"])),
   by = .(Weight, Height, Fitted, interval)]
If limit == ... looks like too much hard-coding, you can first aggregate in long format, then spread. This works because once you aggregate, no combination of identifiers is duplicated.
DT[, .(value = mean(value)), by = .(Weight, Height, Fitted, interval, limit)] %>%
  spread(key = "limit", value = "value")
Both give you
# Weight Height Fitted interval lwr upr
#1: 42 153.4 51.0792 conf 54.57731 53.00376
#2: 42 153.4 51.0792 pred 61.41061 69.33717
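A third option, if you would rather stay within data.table, is its own dcast method, which aggregates and reshapes in one call. This is a minimal sketch that rebuilds the sample data inline; wide is just an illustrative name.

```r
library(data.table)

# Inline copy of the sample data so the snippet is self-contained
mydf <- data.frame(
  Weight   = 42,
  Height   = 153.4,
  Fitted   = 51.0792,
  interval = rep(c("conf", "pred"), 3),
  limit    = factor(c("lwr", "lwr", "upr", "upr", "lwr", "lwr")),
  value    = c(49.15463, 32.82122, 53.00376, 69.33717, 60, 90)
)
DT <- as.data.table(mydf)

# Left-hand side: identifier columns; right-hand side: column to spread.
# fun.aggregate collapses duplicated identifier combinations with mean()
wide <- dcast(DT, Weight + Height + Fitted + interval ~ limit,
              value.var = "value", fun.aggregate = mean)
print(wide)
```

With this approach neither magrittr nor tidyr is needed.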