Question
I have a data table in which each value of the variables v1 and v2 has an associated "type", coded in a separate column. Here is an MWE:
library(data.table)
X <- data.table(
  id      = 1:5,
  group   = c(1, 1, 2, 2, 2),
  v1      = c(10, 12, 14, 16, 18),
  type_v1 = c("t1", "t2", "t1", "t1", "t2"),
  v2      = c(3, NA, NA, 7, 8),
  type_v2 = c("t2", "", "", "t3", "t3")
)
print(X)
   id group v1 type_v1 v2 type_v2
1:  1     1 10      t1  3      t2
2:  2     1 12      t2 NA
3:  3     2 14      t1 NA
4:  4     2 16      t1  7      t3
5:  5     2 18      t2  8      t3
I want to sum up the values in columns v1 and v2 for each type, by the variable group. The desired output is:
   group v1 type_v1 v2 type_v2 v3 type_v3
1:     1 10      t1 15      t2 NA
2:     2 30      t1 18      t2 15      t3
There are a lot of different "types", and not all types occur in all groups. I may need to create variables v3, v4, etc. (note how in my example an extra column appeared to accommodate t1, t2, and t3 in group 2).
My data is currently in the long format. I would prefer not to reshape it to the wide format if possible. I am interested in solutions that do not involve creating columns "t1", "t2", etc., because "t1", "t2" and "t3" are actually very long strings.
Edit: typo in desired output
Answer 1:
You can melt your data to long format:
library(data.table)
X1 <- melt(
  X,
  id.vars = "group",
  # melt two sets of columns simultaneously:
  # those named "v" followed by one or more digits, and
  # those named "type_v" followed by one or more digits
  measure.vars = patterns("^v\\d+$", "^type_v\\d+$"),
  value.name = c("value", "type")
)
X1
#     group variable value type
#  1:     1        1    10   t1
#  2:     1        1    12   t2
#  3:     2        1    14   t1
#  4:     2        1    16   t1
#  5:     2        1    18   t2
#  6:     1        2     3   t2
#  7:     1        2    NA
#  8:     2        2    NA
#  9:     2        2     7   t3
# 10:     2        2     8   t3
Aggregate the data by group and type, excluding rows whose 'type' is an empty string:
tmp <- X1[type != "", .(v = sum(value)), by = .(group, type)]
tmp
#    group type  v
# 1:     1   t1 10
# 2:     1   t2 15
# 3:     2   t1 30
# 4:     2   t2 18
# 5:     2   t3 15
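One thing to watch in the aggregation step: in the example every NA value happens to carry an empty type, so the type != "" filter removes it before summing. If your real data can have an NA value with a non-empty type, sum() will return NA for that group/type pair. A minimal sketch on made-up data (the table `long` and its values are hypothetical, not taken from the question) shows the difference that na.rm = TRUE makes:

```r
library(data.table)

# Hypothetical long-format data: one typed value is missing (NA).
long <- data.table(
  group = c(1, 1, 2),
  value = c(10, NA, 5),
  type  = c("t1", "t1", "t1")
)

# Default sum(): a single NA makes the whole group/type total NA.
strict <- long[type != "", .(v = sum(value)), by = .(group, type)]

# With na.rm = TRUE the missing value is skipped instead.
lenient <- long[type != "", .(v = sum(value, na.rm = TRUE)), by = .(group, type)]
```

Whether skipping missing values is appropriate depends on what an NA means in your data, so treat this as an option rather than a fix.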
Finally, reshape to wide format again:
out <- dcast(tmp, group ~ rowid(group), value.var = c("v", "type"))
out
#    group v_1 v_2 v_3 type_1 type_2 type_3
# 1:     1  10  15  NA     t1     t2   <NA>
# 2:     2  30  18  15     t1     t2     t3
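The right-hand side of the dcast formula, rowid(group), is what avoids creating one column per type name. As a small illustration (the vector `grp` below just copies the group column of tmp), rowid() numbers the rows within each distinct value:

```r
library(data.table)

# The 'group' column of tmp from the aggregation step.
grp <- c(1, 1, 2, 2, 2)

# rowid() counts occurrences within each distinct value:
# group 1 gets 1, 2 and the count restarts at 1 for group 2.
slot <- rowid(grp)
slot
# [1] 1 2 1 2 3
```

So the suffixes _1, _2, _3 are positions within a group, not fixed types: type_1 holds whichever type appears first in that group, and the number of v_k / type_k pairs equals the largest number of distinct types found in any single group.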
If you need the column order to be v_1 | type_1 | v_2 ... you can use setcolorder:
tmp2 <- setdiff(names(out), "group")
# create a vector based on the order of the numeric part of 'tmp2'
idx <- order(as.numeric(gsub("\\D", "", tmp2)))
setcolorder(out, c("group", tmp2[idx]))
out
#    group v_1 type_1 v_2 type_2 v_3 type_3
# 1:     1  10     t1  15     t2  NA   <NA>
# 2:     2  30     t1  18     t2  15     t3
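To see why the reordering interleaves the columns correctly, here is the same gsub/order logic on a plain character vector (the vector `cols` is just the column names from this example, written out by hand):

```r
# Column names as produced by dcast() in this example, minus "group".
cols <- c("v_1", "v_2", "v_3", "type_1", "type_2", "type_3")

# Strip every non-digit character, leaving only the numeric suffix.
suffix <- as.numeric(gsub("\\D", "", cols))
suffix
# [1] 1 2 3 1 2 3

# order() leaves ties in their original position, so "v_k" stays ahead of "type_k".
reordered <- cols[order(suffix)]
reordered
# [1] "v_1"    "type_1" "v_2"    "type_2" "v_3"    "type_3"
```

Converting to numeric means a suffix of 10 sorts after 2, which plain string sorting would not do. The approach does assume the only digits in each name are the suffix; a value column called, say, "v2" would break it.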
Source: https://stackoverflow.com/questions/64109789/a-complicated-sum-in-r-data-table-that-involves-looking-at-other-columns