Question
I have a data table where each value of the variables v1 and v2 has an associated "type", coded in a separate column. Here is an MWE:
library(data.table)
X <- data.table(id = 1:5, group = c(1,1,2,2,2), v1 = c(10,12,14,16,18), type_v1 = c("t1","t2","t1","t1","t2"), v2 = c(3,NA,NA,7,8), type_v2 = c("t2", "", "", "t3","t3"))
print(X)
   id group v1 type_v1 v2 type_v2
1:  1     1 10      t1  3      t2
2:  2     1 12      t2 NA
3:  3     2 14      t1 NA
4:  4     2 16      t1  7      t3
5:  5     2 18      t2  8      t3
I want to sum up the values in columns v1 and v2 for each type by the variable group. The desired output is:
   group v1 type_v1 v2 type_v2 v3 type_v3
1:     1 10      t1 15      t2 NA
2:     2 30      t1 18      t2 15      t3
There are a lot of different "types", and not all types occur in all groups. I may need to create variables v3, v4, etc. (note how in my example an extra column appeared to accommodate the t1, t2, and t3 in group 2).
My data is currently in the long format, and I would prefer not to reshape it to the wide format if possible. I am interested in solutions that do not involve creating columns named "t1", "t2", etc., because "t1", "t2" and "t3" are actually very long strings.
Edit: typo in desired output
Answer 1:
You can melt your data to long format:
library(data.table)
X1 <-
  melt(
    X,
    id.vars = "group",
    # melt two sets of measure columns simultaneously:
    # those named "v" followed by one or more digits, and
    # those named "type_v" followed by one or more digits
    measure.vars = patterns(c("^v\\d+$", "^type_v\\d+$")),
    value.name = c("value", "type")
  )
X1
#     group variable value type
#  1:     1        1    10   t1
#  2:     1        1    12   t2
#  3:     2        1    14   t1
#  4:     2        1    16   t1
#  5:     2        1    18   t2
#  6:     1        2     3   t2
#  7:     1        2    NA
#  8:     2        2    NA
#  9:     2        2     7   t3
# 10:     2        2     8   t3
Aggregate the data, excluding rows where the 'type' column is an empty string:
tmp <- X1[type != "", .(v = sum(value)), by = .(group, type)]
tmp
#    group type  v
# 1:     1   t1 10
# 2:     1   t2 15
# 3:     2   t1 30
# 4:     2   t2 18
# 5:     2   t3 15
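One thing to watch in this step: sum() returns NA as soon as any value in a group is NA. In the example above that does not matter, because the only NA values sit in rows with an empty type, and those rows are filtered out. If your real data can contain a missing value with a non-empty type, a variant with na.rm = TRUE might be what you want. Here is a minimal sketch; the table `long` and its contents are made up for illustration and are not part of the question's data:

```r
library(data.table)

# Hypothetical long-format data: one t1 value in group 1 is missing.
long <- data.table(
  group = c(1, 1, 2),
  type  = c("t1", "t1", "t2"),
  value = c(10, NA, 5)
)

# Same aggregation as in the answer: an NA in a group makes its sum NA.
strict <- long[type != "", .(v = sum(value)), by = .(group, type)]

# Variant that drops NA values before summing.
with_rm <- long[type != "", .(v = sum(value, na.rm = TRUE)), by = .(group, type)]

print(strict)
print(with_rm)
```

Which behaviour is right depends on whether a missing value should invalidate the whole group total or simply be skipped.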
Finally, reshape to wide format again:
out <- dcast(tmp, group ~ rowid(group), value.var = c("v", "type"))
out
#    group v_1 v_2 v_3 type_1 type_2 type_3
# 1:     1  10  15  NA     t1     t2   <NA>
# 2:     2  30  18  15     t1     t2     t3
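In case the rowid(group) part of the dcast formula is unclear: rowid() numbers the rows within each group, and dcast then creates one column per number, so the long type strings never become column names. A small sketch of that idea, using a toy table (`toy` and the helper column `slot` are names I made up here) that mirrors the aggregated result:

```r
library(data.table)

# Toy long-format table mirroring the aggregated result.
toy <- data.table(
  group = c(1, 1, 2, 2, 2),
  type  = c("t1", "t2", "t1", "t2", "t3"),
  v     = c(10, 15, 30, 18, 15)
)

# rowid() counts rows within each value of `group`, starting at 1.
toy[, slot := rowid(group)]
print(toy)

# Casting on `slot` gives one v_* and one type_* column per
# within-group position; groups with fewer types are padded with NA.
wide <- dcast(toy, group ~ slot, value.var = c("v", "type"))
print(wide)
```

Because the columns are numbered by position rather than by type, v_1 can hold a different type in different groups; the matching type_* column tells you which one it is.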
If you need the column order to be v_1 | type_1 | v_2 | ..., you can use setcolorder:
tmp2 <- setdiff(names(out), "group")
# create a vector based on the order of the numeric part of 'tmp2'
idx <- order(as.numeric(gsub("\\D", "", tmp2)))
setcolorder(out, c("group", tmp2[idx]))
out
#    group v_1 type_1 v_2 type_2 v_3 type_3
# 1:     1  10     t1  15     t2  NA   <NA>
# 2:     2  30     t1  18     t2  15     t3
Source: https://stackoverflow.com/questions/64109789/a-complicated-sum-in-r-data-table-that-involves-looking-at-other-columns