A complicated sum in R data.table that involves looking at other columns

Submitted by 空扰寡人 on 2020-12-30 02:20:04

Question


I have a data table where each value of the variables v1 and v2 has an associated "type", coded in a separate column. Here is an MWE:

X <- data.table(id = 1:5, group = c(1,1,2,2,2), v1 = c(10,12,14,16,18), type_v1 = c("t1","t2","t1","t1","t2"), v2 = c(3,NA,NA,7,8), type_v2 = c("t2", "", "", "t3","t3"))
print(X)
   id group v1 type_v1 v2 type_v2
1:  1     1 10      t1  3      t2
2:  2     1 12      t2 NA        
3:  3     2 14      t1 NA        
4:  4     2 16      t1  7      t3
5:  5     2 18      t2  8      t3

I want to sum the values in columns v1 and v2 for each type, grouped by the variable group. The desired output is:

   group v1 type_v1 v2 type_v2 v3 type_v3
1:     1 10      t1 15      t2 NA
2:     2 30      t1 18      t2 15      t3

There are a lot of different "types", and not all types occur in all groups. I may need to create variables v3, v4, etc. (note how in my example an extra column appeared to accommodate the t1, t2, and t3 in group 2).

My data is currently in the long format, and I would prefer not to reshape it to the wide format if possible. I am interested in solutions that do not involve creating columns named "t1", "t2", etc., because "t1", "t2" and "t3" are actually very long strings.

Edit: typo in desired output


Answer 1:


You can melt your data to long format:

library(data.table)
X1 <-
  melt(
    X,
    id.vars = "group",
    
    # melt two sets of measure columns simultaneously:
    # the value columns ("v" followed by one or more digits)
    # and the type columns ("type_v" followed by one or more digits)
    measure.vars = patterns(c("^v\\d+$", "^type_v\\d+$")),
    value.name = c("value", "type")
  )
X1
#     group variable value type
# 1:     1        1     10   t1
# 2:     1        1     12   t2
# 3:     2        1     14   t1
# 4:     2        1     16   t1
# 5:     2        1     18   t2
# 6:     1        2      3   t2
# 7:     1        2     NA     
# 8:     2        2     NA     
# 9:     2        2      7   t3
#10:     2        2      8   t3
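
The two regular expressions passed to patterns() above can be checked against the column names before melting. The following is only a small sketch using base R's grep(); the vector cols is typed in by hand from the example table X. melt() pairs the matched columns by position, so the first value column goes with the first type column, and so on.

```r
# Column names of the example table X (entered by hand for illustration)
cols <- c("id", "group", "v1", "type_v1", "v2", "type_v2")

# Columns matched by the first pattern (the values)
value_cols <- grep("^v\\d+$", cols, value = TRUE)

# Columns matched by the second pattern (the types)
type_cols <- grep("^type_v\\d+$", cols, value = TRUE)

value_cols
# [1] "v1" "v2"
type_cols
# [1] "type_v1" "type_v2"
```

If the two vectors do not line up in the same order in your own data, the values and types would be paired incorrectly, so this check is worth running once.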

Aggregate the data, excluding rows where the 'type' column is an empty string:

tmp <- X1[type!="", .("v" = sum(value)), by=.(group, type)]
tmp
#   group type  v
#1:     1   t1 10
#2:     1   t2 15
#3:     2   t1 30
#4:     2   t2 18
#5:     2   t3 15
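
In the example, every missing value has an empty type, so filtering on type != "" also removes the NAs. If your real data might contain a missing value with a non-empty type (an assumption, this does not occur in the question's data), sum() would return NA for that group and type. A minimal sketch with made-up data, where X1_demo and tmp_demo are hypothetical names:

```r
library(data.table)

# Hypothetical long-format data: one NA value carries a non-empty type
X1_demo <- data.table(
  group = c(1, 1, 2),
  value = c(10, NA, 5),
  type  = c("t1", "t1", "t2")
)

# na.rm = TRUE ignores the missing value instead of propagating NA
tmp_demo <- X1_demo[type != "", .(v = sum(value, na.rm = TRUE)), by = .(group, type)]
tmp_demo
```

Whether dropping such values silently is acceptable depends on what a missing value means in your data.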

And finally reshape to wide format again

out <- dcast(tmp, group ~ rowid(group), value.var = c("v", "type")) 
out
#   group v_1 v_2 v_3 type_1 type_2 type_3
#1:     1  10  15  NA     t1     t2   <NA>
#2:     2  30  18  15     t1     t2     t3
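
The rowid(group) term in the formula above is what assigns each aggregated row to a numbered slot within its group. As a quick illustration, here it is applied to the group column of tmp, typed in by hand:

```r
library(data.table)

# The 'group' column of the aggregated table tmp from the example
group <- c(1, 1, 2, 2, 2)

# rowid() numbers the rows within each distinct value, in order of appearance
ids <- rowid(group)
ids
# [1] 1 2 1 2 3
```

Because the numbering follows row order, slot 1 holds whichever type appears first in each group, and a given slot is not guaranteed to hold the same type across groups.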

If you need the column order to be v_1 | type_1 | v_2 ..., you can use setcolorder:

tmp2 <- setdiff(names(out), "group")

# create a vector based on the order of the numeric part of 'tmp2'
idx <- order(as.numeric(gsub("\\D", "", tmp2)))
setcolorder(out, c("group", tmp2[idx]))
out
#   group v_1 type_1 v_2 type_2 v_3 type_3
#1:     1  10     t1  15     t2  NA   <NA>
#2:     2  30     t1  18     t2  15     t3
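
For convenience, the steps above can be collected into a single function. This is only a sketch: sum_by_type is a hypothetical helper name, not part of data.table, and it assumes the same column naming scheme as the example (v1, type_v1, v2, type_v2, ...) and a grouping column called group.

```r
library(data.table)

# sum_by_type() is a hypothetical helper wrapping the steps shown above
sum_by_type <- function(dt) {
  # 1. melt value and type columns together into long format
  long <- melt(
    dt,
    id.vars = "group",
    measure.vars = patterns("^v\\d+$", "^type_v\\d+$"),
    value.name = c("value", "type")
  )
  # 2. sum per group and type, skipping rows with an empty type
  agg <- long[type != "", .(v = sum(value)), by = .(group, type)]
  # 3. back to wide format, one numbered slot per type within each group
  wide <- dcast(agg, group ~ rowid(group), value.var = c("v", "type"))
  # 4. interleave the columns as v_1, type_1, v_2, type_2, ...
  cols <- setdiff(names(wide), "group")
  setcolorder(wide, c("group", cols[order(as.numeric(gsub("\\D", "", cols)))]))
  wide[]
}

X <- data.table(id = 1:5, group = c(1,1,2,2,2), v1 = c(10,12,14,16,18),
                type_v1 = c("t1","t2","t1","t1","t2"), v2 = c(3,NA,NA,7,8),
                type_v2 = c("t2", "", "", "t3","t3"))
out <- sum_by_type(X)
out
```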


Source: https://stackoverflow.com/questions/64109789/a-complicated-sum-in-r-data-table-that-involves-looking-at-other-columns
