Update data.table based on multiple columns and conditions

Submitted by 做~自己de王妃 on 2021-01-28 17:35:13

Question


This is a follow-up question to "Efficient way to subset data.table based on value in any of selected columns".

sample data
I have a data.table with 5 p-columns, each indicating a type (type1, type2, or NA), and 5 r-columns, each indicating a score (1-10, or NA).

library(data.table)
set.seed(123)
v  <- c( "type1", "type2", NA_character_ )
v2 <- c( 1:10, rep( NA_integer_, 10 ) )
DT <- data.table( id = 1:100,
                  p1 = sample(v, 100, replace = TRUE ),
                  p2 = sample(v, 100, replace = TRUE ),
                  p3 = sample(v, 100, replace = TRUE ),
                  p4 = sample(v, 100, replace = TRUE ),
                  p5 = sample(v, 100, replace = TRUE ),
                  r1 = sample(v2, 100, replace = TRUE ),
                  r2 = sample(v2, 100, replace = TRUE ),
                  r3 = sample(v2, 100, replace = TRUE ),
                  r4 = sample(v2, 100, replace = TRUE ),
                  r5 = sample(v2, 100, replace = TRUE ))

desired output
I want to create two new columns (one for type1 and one for type2). For each row, the new column should be "yes" if type1/type2 occurs in at least one p-column whose corresponding r-column (p1 -> check r1, p2 -> check r2, etc.) contains a value, and "no" otherwise.

'manual' solution
This can be done like below, using a lot of AND and OR statements:

manual_solution <- DT[ ( p1 == "type1" & !is.na( r1 ) ) |
                         ( p2 == "type1" & !is.na( r2 ) ) |
                         ( p3 == "type1" & !is.na( r3 ) ) |
                         ( p4 == "type1" & !is.na( r4 ) ) |
                         ( p5 == "type1" & !is.na( r5 ) ), 
                       type1_present := "yes"]
manual_solution <- DT[ ( p1 == "type2" & !is.na( r1 ) ) |
                         ( p2 == "type2" & !is.na( r2 ) ) |
                         ( p3 == "type2" & !is.na( r3 ) ) |
                         ( p4 == "type2" & !is.na( r4 ) ) |
                         ( p5 == "type2" & !is.na( r5 ) ), 
                       type2_present := "yes"]
manual_solution[ is.na( type1_present ), type1_present := "no" ]
manual_solution[ is.na( type2_present ), type2_present := "no" ]

Question: automation for dozens of p and r-columns
But looking at the answers from Efficient way to subset data.table based on value in any of selected columns, I'm convinced there are better ways. Especially since my production data contains A LOT more p-columns and r-columns.

So I started playing around, but got stuck pretty fast...

#build vectors p-columns and r-columns
p_cols <- grep( "^p", names( DT ), value = TRUE )
r_cols <- grep( "^r", names( DT ), value = TRUE )

#create logical vectors to test for NA
logi_p <- as.data.table( sapply( DT[, ..p_cols ], function(x) !is.na(x) ) )
logi_r <- as.data.table( sapply( DT[, ..r_cols ], function(x) !is.na(x) ) )

#which non-NA p-values also have a non-NA r-value?
logi <- as.data.table( sapply( logi_p * logi_r, as.logical ) )

And now I have run out of ideas on how to proceed.
Any ideas/suggestions?

bonus
My main focus is on the question above. But my production data also contains a lot more types (in the p-columns)... So a solution that adds a column by type (or can dcast to this result), would 'kill two birds with one stone'.


Answer 1:


Here is a solution where I convert the type columns into a matrix, set its entries to NA wherever the corresponding r column is NA, and then, for each type to look for, apply over the rows of the matrix to check whether that type is present.

# Convert to a matrix
pMAT <- DT[, as.matrix(.SD), .SDcols = patterns("^p")]
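# Keep a p value only where the matching r value is non-NA (other entries become NA);
# this relies on the r values never being 0, because as.logical(0) is FALSE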
pMAT[] <- pMAT[DT[, as.logical(as.matrix(.SD)), .SDcols = patterns("^r")]]

types2check <- c("type1", "type2")
for (t in types2check) {
  set(
    x = DT, 
    j = paste0(t, "_present"), 
    value = ifelse(apply(pMAT, 1, function(x) any(x == t, na.rm = TRUE)), "yes", "no")
  )
}
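
The same pairwise rule can also be written without a matrix, by building one logical vector per (p, r) pair and combining them with OR. The following is only a sketch on a small made-up table (`toy` and its values are hypothetical, not the DT from the question), and it assumes the p and r columns are named so that they come out of grep() in matching order (p1 with r1, p2 with r2, and so on).

```r
library(data.table)

# Small hypothetical table for illustration (not the DT from the question)
toy <- data.table(
  id = 1:4,
  p1 = c("type1", "type2", "type2", "type1"),
  p2 = c(NA, "type1", "type2", "type2"),
  r1 = c(5L, NA, 3L, NA),
  r2 = c(NA, 2L, NA, NA)
)

# Assumption: both vectors are in matching order (p1/r1, p2/r2, ...)
p_cols <- grep("^p[0-9]+$", names(toy), value = TRUE)
r_cols <- grep("^r[0-9]+$", names(toy), value = TRUE)

for (ty in c("type1", "type2")) {
  # One logical vector per (p, r) pair: p equals the type and r is not NA
  hits <- Map(function(p, r) !is.na(p) & p == ty & !is.na(r),
              toy[, ..p_cols], toy[, ..r_cols])
  # Combine the pairs row by row with OR and add the column by reference
  set(toy, j = paste0(ty, "_present"),
      value = ifelse(Reduce(`|`, hits), "yes", "no"))
}

print(toy)
```

Because `!is.na(p)` is tested first, the comparison never yields NA, so every row ends up as either "yes" or "no" without a separate clean-up step.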

Extra

Playing with dcast() you could do something like the following. Pipes are there just for readability and some of the steps can probably be simplified.

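library(magrittr)  # provides the %>% pipe

# p_cols and r_cols are the column-name vectors built in the question
result <- data.table(id = DT[["id"]], stack(DT[, ..p_cols]), stack(DT[, ..r_cols])) %>%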
  setnames(c("id", "type", "pind", "rval", "rind")) %>% 
  .[, .(type = type[as.logical(rval)], id)] %>% 
  dcast(id ~ type, value.var = "id", fill = "no", fun.aggregate = function(x) if (length(x)) "yes") %>% 
  .[, `NA` := NULL]

> head(result)
   id type1 type2
1:  1   yes   yes
2:  2   yes    no
3:  3    no   yes
4:  4    no    no
5:  5    no   yes
6:  6   yes    no
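
A possible variation on the same idea, shown here only as a sketch, uses melt() with patterns() to pair the p and r columns in one step before calling dcast(). The `toy` table and the helper column `present` are made up for illustration. One column per type found in the data falls out automatically, which is what the bonus question asks for.

```r
library(data.table)

# Small hypothetical table for illustration (not the DT from the question)
toy <- data.table(
  id = 1:4,
  p1 = c("type1", "type2", "type2", "type1"),
  p2 = c(NA, "type1", "type2", "type2"),
  r1 = c(5L, NA, 3L, NA),
  r2 = c(NA, 2L, NA, NA)
)

# Long format: one row per (id, pair), type in `p` and score in `r`
long <- melt(toy, id.vars = "id",
             measure.vars = patterns("^p[0-9]+$", "^r[0-9]+$"),
             value.name = c("p", "r"))

# Keep only pairs where both the type and the score are present
hits <- unique(long[!is.na(p) & !is.na(r), .(id, p)])
hits[, present := "yes"]  # hypothetical helper column

# Wide format: one column per type; missing combinations become "no"
wide <- dcast(hits, id ~ p, value.var = "present", fill = "no")

# ids without any hit were dropped above; join them back and fill with "no"
result <- wide[toy[, .(id)], on = "id"]
for (col in setdiff(names(result), "id")) {
  set(result, i = which(is.na(result[[col]])), j = col, value = "no")
}

print(result)
```

The final join matters because an id with no valid (p, r) pair never reaches dcast() and would otherwise be missing from the result.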


Source: https://stackoverflow.com/questions/54957137/update-data-table-based-on-multiple-columns-and-conditions
