Update data.table based on multiple columns and conditions

问题

This is a follow-up-question from Efficient way to subset data.table based on value in any of selected columns.

sample data
I have got a data.table with 5 p-columns, indicating a type (type1 or type2 or NA). I also have got 5 r-columns, indicating a score (1-10, or NA).

library(data.table)
set.seed(123)
v  <- c( "type1", "type2", NA_character_ )
v2 <- c( 1:10, rep( NA_integer_, 10 ) )
DT <- data.table( id = 1:100,
                  p1 = sample(v, 100, replace = TRUE ),
                  p2 = sample(v, 100, replace = TRUE ),
                  p3 = sample(v, 100, replace = TRUE ),
                  p4 = sample(v, 100, replace = TRUE ),
                  p5 = sample(v, 100, replace = TRUE ),
                  r1 = sample(v2, 100, replace = TRUE ),
                  r2 = sample(v2, 100, replace = TRUE ),
                  r3 = sample(v2, 100, replace = TRUE ),
                  r4 = sample(v2, 100, replace = TRUE ),
                  r5 = sample(v2, 100, replace = TRUE ))

desired output
I want to create two new columns (one for type1 and one for type2) where I check rowwise if type1/type2 has occured in one or more of the p-columns, and if at least one of the corresponding r-column (p1 -> check r1, p2 -> check r2, etc.) contains a value.

'manual' solution
This can be done like below, using a lot of AND and OR statements:

manual_solution <- DT[ ( p1 == "type1" & !is.na( r1 ) ) |
                         ( p2 == "type1" & !is.na( r2 ) ) |
                         ( p3 == "type1" & !is.na( r3 ) ) |
                         ( p4 == "type1" & !is.na( r4 ) ) |
                         ( p5 == "type1" & !is.na( r5 ) ), 
                       type1_present := "yes"]
manual_solution <- DT[ ( p1 == "type2" & !is.na( r1 ) ) |
                         ( p2 == "type2" & !is.na( r2 ) ) |
                         ( p3 == "type2" & !is.na( r3 ) ) |
                         ( p4 == "type2" & !is.na( r4 ) ) |
                         ( p5 == "type2" & !is.na( r5 ) ), 
                       type2_present := "yes"]
manual_solution[ is.na( type1_present ), type1_present := "no" ]
manual_solution[ is.na( type2_present ), type2_present := "no" ]

Question: automation for dozens of p and r-columns
But looking at the answers from Efficient way to subset data.table based on value in any of selected columns, I'm convinced there are better ways. Especially since my production data contains A LOT more p-columns and r-columns.

So I started playing around, but got stuck pretty fast...

#build vectors p-columns and r-columns
p_cols <- grep( "^p", names( DT ), value = TRUE )
r_cols <- grep( "^r", names( DT ), value = TRUE )

#create logical vectors to test for NA
logi_p <- as.data.table( sapply( DT[, ..p_cols ], function(x) !is.na(x) ) )
logi_r <- as.data.table( sapply( DT[, ..r_cols ], function(x) !is.na(x) ) )

#which non-NA p-values also have a non-NA r-value?
logi <- as.data.table( sapply( logi_p * logi_r, as.logical ) )

And now I havent't got any inspiration left on how to proceed.
Any ideas/suggestions?

bonus
My main focus is on the question above. But my production data also contains a lot more types (in the p-columns)... So a solution that adds a column by type (or can dcast to this result), would 'kill two birds with one stone'.

回答1:

Here is a solution where I convert the type columns into a matrix, update them with information from the r columns and then apply over it searching for the relevant type for as many times as there are types to look for.

# Convert to a matrix
pMAT <- DT[, as.matrix(.SD), .SDcols = patterns("^p")]
# Subset a matrix with another matrix of the r columns
pMAT[] <- pMAT[DT[, as.logical(as.matrix(.SD)), .SDcols = patterns("^r")]]

types2check <- c("type1", "type2")
for (t in types2check) {
  set(
    x = DT, 
    j = paste0(t, "_present"), 
    value = ifelse(apply(pMAT, 1, function(x) any(x == t, na.rm = TRUE)), "yes", "no")
  )
}

Extra

Playing with dcast() you could do something like the following. Pipes are there just for readability and some of the steps can probably be simplified.

result <- data.table(id = DT[["id"]], stack(DT[, ..pcols]), stack(DT[, ..rcols])) %>%
  setnames(c("id", "type", "pind", "rval", "rind")) %>% 
  .[, .(type = type[as.logical(rval)], id)] %>% 
  dcast(id ~ type, value.var = "id", fill = "no", fun.aggregate = function(x) if (length(x)) "yes") %>% 
  .[, `NA` := NULL]

> head(result)
   id type1 type2
1:  1   yes   yes
2:  2   yes    no
3:  3    no   yes
4:  4    no    no
5:  5    no   yes
6:  6   yes    no

来源：https://stackoverflow.com/questions/54957137/update-data-table-based-on-multiple-columns-and-conditions

标签

data.table