Filter rows based on variables “beginning with” strings specified by vector

梦如初夏 2021-01-22 08:00

I'm trying to filter a patient database based on specific ICD9 (diagnosis) codes. I would like to use a vector indicating the first 3 characters of the ICD9 codes.

The exa

3 answers
  •  耶瑟儿~
    2021-01-22 08:30

    You can build a regex pattern from the vector of codes of interest and apply it to every column of your data frame except the patient id, then use rowSums to check whether any variable in a row matches the pattern:

    library(dplyr)
    # Anchor at the start of the string and join the prefixes with "|"
    pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
    
    pattern
    # [1] "^(866|867)"
    
    # Keep rows where at least one column other than the patient id matches
    filter(observations, rowSums(sapply(observations[-1], grepl, pattern = pattern)) != 0)
    
    # A tibble: 2 × 4
    #  patient  var1  var2  var3
    #       
    #1       a  8661  8651  2430
    #2       b   865  8674  3456
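
    The question's example data is not shown here, so the following is only a minimal self-contained sketch of the same rowSums idea in base R. The `observations` data frame (including the non-matching patient `d`) and the `dx` vector are assumptions, chosen to be consistent with the printed pattern and output above; they are not the asker's actual data.

    ```r
    # Hypothetical data, chosen to be consistent with the output shown above
    observations <- data.frame(
      patient = c("a", "b", "c", "d"),
      var1 = c("8661", "865", "8651", "8651"),
      var2 = c("8651", "8674", "2866", "2430"),
      var3 = c("2430", "3456", "9089", "3456"),
      stringsAsFactors = FALSE
    )
    dx <- c("866", "867")  # assumed from the printed pattern "^(866|867)"

    # Anchor at the start of the string, then allow any of the prefixes
    pattern <- paste0("^(", paste(dx, collapse = "|"), ")")

    # One logical column per diagnosis variable: TRUE where the value starts with a prefix
    hits <- sapply(observations[-1], grepl, pattern = pattern)

    # Keep rows with at least one hit (base R subsetting instead of dplyr::filter)
    matched <- observations[rowSums(hits) != 0, ]
    ```

    With this made-up data, `matched` should contain patients a and b only. Note that `sapply` returns a plain vector rather than a matrix when the data frame has a single row, in which case `rowSums` fails.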
    

    Another option is to use Reduce with lapply:

    filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
    
    # A tibble: 2 × 4
    #  patient  var1  var2  var3
    #       
    #1       a  8661  8651  2430
    #2       b   865  8674  3456
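
    To see why this works: `lapply` returns one logical vector per column, and `Reduce("|", ...)` folds them together with elementwise OR, so a row is kept if any column matches. A small base R sketch with made-up logical vectors (the names `col_hits` and `any_hit` are illustrative only):

    ```r
    # Stand-ins for the per-column results of grepl()
    col_hits <- list(
      var1 = c(TRUE, FALSE, FALSE),
      var2 = c(FALSE, TRUE, FALSE),
      var3 = c(FALSE, FALSE, FALSE)
    )

    # Equivalent to col_hits$var1 | col_hits$var2 | col_hits$var3
    any_hit <- Reduce("|", col_hits)
    any_hit
    # [1]  TRUE  TRUE FALSE
    ```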
    

    This approach also works when you have more than two patterns and the patterns differ in length, for instance with dx <- c("866", "867", "9089"):

    dx<-c("866","867", "9089")
    pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
    
    pattern
    # [1] "^(866|867|9089)"
    
    filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
    
    # A tibble: 3 × 4
    #  patient  var1  var2  var3
    #       
    #1       a  8661  8651  2430
    #2       b   865  8674  3456
    #3       c  8651  2866  9089
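
    If the prefixes are plain strings rather than regular expressions, a possible alternative (not part of the approach above) is base R's `startsWith`, which compares literally and so avoids surprises from regex metacharacters such as `.` in a code. The helper `has_prefix` and the `codes` vector below are hypothetical, shown only as a sketch:

    ```r
    dx <- c("866", "867", "9089")
    codes <- c("8661", "8674", "9089", "2866", "865")  # made-up example values

    # Hypothetical helper: TRUE where a value starts with any of the prefixes
    has_prefix <- function(x, prefixes) {
      x <- as.character(x)  # startsWith() requires character input
      Reduce("|", lapply(prefixes, function(p) startsWith(x, p)))
    }

    has_prefix(codes, dx)
    # [1]  TRUE  TRUE  TRUE FALSE FALSE
    ```

    It could then stand in for grepl in the row filter, e.g. `Reduce("|", lapply(observations[-1], has_prefix, prefixes = dx))`.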
    

    See this and this Stack Overflow answer for more about multiple OR conditions in regex.
