I'm trying to filter a patient database based on specific ICD9 (diagnosis) codes. I would like to use a vector containing the first three characters of the ICD9 codes.
The exa
You can build a regex pattern from the vector of codes of interest, apply it to every column of your data frame except the patient id, and use rowSums to check whether any variable in a row matches the pattern:
library(dplyr)
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
pattern
# [1] "^(866|867)"
filter(observations, rowSums(sapply(observations[-1], grepl, pattern = pattern)) != 0)
# A tibble: 2 × 4
# patient var1 var2 var3
#
#1 a 8661 8651 2430
#2 b 865 8674 3456
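To make the steps above easier to follow, here is a minimal base R sketch that exposes the intermediate match matrix. The sample observations table is an assumption reconstructed from the printed output; the real data may hold more rows and columns.

```r
# Hypothetical sample data, reconstructed from the output shown above
observations <- data.frame(
  patient = c("a", "b", "c"),
  var1 = c("8661", "865", "8651"),
  var2 = c("8651", "8674", "2866"),
  var3 = c("2430", "3456", "9089"),
  stringsAsFactors = FALSE
)
dx <- c("866", "867")

# Anchored alternation: matches codes that START with any prefix in dx
pattern <- paste0("^(", paste(dx, collapse = "|"), ")")

# One logical column per diagnosis column (patient id dropped via [-1])
hits <- sapply(observations[-1], grepl, pattern = pattern)

# A row is kept when at least one of its columns matched
matched <- observations[rowSums(hits) != 0, ]
print(matched)
```

Note that 2866 in row c does not count as a match, because the ^ anchor only accepts the prefix at the start of the code.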
Another option is to use Reduce with lapply:
filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
# A tibble: 2 × 4
# patient var1 var2 var3
#
#1 a 8661 8651 2430
#2 b 865 8674 3456
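As a rough illustration of what Reduce does here, the sketch below folds a list of per-column logical vectors with the | operator. The col_hits values are made up for the example and stand in for the result of lapply(..., grepl, ...).

```r
# Hypothetical per-column match vectors, one element per patient
col_hits <- list(
  var1 = c(TRUE, FALSE, FALSE),
  var2 = c(FALSE, TRUE, FALSE),
  var3 = c(FALSE, FALSE, FALSE)
)

# Reduce folds the list pairwise: (var1 | var2) | var3
any_hit <- Reduce("|", col_hits)

# Same answer as the rowSums approach on the column-bound matrix
same_as_rowsums <- rowSums(do.call(cbind, col_hits)) != 0
print(any_hit)
```

Because the fold works on a list, it also returns a plain logical vector when the data has a single row, a case where sapply would simplify to a vector and rowSums would fail.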
This approach also works when you have more than two patterns and the patterns differ in length, for instance, if dx is defined as dx <- c("866", "867", "9089"):
dx<-c("866","867", "9089")
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")
pattern
# [1] "^(866|867|9089)"
filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))
# A tibble: 3 × 4
# patient var1 var2 var3
#
#1 a 8661 8651 2430
#2 b 865 8674 3456
#3 c 8651 2866 9089
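If you need this filter in several places, the pattern building and matching could be wrapped in a small function. The sketch below is one possible shape, not part of dplyr or base R; the name filter_by_prefix, its arguments, and the sample data are all assumptions made for illustration.

```r
# Hypothetical helper: keep rows where any column in `cols`
# starts with one of the given prefixes
filter_by_prefix <- function(data, prefixes, cols) {
  pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")
  keep <- Reduce("|", lapply(data[cols], grepl, pattern = pattern))
  data[keep, , drop = FALSE]
}

# Hypothetical sample data, reconstructed from the output shown above
observations <- data.frame(
  patient = c("a", "b", "c"),
  var1 = c("8661", "865", "8651"),
  var2 = c("8651", "8674", "2866"),
  var3 = c("2430", "3456", "9089"),
  stringsAsFactors = FALSE
)

# Prefixes of different lengths are handled by the same alternation
res <- filter_by_prefix(observations, c("866", "867", "9089"),
                        c("var1", "var2", "var3"))
print(res)
```

The prefixes are pasted into the regex unescaped, so this sketch assumes they contain no regex metacharacters such as a dot.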
See this and this Stack Overflow answer for more on multiple OR conditions in regex.