Find multiple strings in entire dataframe

Posted by 人走茶凉 on 2019-12-24 01:17:24


I am trying to find multiple strings anywhere in my data frame using the which function, extending the answer from "Find string in data.frame".

An example dataframe is:

df1 <- data.frame(animal=c('a','b','c','two', 'five', 'c'), level=c('five','one','three',30,'horse', 'five'), length=c(10, 20, 30, 'horse', 'eight', 'c'))

  animal level length
1      a  five     10
2      b   one     20
3      c three     30
4    two    30  horse
5   five horse  eight
6      c  five      c

When I apply which to this data frame for a single string, I get the correct output. For example, which(df1 == "c", arr.ind = TRUE) gives:

     row col
[1,]   3   1
[2,]   6   1
[3,]   6   3

But when I search for multiple strings, I get only part of the expected output. For example, which(df1 == c("c", "horse", "five"), arr.ind = TRUE) gives:

     row col
[1,]   5   2
[2,]   6   2

The expected output should be:

     row col
[1,]   3   1
[2,]   5   1
[3,]   6   1
[4,]   1   2
[5,]   5   2
[6,]   6   2
[7,]   4   3
[8,]   6   3

Hence my questions:

  1. Why does the version with c("c", "horse", "five") not work?

  2. I have tried

which(df1=="c" | df1=="horse" | df1 =="five", arr.ind = T)

which gives me the correct output, but it becomes too lengthy for many strings. How can I make the code more succinct?


We can loop over the vector of strings with lapply, compare each one to the data frame with ==, combine the resulting logical matrices into a single one with Reduce and |, and wrap the result in which

which(Reduce(`|`, lapply(c("c", "horse", "five"), `==`, df1)), arr.ind = TRUE)
#     row col
#[1,]   3   1
#[2,]   5   1
#[3,]   6   1
#[4,]   1   2
#[5,]   5   2
#[6,]   6   2
#[7,]   4   3
#[8,]   6   3
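
This works because each == compares the data frame against a single value. As a rough illustration of why df1 == c("c", "horse", "five") finds only two cells: the vector appears to be recycled down the rows, position by position, rather than being treated as a set of candidates. The sketch below checks this on the level column. The names vals, recycled, level_hits and df_hits are introduced here for illustration, and stringsAsFactors = FALSE is an added assumption so that the columns are plain character vectors.

```r
df1 <- data.frame(animal = c('a', 'b', 'c', 'two', 'five', 'c'),
                  level  = c('five', 'one', 'three', 30, 'horse', 'five'),
                  length = c(10, 20, 30, 'horse', 'eight', 'c'),
                  stringsAsFactors = FALSE)  # assumption: keep columns as character

vals <- c("c", "horse", "five")

# The 3 values recycled to the 6 rows: c, horse, five, c, horse, five
recycled <- rep_len(vals, nrow(df1))

# Element-by-element comparison of one column against the recycled values
level_hits <- df1$level == recycled
level_hits
# [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

# The same column taken from the data-frame-wide comparison
df_hits <- (df1 == vals)[, "level"]
```

Rows 5 and 6 are the only positions where a cell happens to line up with the same value in the recycled vector, which matches the two hits reported in the question.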

Another option is to loop over the columns of the dataset with mutate_all from dplyr, convert the result to a matrix, and wrap it in which

library(dplyr)  # provides %>% and mutate_all

df1 %>%
  mutate_all(list(~ . %in% c("c", "horse", "five"))) %>%
  as.matrix %>%
  which(., arr.ind = TRUE)

NOTE: Neither approach needs regex or partial matching, since the OP wants full-string matches. Exact comparison should also be faster than partial matching

For multiple elements, %in% is usually the natural choice, but it works on a vector and not on a data.frame
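
Since %in% works on vectors, one base R option is to apply it to each column with sapply, which returns a logical matrix that which can index. This is a minimal sketch using the df1 from the question; the names vals, hits and res are introduced here for illustration.

```r
df1 <- data.frame(animal = c('a', 'b', 'c', 'two', 'five', 'c'),
                  level  = c('five', 'one', 'three', 30, 'horse', 'five'),
                  length = c(10, 20, 30, 'horse', 'eight', 'c'))

vals <- c("c", "horse", "five")

# Apply %in% to every column; the result is a 6 x 3 logical matrix
hits <- sapply(df1, `%in%`, vals)

# Row and column indices of the matching cells
res <- which(hits, arr.ind = TRUE)
res
#      row col
# [1,]   3   1
# [2,]   5   1
# [3,]   6   1
# [4,]   1   2
# [5,]   5   2
# [6,]   6   2
# [7,]   4   3
# [8,]   6   3
```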


Since you have multiple values, you cannot compare them directly against a data frame with ==. One way is to use sapply with grepl: build a pattern with word boundaries around each value, check every column for it, and then use which to get the row and column indices.

vals <- c("c", "horse", "five") 

which(sapply(df1, grepl, pattern = paste0("\\b", vals, "\\b", collapse = "|")), 
      arr.ind = TRUE)

#     row col
#[1,]   3   1
#[2,]   5   1
#[3,]   6   1
#[4,]   1   2
#[5,]   5   2
#[6,]   6   2
#[7,]   4   3
#[8,]   6   3
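
Note that a word-boundary pattern matches a value anywhere inside a cell, which coincides with exact matching on df1 but not necessarily on other data. A small sketch of the difference, using a hypothetical vector cells that is not part of the original data:

```r
vals <- c("c", "horse", "five")
pattern <- paste0("\\b", vals, "\\b", collapse = "|")

# Hypothetical cell contents, chosen to show where the two approaches differ
cells <- c("c", "c-section", "sea horse", "horses")

# Word-boundary regex: matches whenever a value appears as a whole word
regex_hits <- grepl(pattern, cells)
regex_hits
# [1]  TRUE  TRUE  TRUE FALSE

# Exact matching: only cells that are equal to one of the values
exact_hits <- cells %in% vals
exact_hits
# [1]  TRUE FALSE FALSE FALSE
```

If cells can contain longer text and only whole-cell matches are wanted, exact comparison is the safer choice.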

