Find multiple strings in entire dataframe

Submitted by 人走茶凉 on 2019-12-24 01:17:24

Question


I am trying to find multiple strings in my data frame using the which function, extending the answer from "Find string in data.frame".

An example dataframe is:

df1 <- data.frame(animal=c('a','b','c','two', 'five', 'c'), level=c('five','one','three',30,'horse', 'five'), length=c(10, 20, 30, 'horse', 'eight', 'c'))

  animal level length
1      a  five     10
2      b   one     20
3      c three     30
4    two    30  horse
5   five horse  eight
6      c  five      c

When I apply the which function to this data frame for a single string, I get the correct output, e.g. which(df1 == "c", arr.ind = TRUE) gives:

     row col
[1,]   3   1
[2,]   6   1
[3,]   6   3

But when I search for multiple strings, the output is only partially correct, e.g. which(df1 == c("c", "horse", "five"), arr.ind = TRUE) gives:

     row col
[1,]   5   2
[2,]   6   2

The expected output should be:

     row col
[1,]   3   1
[2,]   5   1
[3,]   6   1
[4,]   1   2
[5,]   5   2
[6,]   6   2
[7,]   4   3
[8,]   6   3

Hence my questions:

  1. Why does the solution with c("c", "horse", "five") not work?

  2. I have tried

which(df1 == "c" | df1 == "horse" | df1 == "five", arr.ind = TRUE)

which gives the correct output, but it becomes too lengthy for many strings. How can I make my code more succinct?


Answer 1:


We can loop through the vector of strings with lapply, compare each one to the data frame with ==, combine the resulting logical matrices into a single one with Reduce and |, and wrap the result in which

which(Reduce(`|`, lapply(c("c", "horse", "five"), `==`, df1)), arr.ind = TRUE)
#     row col
#[1,]   3   1
#[2,]   5   1
#[3,]   6   1
#[4,]   1   2
#[5,]   5   2
#[6,]   6   2
#[7,]   4   3
#[8,]   6   3

Another option is to loop through the columns of the dataset with mutate_all, test each one with %in%, convert the result to a matrix, and wrap it in which

library(dplyr)
df1 %>%
  mutate_all(list(~ . %in% c("c", "horse", "five"))) %>%
  as.matrix %>% 
  which(., arr.ind = TRUE)

NOTE: If the OP wants a full-string match, no regex or partial matching is needed here. An exact comparison should also be faster than a partial match.
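
To make the note above concrete, here is a small sketch of how an exact match differs from an unanchored regex match. The vector x and its values are made up for illustration and are not part of the OP's data.

```r
# Hypothetical cell values, chosen only to contrast the two kinds of match
x    <- c("c", "cat", "horse", "seahorse", "dog")
vals <- c("c", "horse")

# Exact match: TRUE only where the whole cell equals one of vals
exact <- x %in% vals

# Unanchored regex: TRUE wherever a cell merely contains one of vals
partial <- grepl(paste(vals, collapse = "|"), x)

exact    # TRUE FALSE TRUE FALSE FALSE
partial  # TRUE TRUE TRUE TRUE FALSE
```

With the exact comparison, "cat" and "seahorse" are not reported, which is what a full-string search needs.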


Usually %in% is the natural choice for matching against multiple elements, but it operates on a vector, not on a data.frame, so it has to be applied column by column.
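
A base R sketch of that column-by-column idea, without dplyr: sapply applies %in% to each column and binds the results into a logical matrix. The names vals, hits and res are illustrative only.

```r
df1 <- data.frame(
  animal = c("a", "b", "c", "two", "five", "c"),
  level  = c("five", "one", "three", 30, "horse", "five"),
  length = c(10, 20, 30, "horse", "eight", "c")
)
vals <- c("c", "horse", "five")

# Each column is a vector, so %in% works on it;
# sapply collects the per-column results into a 6 x 3 logical matrix
hits <- sapply(df1, `%in%`, vals)

# Row and column indices of every cell that equals one of vals
res <- which(hits, arr.ind = TRUE)
res
```

For this data the result has the eight row/column pairs listed as the expected output in the question.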




Answer 2:


Since you have multiple values, you cannot compare them against the data frame directly with ==. One way is to use sapply with grepl: build a single pattern that wraps each value in word boundaries, check each column for a match, and then use which to get the row and column indices.

vals <- c("c", "horse", "five") 

which(sapply(df1, grepl, pattern = paste0("\\b", vals, "\\b", collapse = "|")), 
      arr.ind = TRUE)

#     row col
#[1,]   3   1
#[2,]   5   1
#[3,]   6   1
#[4,]   1   2
#[5,]   5   2
#[6,]   6   2
#[7,]   4   3
#[8,]   6   3
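
On the first question, why the direct comparison with c("c", "horse", "five") returns only two hits: == does not test each cell against every value. It recycles the vector down the columns and compares element by element. The sketch below rebuilds that recycled vector by hand. The names recycled, direct and same_as_direct are illustrative.

```r
df1 <- data.frame(
  animal = c("a", "b", "c", "two", "five", "c"),
  level  = c("five", "one", "three", 30, "horse", "five"),
  length = c(10, 20, 30, "horse", "eight", "c")
)
vals <- c("c", "horse", "five")

# vals repeated to the size of the data frame and laid out column by column
recycled <- matrix(rep_len(vals, nrow(df1) * ncol(df1)), nrow = nrow(df1))
recycled[, 1]  # "c" "horse" "five" "c" "horse" "five"

# A cell is TRUE only if it equals the value that happens to line up with it
direct <- which(as.matrix(df1) == recycled, arr.ind = TRUE)
direct

# Check that this mirrors what df1 == vals computes
same_as_direct <- all((df1 == vals) == (as.matrix(df1) == recycled))
```

Only rows 5 and 6 of the second column line up with their recycled partner ("horse" and "five"), which is the partial output shown in the question.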


Source: https://stackoverflow.com/questions/56583161/find-multiple-strings-in-entire-dataframe
