Difference of two character vectors with substring

死守一世寂寞 2020-12-11 09:09

I have two character vectors:

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

I want the elements of b that do not contain any element of a as a substring.

3 Answers
  • 2020-12-11 09:25

    You could try the following:

    b[!(+(apply(sapply(a, function(x) grepl(x,b)),1,sum)) > 0)]
    [1] "zyc" "ulk" "mae"
    

    'Peeling' this call from the inside out gives the following results. First, obtain a matrix of matches from the grepl call (with sapply):

    sapply(a, function(x) grepl(x,b))
    #        da    ba    cs    dd    ek
    #[1,] FALSE FALSE FALSE FALSE FALSE
    #[2,] FALSE FALSE FALSE FALSE FALSE
    #[3,] FALSE FALSE FALSE FALSE FALSE
    #[4,] FALSE FALSE  TRUE FALSE FALSE
    #[5,] FALSE FALSE FALSE  TRUE FALSE
    #[6,]  TRUE FALSE FALSE FALSE FALSE
    

    Note that the columns are the elements of a and the rows are the elements of b.

    Then, apply the function sum to each row (in R, TRUE counts as 1 and FALSE as 0):

    apply(sapply(a, function(x) grepl(x,b)),1,sum)
    #[1] 0 0 0 1 1 1
    

    Note that a row sum may be greater than 1 (if more than one element of a matches), so the sums are compared with 0 to get a match/no-match indicator, using the wrapper from the first call:

    +() > 0
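
The coercions used in this step can be checked on a small made-up vector. In the sketch below, `counts` is a hypothetical stand-in for the row sums and is not data from the question.

```r
# Hypothetical row sums: no match, one match, two matches
counts <- c(0, 1, 2)

# Logical values count as 1 (TRUE) and 0 (FALSE) in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))

# Comparing with 0 already yields a logical vector
any_match <- counts > 0

# Unary plus turns the logical vector into integers 0/1
as_int <- +(counts > 0)

# Negation treats 0 as FALSE and any non-zero value as TRUE
no_match <- !as_int
```

Since `counts > 0` is already logical, the unary `+` looks optional here; negating either form gives the same index.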
    

    With this, we can subset ([) b, but since we want the elements that do not match, we negate the index with the operator !.

    #full code:
    step.one <- sapply(a, function(x) grepl(x,b))
    step.two <- apply(step.one,1,sum)
    step.three <- +(step.two > 0)
    step.four <- !step.three
    #finally:
    b[step.four]
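
The same steps can probably be written more compactly with `rowSums`. This is a sketch, not the answerer's code: `fixed = TRUE` is an assumption that the elements of a should be matched as literal substrings, and `rowSums` needs a matrix, so it assumes b has more than one element.

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# One column per pattern in a, one row per element of b
m <- sapply(a, function(p) grepl(p, b, fixed = TRUE))

# TRUE for every element of b matched by at least one pattern
hit <- rowSums(m) > 0

# Keep the elements of b that no pattern matched
result <- b[!hit]
print(result)
```

For the vectors from the question this keeps "zyc", "ulk" and "mae".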
    

    As David shows in the comments, this is a much more elegant approach:

    b[-which(sapply(a, grepl, b), arr.ind = TRUE)[, "row"]]
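
One thing to watch with the negative-index form: when nothing matches, `which()` returns no rows, and indexing with an empty negative index drops everything instead of keeping everything. The sketch below illustrates this with made-up patterns (`a_none` is hypothetical) and shows one possible guard based on `%in%`.

```r
b <- c("zyc", "ulk", "mae")
a_none <- c("qq", "xx")   # hypothetical patterns that match nothing in b

m <- sapply(a_none, grepl, b)
rows <- which(m, arr.ind = TRUE)[, "row"]   # empty: there are no matches

# Negative indexing with an empty index returns an empty vector
dropped_all <- b[-rows]

# A guard: keep the positions that are not among the matched rows
kept <- b[!(seq_along(b) %in% rows)]
print(kept)
```

The `%in%` form gives the same result as the negative index whenever there is at least one match, so it can be used as a drop-in replacement.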
    
  • 2020-12-11 09:31

    And another solution using a simple for loop:

    sel <- rep(FALSE, length(b))
    for (i in seq_along(a)) {
      sel <- sel | grepl(a[i], b, fixed = TRUE)
    }
    b[!sel]
    

    Not as elegant as some of the other solutions (especially the one by akrun), but it shows that a for loop isn't always as slow in R as people believe:

    fun1 <- function(a, b) {
      sel <- rep(FALSE, length(b))
      for (i in seq_along(a)) {
        sel <- sel | grepl(a[i], b, fixed = TRUE)
      }
      b[!sel]
    }
    
    fun2 <- function(a, b) {
      b[!apply(sapply(a, function(x) grepl(x,b, fixed=TRUE)),1,sum)]
    }
    
    fun3 <- function(a, b) {
      b[-which(sapply(a, grepl, b, fixed=TRUE), arr.ind = TRUE)[, "row"]]
    }
    
    fun4 <- function(a, b) {
      b[!grepl(paste(a, collapse="|"), b)]
    }
    
    library(stringr)
    fun5 <- function(a, b) {
      b[!sapply(b, function(u) any(str_detect(u,a)))]
    }
    
    a <- c("da", "ba", "cs", "dd", "ek")
    b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
    b <- rep(b, length.out = 1E3)
    
    library(microbenchmark)
    microbenchmark(fun1(a, b), fun2(a, b), fun3(a,b), fun4(a,b), fun5(a,b))
    
    
    # Unit: microseconds
    #       expr       min        lq       mean    median         uq        max neval  cld
    # fun1(a, b)   389.630   399.128   408.6146   406.007   411.7690    540.969   100 a   
    # fun2(a, b)  5274.143  5445.038  6183.3945  5544.522  5762.1750  35830.143   100   c 
    # fun3(a, b)  2568.734  2629.494  2691.8360  2686.552  2729.0840   2956.618   100  b  
    # fun4(a, b)   482.585   511.917   530.0885   528.993   541.6685    779.679   100 a   
    # fun5(a, b) 53846.970 54293.798 56337.6531 54861.585 55184.3100 132921.883   100    d
    
  • 2020-12-11 09:32

    We can paste the elements of 'a' into a single string with | as the delimiter, use that as the pattern in grepl, and negate (!) the result to subset 'b'.

     b[!grepl(paste(a, collapse="|"), b)]
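
The pasted pattern is interpreted as a regular expression, so elements of a that contain characters such as `.` or `+` would match more than their literal text. If literal substrings are what is wanted, one option is to match each element with `fixed = TRUE` and combine the results with `Reduce`. This is a sketch under that assumption; `a2` and `b2` are made-up inputs, not from the question.

```r
a2 <- c("a.c", "d+")               # hypothetical patterns with regex metacharacters
b2 <- c("abc", "a.c", "dd", "d+x")

# Regex alternation: "." and "+" act as metacharacters, so everything matches
regex_keep <- b2[!grepl(paste(a2, collapse = "|"), b2)]

# Literal matching: one logical vector per pattern, combined with elementwise OR
hits <- lapply(a2, grepl, x = b2, fixed = TRUE)
hit <- Reduce("|", hits, rep(FALSE, length(b2)))
fixed_keep <- b2[!hit]
print(fixed_keep)
```

Starting `Reduce` from an all-FALSE vector keeps the result well defined even when the pattern vector is empty.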
    