Difference of two character vectors with substring

前端未结


I have two lists:

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

I want


3 answers

滥情空心

2020-12-11 09:25
You could try the following:
```
b[!(+(apply(sapply(a, function(x) grepl(x,b)),1,sum)) > 0)]
[1] "zyc" "ulk" "mae"
```
'Peeling' this call from the inside out gives the following results. First, obtain a matrix of matches from the grepl call (with sapply):
```
sapply(a, function(x) grepl(x,b))
#        da    ba    cs    dd    ek
#[1,] FALSE FALSE FALSE FALSE FALSE
#[2,] FALSE FALSE FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE  TRUE FALSE FALSE
#[5,] FALSE FALSE FALSE  TRUE FALSE
#[6,]  TRUE FALSE FALSE FALSE FALSE
```
Note that the columns are the elements of a and the rows are the elements of b.

Then, apply the function sum over the rows (in R, TRUE counts as 1 and FALSE as 0):
```
apply(sapply(a, function(x) grepl(x,b)),1,sum)
#[1] 0 0 0 1 1 1
```
Note that a row sum can be greater than 1 (if more than one pattern matches), so the sums are converted to a logical vector by wrapping the previous call as follows (the comparison > 0 does the conversion; the unary + leaves the numbers unchanged):
```
+() > 0
```
With this, we can index ([) the elements of b that match, but since we want the opposite, we negate it with the operator !.
```
#full code:
step.one <- sapply(a, function(x) grepl(x,b))
step.two <- apply(step.one,1,sum)
step.three <- +(step.two > 0)
step.four <- !step.three
#finally:
b[step.four]
```
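The four steps above can also be sketched more compactly. This is only an illustrative variant, assuming base R's rowSums as a stand-in for apply(..., 1, sum); the names hits and matched are made up for this example and are not part of the original answer.

```r
a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# One column per pattern in a, one row per element of b
hits <- sapply(a, function(x) grepl(x, b))

# A row sum above 0 means at least one pattern matched that element of b
matched <- rowSums(hits) > 0

# Keep only the elements of b that matched nothing
result <- b[!matched]
print(result)
```

Because `rowSums(hits) > 0` is already a logical vector, it can be negated and used as an index directly.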
As David shows in the comments, this is a much more elegant approach:
```
b[-which(sapply(a, grepl, b), arr.ind = TRUE)[, "row"]]
```
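One thing to watch with the negative-index form: if no pattern matches anything, which() returns an empty index, and `b[-integer(0)]` selects nothing rather than everything. A minimal sketch of the issue and one possible workaround, using setdiff on the positions (the names a_none, hit_rows, unsafe and safe are hypothetical, chosen for this example):

```r
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")

# Patterns that match no element of b
a_none <- c("qq", "xx")

hit_rows <- which(sapply(a_none, grepl, b), arr.ind = TRUE)[, "row"]

# Negative indexing with an empty index returns an empty vector
unsafe <- b[-hit_rows]

# Removing the matched positions from the full set of positions keeps all of b
safe <- b[setdiff(seq_along(b), hit_rows)]

print(length(unsafe))
print(safe)
```

The logical-index solutions in this thread (`b[!...]`) do not have this problem.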

清歌不尽

2020-12-11 09:31

And another solution using a simple for loop:

sel <- rep(FALSE, length(b))
for (i in seq_along(a)) {
  sel <- sel | grepl(a[i], b, fixed = TRUE)
}
b[!sel]

Not as elegant as some of the other solutions (especially the one by akrun), but it shows that a for loop isn't always as slow in R as people believe:

fun1 <- function(a, b) {
  sel <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    sel <- sel | grepl(a[i], b, fixed = TRUE)
  }
  b[!sel]
}

fun2 <- function(a, b) {
  b[!apply(sapply(a, function(x) grepl(x,b, fixed=TRUE)),1,sum)]
}

fun3 <- function(a, b) {
  b[-which(sapply(a, grepl, b, fixed=TRUE), arr.ind = TRUE)[, "row"]]
}

fun4 <- function(a, b) {
  b[!grepl(paste(a, collapse="|"), b)]
}

library(stringr)
fun5 <- function(a, b) {
  b[!sapply(b, function(u) any(str_detect(u,a)))]
}

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
b <- rep(b, length.out = 1E3)

library(microbenchmark)
microbenchmark(fun1(a, b), fun2(a, b), fun3(a,b), fun4(a,b), fun5(a,b))


# Unit: microseconds
#       expr       min        lq       mean    median         uq        max neval  cld
# fun1(a, b)   389.630   399.128   408.6146   406.007   411.7690    540.969   100 a   
# fun2(a, b)  5274.143  5445.038  6183.3945  5544.522  5762.1750  35830.143   100   c 
# fun3(a, b)  2568.734  2629.494  2691.8360  2686.552  2729.0840   2956.618   100  b  
# fun4(a, b)   482.585   511.917   530.0885   528.993   541.6685    779.679   100 a   
# fun5(a, b) 53846.970 54293.798 56337.6531 54861.585 55184.3100 132921.883   100    d
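Timings are only comparable if the functions return the same thing, so it may be worth checking that first. A small sketch using fun1 and fun4 as defined above, repeated here so the snippet runs on its own (the names res1 and res4 are made up for this example):

```r
fun1 <- function(a, b) {
  sel <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    sel <- sel | grepl(a[i], b, fixed = TRUE)
  }
  b[!sel]
}

fun4 <- function(a, b) {
  b[!grepl(paste(a, collapse = "|"), b)]
}

a <- c("da", "ba", "cs", "dd", "ek")
b <- c("zyc", "ulk", "mae", "csh", "ddi", "dada")
b <- rep(b, length.out = 1E3)

res1 <- fun1(a, b)
res4 <- fun4(a, b)

# TRUE when both approaches keep exactly the same elements in the same order
print(identical(res1, res4))
```

Note that fun1 matches literally (fixed = TRUE) while fun4 treats the patterns as regular expressions, so the two can disagree once a contains regex metacharacters; for the inputs used here they agree.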


南方客

2020-12-11 09:32
We can paste the 'a' elements to a single string with | as the delimiter, use that as pattern in grepl, negate (!) to subset 'b'.
```
 b[!grepl(paste(a, collapse="|"), b)]
```
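Since the pasted string is interpreted as a regular expression, elements of 'a' that contain metacharacters such as . or + will not be matched literally. One possible way around this, sketched here under the assumption that no element of 'a' contains the sequence \E, is to wrap each element in \Q...\E and pass perl = TRUE (the example vectors and the names plain, pattern and literal are made up for illustration):

```r
a <- c("a.c", "x+")
b <- c("abc", "a.c", "xx", "zzz")

# Plain alternation: "." and "+" act as regex operators
plain <- b[!grepl(paste(a, collapse = "|"), b)]

# \Q...\E quotes each element so it is matched literally (PCRE only)
pattern <- paste0("\\Q", a, "\\E", collapse = "|")
literal <- b[!grepl(pattern, b, perl = TRUE)]

print(plain)
print(literal)
```

With literal matching, only the element that actually contains the text "a.c" is dropped.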