R Subset Dataset Using Regular Expression

Submitted by 对着背影说爱祢 on 2019-12-31 01:52:09

Question


Is there a way to make the R code below run faster (i.e. vectorized, so that it avoids for loops)?

My example contains two data frames. The first has dimension n1*p, and one of its p columns contains names. The second is a single column (n2*1) that also contains names. I want to keep every row of the first data frame whose name contains any of the names in the second data frame. Sorry for the clumsy explanation.

Example (Data frame 1):

x        y 
Doggy    1 
Hello    2 
Hi Dog   3 
Zebra    4 

Example (Data frame 2)

z
Hello
Dog

So in the above example I want to keep rows 1, 2 and 3 but NOT row 4: "Dog" appears in "Doggy" and "Hi Dog", and "Hello" appears in "Hello". Row 4 is excluded because neither "Hello" nor "Dog" appears in "Zebra".

Below is my R code to do this. It runs fine, but in my real task data frame 1 has 1 million rows and data frame 2 has 50 items to match on, so it runs pretty slowly. Any suggestions on how to speed this up are appreciated.

x <- c("Doggy", "Hello", "Hi Dog", "Zebra")
y <- 1:4
# data.frame() keeps y numeric; cbind() would coerce both columns to character
dat <- data.frame(x, y, stringsAsFactors = FALSE)

z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# flag[i] becomes 1 as soon as any pattern in z matches dat$x[i]
dat$flag <- NA
for (j in seq_along(z$z)) {
  for (i in seq_len(nrow(dat))) {
    # Only test rows that have not matched an earlier pattern
    if (is.na(dat$flag[i]) || dat$flag[i] == 0) {
      dat$flag[i] <- length(grep(z$z[j], dat$x[i], perl = TRUE))
    }
  }
}

dat1 <- subset(dat, flag == 1)
dat1

Answer 1:


Try this:

dat[grep(paste(z$z, collapse = "|"), dat$x), ]

or

subset(dat, grepl(paste(z$z, collapse = "|"), x))
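To make it clearer why this works, here is a small self-contained sketch using the example data from the question. paste(..., collapse = "|") joins the names into one alternation pattern, and a single vectorized grepl() call then tests every row at once. The last part is an assumption on my side rather than something the question asks for: if the names may contain regex metacharacters such as "." or "(", matching each name literally with fixed = TRUE and combining the results with Reduce() avoids misinterpreting them.

```r
# Example data from the question
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4,
                  stringsAsFactors = FALSE)
z <- data.frame(z = c("Hello", "Dog"), stringsAsFactors = FALSE)

# Join all names into one regular expression: "Hello|Dog"
pattern <- paste(z$z, collapse = "|")

# One vectorized call over all rows instead of a double loop
keep <- grepl(pattern, dat$x)
res <- dat[keep, ]

# Assumed variant: treat each name as a literal string (no regex),
# then OR the per-name logical vectors together
keep_fixed <- Reduce(`|`, lapply(z$z, function(p) grepl(p, dat$x, fixed = TRUE)))
res_fixed <- dat[keep_fixed, ]
```

Both variants keep rows 1 to 3 here. With 50 names, the literal variant makes 50 passes over the column, which is still far fewer calls than one per row and name.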



Answer 2:


This question inspired a boolean text search function (%bs%) in the qdap package, so I thought I'd share that approach here:

library(qdap)
dat[dat$x %bs% paste(z$z, collapse = "OR"), ]

In this case it is no less typing, but it may be a useful approach when several OR/AND conditions are involved.
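For comparison, combined AND/OR conditions can also be written in base R by combining several grepl() results with & and |. This is only a sketch of that idea on the example data, not a description of how %bs% works internally; the search terms "Hi" and "Dog" are chosen purely for illustration.

```r
# Example data from the question
dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4,
                  stringsAsFactors = FALSE)

# One logical vector per search term (literal matching)
has_hi  <- grepl("Hi",  dat$x, fixed = TRUE)
has_dog <- grepl("Dog", dat$x, fixed = TRUE)

both   <- dat[has_hi & has_dog, ]  # AND: rows containing "Hi" and "Dog"
either <- dat[has_hi | has_dog, ]  # OR: rows containing "Hi" or "Dog"
```

Each term costs one pass over the column, so this stays vectorized however the conditions are combined.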



Source: https://stackoverflow.com/questions/19640562/r-subset-dataset-using-regular-expression
