Looping grepl() through data.table (R)

I have a dataset stored as a data.table DT that looks like this:

print(DT)
   category            industry
1: administration      admin
2: nurse practitioner  truck
3: trucking            truck
4: administration      admin
5: warehousing         nurse
6: warehousing         admin
7: trucking            truck
8: nurse practitioner  nurse         
9: nurse practitioner  truck

I would like to reduce the table to only those rows where the industry matches the category. My general approach is to use grepl() to match the regex '^{{IND}}[a-z ]+$' against each row of DT$category, with the corresponding row of DT$industry inserted in place of {{IND}} using infuse(). I struggled to find a sleek data.table solution that would loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:

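library(data.table)
library(infuser)  # infuse() is assumed to come from the infuser package

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
    ind <- DT$industry[i]
    categ <- DT$category[i]
    # fill this row's industry into the template, then test it against this row's category
    if (grepl(infuse(template, IND = ind), categ)) {
        set(DT, i, "match", TRUE)  # update by reference
    }
}
# keep the matching rows and drop the helper column
DT <- DT[match == TRUE, .(category, industry)]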
print(DT)
   category            industry
1: administration      admin
2: trucking            truck
3: administration      admin
4: trucking            truck
5: nurse practitioner  nurse

However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.

Frank

Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:

DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]

This uses the current idiom for subsetting by group, thanks to @eddi. Note that the rows come back grouped by industry rather than in their original order.

Comments. These might help further:

If you have many rows with the same industry-category combo, try by=.(industry,category).
Try something else in the place of grep (like the options in Ken and Richard's answers).
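
A minimal sketch of the first comment, assuming the example data from the question (rebuilt inline here so the snippet runs on its own). The names idx and res are only illustrative:

```r
library(data.table)

# Same example data as in the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# One grepl() call per distinct (industry, category) pair. Inside each group
# both columns have length 1, so .I[TRUE] keeps every row of the group and
# .I[FALSE] keeps none.
idx <- DT[, .I[grepl(industry, category)], by = .(industry, category)]$V1

# Grouping returns row numbers group by group; sort() restores the original order
res <- DT[sort(idx)]
print(res)
```

Whether this beats the ungrouped vectorized options probably depends on how many duplicate pairs the real data contains, so it is worth timing on your own table.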

As long as the match is always based on the start of the category string, then this works just fine:

DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
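
If you are on R 3.3.0 or later, base R's startsWith() should express the same prefix test a little more directly. It is not part of the answer above, just a sketch under the same assumption that the match is always at the start of category:

```r
library(data.table)

# Same example data as in the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# startsWith(x, prefix) is vectorized over both arguments, so each category
# is compared against the industry in the same row
res <- DT[startsWith(category, industry)]
print(res)
```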

You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.

DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse

Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.

library(stringr)
DT[str_detect(category, fixed(industry))]

Or a base R option is to run grepl() through mapply()

DT[mapply(grepl, industry, category, fixed = TRUE)]

Or you could get the same result with Vectorize(grepl).

DT[Vectorize(grepl)(industry, category, fixed = TRUE)]

All of these produce the same result.
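
One caveat worth checking against your real data: the fixed-pattern options above match the industry anywhere in the category, while the template in the question is anchored at the start and requires at least one more character after the industry. A small base R sketch of the difference, using made-up values ("nontrucking" and a bare "truck" are hypothetical, not from the question's data):

```r
# Hypothetical values chosen to expose the difference between the matchers
category <- c("trucking", "nontrucking", "truck")
industry <- c("truck", "truck", "truck")

# Unanchored fixed match: TRUE wherever the industry occurs in the category
anywhere <- mapply(grepl, industry, category, fixed = TRUE, USE.NAMES = FALSE)

# Prefix match: TRUE only when the category starts with the industry
prefix <- startsWith(category, industry)

# The question's template: anchored, and needs at least one trailing [a-z ] character
anchored <- mapply(
  function(ind, cat) grepl(paste0("^", ind, "[a-z ]+$"), cat),
  industry, category, USE.NAMES = FALSE
)

print(data.frame(category, industry, anywhere, prefix, anchored))
```

On the example data in the question all three agree, so this only matters if such edge cases can occur.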

Data:

DT <- structure(list(category = c("administration", "nurse practitioner", 
"trucking", "administration", "warehousing", "warehousing", "trucking", 
"nurse practitioner", "nurse practitioner"), industry = c("admin", 
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
-9L))
setDT(DT)

Source: https://stackoverflow.com/questions/33699122/looping-grepl-through-data-table-r

Tags

regex

data.table

data-cleaning