Looping grepl() through data.table (R)

◇◆丶佛笑我妖孽 提交于 2019-11-29 07:03:53
Frank

Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:

DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]

This uses the current idiom for subsetting by group, thanks to @eddi .


Comments. These might help further:

  • If you have many rows with the same industry-category combo, try by=.(industry,category).

  • Try something else in the place of grep (like the options in Ken and Richard's answers).

As long as the match is always based on the start of the category string, then this works just fine:

dt[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse

You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.

DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse 

Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.

library(stringr)
DT[str_detect(category, fixed(industry))]

Or a base R option is to run grepl() through mapply()

DT[mapply(grepl, industry, category, fixed = TRUE)]

Or you could get the same result with Vectorize(grepl).

DT[Vectorize(grepl)(industry, category, fixed = TRUE)]

All of these produce the same result.

Data:

DT <- structure(list(category = c("administration", "nurse practitioner", 
"trucking", "administration", "warehousing", "warehousing", "trucking", 
"nurse practitioner", "nurse practitioner"), industry = c("admin", 
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
-9L))
setDT(DT)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!