I have a dataset stored as a data.table DT that looks like this:
print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck
I would like to reduce the table to only rows where the industry matches the category. My general approach is to use grepl() to match each row of DT$category against the regex '^{{IND}}[a-z ]+$', with the corresponding row of DT$industry inserted in place of {{IND}} using infuse(). I struggled to find a sleek data.table solution that would properly loop through the table and make within-row comparisons, so I resorted to a for-loop to get the job done:
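library(data.table)
library(infuser)  # provides infuse()
template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
  ind <- DT$industry[i]
  categ <- DT$category[i]
  # Fill this row's industry into the template, then test it against the category
  if (grepl(infuse(template, IND = ind), categ)) {
    DT[i, match := TRUE]
  }
}
DT <- DT[match == TRUE]
print(DT)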
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse
However, I am sure this can be done in a better way. Any suggestions for how I could achieve this result by utilizing the data.table package's functionality? It's my understanding that, in this context, an approach that uses the package would likely be more efficient than a for-loop.
Data.table is good at grouped operations; I think that's how it can help, assuming you have many rows with the same industry:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
This uses the current idiom for subsetting by group, thanks to @eddi.
Comments. These might help further:
If you have many rows with the same industry-category combo, try by=.(industry,category).
Try something else in place of grep (like the options in Ken and Richard's answers).
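To make the by=.(industry,category) suggestion concrete, here is a minimal sketch on the question's data. One assumption to flag: it swaps grep() for grepl(). With category also a grouping column, both values are length 1 inside each group, so grep() would return position 1 and pick only the first row of a matching group, whereas a single TRUE from grepl() keeps every row number in .I. The names idx and res are just illustrative.

```r
library(data.table)

# Toy data copied from the question
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing",
               "trucking", "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# One regex call per distinct (industry, category) pair. Inside each group
# both columns are length 1, so grepl() yields a single TRUE/FALSE that
# keeps or drops all of that group's row numbers in .I.
idx <- DT[, .I[grepl(industry, category)], by = .(industry, category)]$V1

# Groups come back in order of first appearance, so sort() is used here
# to restore the original row order before subsetting.
res <- DT[sort(idx)]
print(res)
```

Whether this beats a plain vectorized comparison probably depends on how many duplicate pairs the real data has, so it is worth timing both.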
As long as the match is always based on the start of the category string, then this works just fine:
DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
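As a side note, base R (3.3.0 and later) also has startsWith(), which does the same prefix comparison element by element. This is not part of the answer above, just a small sketch on plain vectors (the vectors here are a shortened, illustrative subset of the question's data) showing that the two agree:

```r
# Shortened illustrative vectors based on the question's data
category <- c("administration", "nurse practitioner", "trucking", "warehousing")
industry <- c("admin", "truck", "truck", "nurse")

# startsWith() pairs category[i] with industry[i]
keep <- startsWith(category, industry)

# The substring() comparison from the answer gives the same logical vector
keep_substring <- substring(category, 1, nchar(industry)) == industry

print(keep)
print(identical(keep, keep_substring))
```

Inside a data.table the same idea would presumably read DT[startsWith(category, industry)].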
You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.
DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.
library(stringr)
DT[str_detect(category, fixed(industry))]
Or a base R option is to run grepl() through mapply():
DT[mapply(grepl, industry, category, fixed = TRUE)]
Or you could get the same result with Vectorize(grepl).
DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
All of these produce the same result.
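For anyone wondering why a wrapper is needed at all: grepl() is vectorized over the strings but not over the pattern, so passing the whole industry column as the pattern only uses its first element. A small sketch on plain vectors (an illustrative subset of the data, with made-up variable names) shows the difference:

```r
# Illustrative subset of the question's data
category <- c("administration", "nurse practitioner", "trucking")
industry <- c("admin", "truck", "truck")

# grepl() uses only the first pattern ("admin") for every string and
# emits a warning about it, which is suppressed here.
first_only <- suppressWarnings(grepl(industry, category, fixed = TRUE))

# mapply() pairs pattern i with string i, which is the within-row
# comparison the question asks for. unname() drops the names mapply adds.
pairwise <- unname(mapply(grepl, industry, category, fixed = TRUE))

print(first_only)  # TRUE FALSE FALSE
print(pairwise)    # TRUE FALSE TRUE
```

The third row is the telling one: "trucking" does contain "truck", but the unwrapped call never tests that pair.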
Data:
DT <- structure(list(category = c("administration", "nurse practitioner",
"trucking", "administration", "warehousing", "warehousing", "trucking",
"nurse practitioner", "nurse practitioner"), industry = c("admin",
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse",
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA,
-9L))
setDT(DT)
Source: https://stackoverflow.com/questions/33699122/looping-grepl-through-data-table-r