Looping grepl() through data.table (R)

梦毁少年i 2020-12-18 01:53

I have a dataset stored as a data.table DT that looks like this:

print(DT)
   category            industry
1: administration      admin
2: nurse         

3 answers
  • 2020-12-18 02:06

    As long as industry always matches at the start of the category string, a vectorized prefix comparison works:

    DT[substring(category, 1, nchar(industry)) == industry]
    #              category industry
    # 1:     administration    admin
    # 2:           trucking    truck
    # 3:     administration    admin
    # 4:           trucking    truck
    # 5: nurse practitioner    nurse
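
To make the prefix assumption concrete, here is a minimal base R sketch. The vectors are hypothetical and chosen only for illustration (they are not the question's data); they show where a prefix comparison and a match anywhere in the string disagree. `startsWith()` is used as an equivalent base R spelling of the same comparison.

```r
# Hypothetical vectors for illustration; not taken from the question's data.
category <- c("administration", "nurse practitioner", "head nurse")
industry <- c("admin", "truck", "nurse")

# Prefix comparison: the first nchar(industry) characters of category
# must equal industry, element by element.
prefix_match <- substring(category, 1, nchar(industry)) == industry

# startsWith() performs the same element-wise prefix test.
starts_match <- startsWith(category, industry)

# For contrast: a fixed-string match anywhere in category.
anywhere_match <- mapply(grepl, industry, category,
                         fixed = TRUE, USE.NAMES = FALSE)

print(data.frame(category, industry, prefix_match, anywhere_match))
```

In this sketch "head nurse" is matched by the anywhere test but not by the prefix test, so the substring approach is only appropriate when a match at the start of the string is what you want.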
    
  • 2020-12-18 02:24

    data.table is good at grouped operations; I think that is how it can help here, assuming many rows share the same industry:

    DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
    

    This uses the current idiom for subsetting by group, thanks to @eddi.


    Comments. These might help further:

    • If you have many rows with the same industry-category combo, try by=.(industry,category), switching grep to grepl so that every row of a matching group is kept.

    • Try something else in the place of grep (like the options in Ken and Richard's answers).
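
The grouped idiom and the first comment can be sketched as follows, assuming the data.table package is installed. The data is copied from the Data block in the last answer; `idx` and `idx_combo` are illustrative names. Note that when grouping by both columns, `grepl()` is used instead of `grep()`: each group then has a single pattern and a single string, and a length-one logical index keeps or drops the whole group, whereas `grep()` would return position 1 and keep only the first row.

```r
library(data.table)

DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking",
               "administration", "warehousing", "warehousing", "trucking",
               "nurse practitioner", "nurse practitioner"),
  industry = c("admin", "truck", "truck", "admin", "nurse",
               "admin", "truck", "nurse", "truck")
)

# Inside each group, `industry` is a single value, so grep() is called once
# per industry; .I maps the group-local positions back to row numbers of DT.
idx <- DT[, .I[grep(industry, category)], by = industry]$V1

# Variant grouped by the industry-category combination. grepl() returns one
# TRUE/FALSE per group, which is recycled over .I to keep or drop the group.
idx_combo <- DT[, .I[grepl(industry, category)],
                by = .(industry, category)]$V1

print(DT[idx])
```

Because the row numbers are collected group by group, the result comes back ordered by group (in order of first appearance) rather than in the original row order; use `DT[sort(idx)]` if the original order matters.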

  • 2020-12-18 02:28

    You could use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.

    DT[stringi::stri_detect_fixed(category, industry)]
    #              category industry
    # 1:     administration    admin
    # 2:           trucking    truck
    # 3:     administration    admin
    # 4:           trucking    truck
    # 5: nurse practitioner    nurse 
    

    Alternatively, stringr::str_detect() can be used. It is also vectorized over both its arguments.

    library(stringr)
    DT[str_detect(category, fixed(industry))]
    

    A base R option is to run grepl() through mapply():

    DT[mapply(grepl, industry, category, fixed = TRUE)]
    

    Or you could get the same result with Vectorize(grepl).

    DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
    

    All of these produce the same result.
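
One way to check that claim for the base R variants, without any extra packages, is to compare them against an explicit element-by-element loop. This is a sketch on plain vectors copied from the Data block; `via_loop` is an illustrative name for the reference result.

```r
category <- c("administration", "nurse practitioner", "trucking",
              "administration", "warehousing", "warehousing", "trucking",
              "nurse practitioner", "nurse practitioner")
industry <- c("admin", "truck", "truck", "admin", "nurse",
              "admin", "truck", "nurse", "truck")

# Reference: test industry[i] against category[i], one pair at a time.
via_loop <- vapply(
  seq_along(industry),
  function(i) grepl(industry[i], category[i], fixed = TRUE),
  logical(1)
)

# The two base R one-liners from the answer.
via_mapply    <- mapply(grepl, industry, category, fixed = TRUE)
via_vectorize <- Vectorize(grepl)(industry, category, fixed = TRUE)

# mapply() and Vectorize() name their results after the patterns,
# so drop the names before comparing.
all_equal <- identical(unname(via_mapply), via_loop) &&
  identical(unname(via_vectorize), via_loop)
print(all_equal)
print(which(via_loop))
```

The names attached by `mapply()` and `Vectorize()` do not affect subsetting: a named logical vector inside `DT[...]` selects rows the same way as an unnamed one.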

    Data:

    DT <- structure(list(category = c("administration", "nurse practitioner", 
    "trucking", "administration", "warehousing", "warehousing", "trucking", 
    "nurse practitioner", "nurse practitioner"), industry = c("admin", 
    "truck", "truck", "admin", "nurse", "admin", "truck", "nurse", 
    "truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA, 
    -9L))
    setDT(DT)
    