Filter where there are at least two pattern matches

Question

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.

This is further complicated by the fact that some of the patterns already are an either/or, for example something like "paul|john".

I think I either want an expression that would mean directly to subset on that basis, or alternatively if I could count the number of times the patterns occur I could then use that as a tool to subset. I've seen ways to count the number of times patterns occur but not where the info is clearly linked to the IDs in the original dataset, if that makes sense.

At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!

Example data

library(data.table)
text_table <- data.table(ID = (1:5), text = c("lucy, sarah and paul live on the same street",
                                              "lucy has only moved here recently",
                                              "lucy and sarah are cousins",
                                              "john is also new to the area",
                                              "paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))

With the example data, I would want IDs 1 and 3 in the subsetted data.

Thanks for your help!

Answer 1:

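We can paste the 'text_patterns' together with |, use the result as the pattern in 'str_count' (from the stringr package) to count the matching substrings in each row, and keep the rows of the data.table where that count is greater than 1. Note that this counts every match, so two hits from the same alternation are counted separately: row 5 is returned because both "paul" and "john" match.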

library(data.table)
library(stringr)  # provides str_count() and str_detect()
text_table[str_count(text, paste(text_patterns, collapse="|")) >1]
#    ID                                            text
#1:  1    lucy, sarah and paul live on the same street
#2:  3                      lucy and sarah are cousins
#3:  5 paul and john have known each other a long time
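To see why ID 5 is included, it can help to look at the per-row counts next to the IDs. The following is a minimal sketch that rebuilds the example data from the question; 'combined', 'counts' and the column name 'n_total' are names made up here for illustration.

```r
library(data.table)
library(stringr)

# Example data from the question
text_table <- data.table(ID = 1:5, text = c("lucy, sarah and paul live on the same street",
                                            "lucy has only moved here recently",
                                            "lucy and sarah are cousins",
                                            "john is also new to the area",
                                            "paul and john have known each other a long time"))
text_patterns <- c("lucy", "sarah", "paul|john")

# One combined regex: "lucy|sarah|paul|john"
combined <- paste(text_patterns, collapse = "|")

# Total number of matches per row, kept alongside the ID
# (n_total is a hypothetical column name)
counts <- text_table[, .(ID, n_total = str_count(text, combined))]
print(counts)
```

Row 5 gets a count of 2 because "paul" and "john" are each counted as a separate match, even though they belong to the same pattern.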

Update

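If each element of 'text_patterns' should count at most once, however many times it matches (so that "paul|john" is a single pattern), we loop through the patterns, check whether each one is present (str_detect), and add the resulting logical vectors with + to get the number of distinct patterns matched per row. Comparing that number with 1 gives the logical vector for subsetting rows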

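# one logical vector per pattern, summed element-wise across patterns
i1 <- text_table[, Reduce(`+`, lapply(text_patterns, 
       function(x) str_detect(text, x))) >1]
# keep rows where at least two distinct patterns matched
text_table[i1]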
#    ID                                         text
#1:  1 lucy, sarah and paul live on the same street
#2:  3                   lucy and sarah are cousins

Source: https://stackoverflow.com/questions/55655507/filter-where-there-are-at-least-two-pattern-matches

Tags

data.table

subset