Filter where there are at least two pattern matches

只愿长相守 提交于 2019-12-06 13:01:21

问题


I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.

This is further complicated by the fact that some of the patterns already are an either/or, for example something like "paul|john".

I think I either want an expression that would mean directly to subset on that basis, or alternatively if I could count the number of times the patterns occur I could then use that as a tool to subset. I've seen ways to count the number of times patterns occur but not where the info is clearly linked to the IDs in the original dataset, if that makes sense.

At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!

Example data

text_table <- data.table(ID = (1:5), text = c("lucy, sarah and paul live on the same street",
                                              "lucy has only moved here recently",
                                              "lucy and sarah are cousins",
                                              "john is also new to the area",
                                              "paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))

With the example data, I would want IDs 1 and 3 in the subsetted data.

Thanks for your help!


回答1:


We can paste the 'text_patterns' with the |, use that as pattern in 'str_count' to get the count of matching substring, and check if it is greater than 1 to filter the rows of the data.table

library(data.table)
text_table[str_count(text, paste(text_patterns, collapse="|")) >1]
#    ID                                            text
#1:  1    lucy, sarah and paul live on the same street
#2:  3                      lucy and sarah are cousins
#3:  5 paul and john have known each other a long time

Update

If we need to consider each 'text_pattern' as a fixed pattern, we loop through the patterns, check whether the pattern is present (str_detect) and get the sum of all the patterns with + to create the logical vector for subsetting rows

i1 <- text_table[, Reduce(`+`, lapply(text_patterns, 
       function(x) str_detect(text, x))) >1]
text_table[i1]
#    ID                                         text
#1:  1 lucy, sarah and paul live on the same street
#2:  3                   lucy and sarah are cousins


来源:https://stackoverflow.com/questions/55655507/filter-where-there-are-at-least-two-pattern-matches

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!