Similar to this case, I would like to count the number of occurrences of multiple words and numbers in a vector of sentences, using str_count from the stringr package.
But I noticed that not only whole numbers are counted but also partial numbers. For example:
df <- c("honda civic 1988 with new lights","toyota auris 4x4 140000 km","nissan skyline 2.0 159000 km")
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
library(stringr)
number_of_keywords_df <- str_count(df, paste(keywords, collapse='|'))
Here I receive 3, 3, 3 for number_of_keywords_df, while it should clearly be 3, 2, 2. The str_count function seems to count the partial strings "1400" and "159" inside the numbers "140000" and "159000". Is there any way to prevent that?
Using sprintf you can add word boundaries:
number_of_keywords_df <- str_count(df, paste(sprintf("\\b%s\\b", keywords), collapse = '|'))
number_of_keywords_df
Which yields
[1] 3 2 2
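As a variation on the same idea, here is a minimal sketch that puts a single pair of word boundaries around a non-capturing group instead of wrapping every keyword separately. The names grouped_pattern and grouped_counts are only illustrative, and the "2.0" keyword at the end is a hypothetical addition (it is not in the original keyword list) to show that regex metacharacters in a keyword need escaping.

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# One pair of word boundaries around a non-capturing group of all keywords.
grouped_pattern <- sprintf("\\b(?:%s)\\b", paste(keywords, collapse = "|"))
grouped_counts <- str_count(df, grouped_pattern)
grouped_counts

# Hypothetical keyword "2.0": an unescaped "." matches any character,
# so "210" is counted too unless the dot is escaped by hand.
dot_unescaped <- str_count("engine 210 hp", "\\b2.0\\b")
dot_escaped   <- str_count("engine 210 hp", "\\b2\\.0\\b")
```

The grouped form should give the same counts as the per-keyword version; it just keeps the pattern shorter when the keyword list is long.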
Try putting word boundaries around your keywords:
keywords <- c("honda","civic","toyota","auris","nissan","skyline","1988","1400","159")
keywords <- paste0("\\b", keywords, "\\b")
In regex lingo, \bhonda\b says to match the isolated word honda. Hence hondas would not match because it has an extra letter at the end.
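To round this off, the bounded keywords still have to be collapsed into one alternation and passed to str_count, as in the question. A minimal sketch, assuming the same df and keywords as above (pattern and counts are just illustrative names):

```r
library(stringr)

df <- c("honda civic 1988 with new lights",
        "toyota auris 4x4 140000 km",
        "nissan skyline 2.0 159000 km")
keywords <- c("honda", "civic", "toyota", "auris", "nissan", "skyline",
              "1988", "1400", "159")

# Wrap each keyword in word boundaries, then join them with "|".
pattern <- paste0("\\b", keywords, "\\b", collapse = "|")

counts <- str_count(df, pattern)
counts

# "hondas" is not the isolated word "honda", so only the last word matches.
honda_hits <- str_count("two hondas and one honda", "\\bhonda\\b")
honda_hits
```

With the boundaries in place, "1400" and "159" no longer match inside "140000" and "159000", because the character that follows them is another digit rather than a word boundary.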
Source: https://stackoverflow.com/questions/49257263/counting-whole-word-number-occurrences-with-str-count-in-r