Substring extraction from vector in R

Question

I am trying to extract substrings from unstructured text. For example, assume a vector of country names:

countries <- c("United States", "Israel", "Canada")

How do I pass this vector of character values to extract exact matches from unstructured text?

text.df <- data.frame(ID = c(1:5), 
text = c("United States is a match", "Not a match", "Not a match",
         "Israel is a match", "Canada is a match"))

In this example, the desired output would be:

ID     text
1      United States
4      Israel
5      Canada

So far I have been working with gsub, removing all non-matches and then dropping the rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arguments for the regular expression correct. Any assistance would be greatly appreciated!

Answer 1:

1. stringr

We could first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matched elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1').

library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
#  ID          text
#1  1 United States
#4  4        Israel
#5  5        Canada
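The question asks for exact matches, and a plain "|" alternation also matches a country name embedded in a longer word. The following is a minimal base R sketch, assuming "exact" means whole-word matches; the sample vector 'txt' and the names 'loose' and 'strict' are illustrative and not part of the original answer. The same 'strict' pattern should also be usable as the pattern argument of 'str_extract'.

```r
countries <- c("United States", "Israel", "Canada")
# Hypothetical sample text: the second element contains "Israel" only inside "Israeli"
txt <- c("Israel is a match", "Israeli food is not", "Canada is a match")

# Plain alternation: also matches "Israel" inside "Israeli"
loose <- paste(countries, collapse = "|")

# Wrapped in \b...\b so that only whole-word occurrences match
strict <- paste0("\\b(", loose, ")\\b")

loose_hits <- grepl(loose, txt, perl = TRUE)
strict_hits <- grepl(strict, txt, perl = TRUE)

print(loose_hits)
print(strict_hits)
```

Note that this does not escape regex metacharacters; if a name in 'countries' contained characters such as "." or "(", they would need escaping before being pasted into the pattern.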

2. base R

Without using any external packages, we can extract the parts of the 'text' column of 'text.df1' (the subset created above) that match 'indx', using 'regmatches' and 'gregexpr':

text.df1$text <- unlist(regmatches(text.df1$text, 
                           gregexpr(indx, text.df1$text)))
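As a self-contained variant of the base R approach, the sketch below works directly on 'text.df' without the earlier subsetting step. It swaps 'gregexpr' for 'regexpr', which is an assumption on my part that only the first match per row is wanted; it also keeps the result the same length as the number of matching rows. The names 'm', 'hits', and 'result' are illustrative.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps 'text' as character
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# regexpr() gives the position of the first match per element, or -1 if none
m <- regexpr(indx, text.df$text)
hits <- m != -1

# regmatches() returns the matched substrings for the matching elements only,
# so it lines up with the IDs selected by 'hits'
result <- data.frame(ID = text.df$ID[hits],
                     text = regmatches(text.df$text, m),
                     stringsAsFactors = FALSE)
print(result)
```

If a row can mention several countries and all of them are needed, 'gregexpr' is the right tool, but the result is then a list with one element per row and can no longer be assigned straight into a single column.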

3. stringi

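We could also use the faster stri_extract from stringi. Here 'within' adds the extracted match as a new column 'text1', 'na.omit' drops the rows without a match, and '[-2]' removes the original 'text' column: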

library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
#  ID         text1
#1  1 United States
#4  4        Israel
#5  5        Canada

Answer 2:

Here's an approach with data.table:

library(data.table)
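# For each country, grep() finds the row whose text contains it. This assumes
# each country matches exactly one row, so the rows line up with 'countries'.
data.table(text.df)[
    sapply(countries, function(x) grep(x, text), USE.NAMES = FALSE),
    list(ID, text = countries)]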
   ID          text
1:  1 United States
2:  4        Israel
3:  5        Canada

Answer 3:

Create the pattern, p, and use strapply to extract the match from each component of text, returning NA for each unmatched component. Finally, remove the NA values using na.omit. This is non-destructive (i.e. text.df is not modified):

library(gsubfn)

p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))

giving:

  ID          text
1  1 United States
4  4        Israel
5  5        Canada

Using dplyr it could also be written as follows (using p from above):

library(dplyr)
library(gsubfn)

text.df %>% 
  mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
  na.omit

Source: https://stackoverflow.com/questions/29196831/substring-extraction-from-vector-in-r

Tags

regex

stringr