问题
I am trying to extract substrings from a unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I go about passing this vector of character values to extract exact matches from unstructured text.
text.df <- data.frame(ID = c(1:5),
text = c("United States is a match", "Not a match", "Not a match",
"Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub by where I remove all non-matches and then eliminate then remove rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arugments for the regular expression correct. Any assistance would be greatly appreciated!
回答1:
1. stringr
We could first subset the 'text.df' using the 'indx' (formed from collapsing the 'countries' vector) as pattern in 'grep' and then use 'str_extract' the get the pattern elements from the 'text' column, assign that to 'text' column of the subset dataset ('text.df1')
library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
# ID text
#1 1 United States
#4 4 Israel
#5 5 Canada
2. base R
Without using any external packages, we can remove the characters other than those found in 'ind'
text.df1$text <- unlist(regmatches(text.df1$text,
gregexpr(indx, text.df1$text)))
3. stringi
We could also use the faster stri_extract from stringi
library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
# ID text1
#1 1 United States
#4 4 Israel
#5 5 Canada
回答2:
Here's an approach with data.table:
library(data.table)
##
R> data.table(text.df)[
sapply(countries, function(x) grep(x,text),USE.NAMES=F),
list(ID, text = countries)]
ID text
1: 1 United States
2: 4 Israel
3: 5 Canada
回答3:
Create the pattern, p, and use strapply to extract the match to each component of text returning NA for each unmatched component. Finally remove the NA values using na.omit. This is non-destructive (i.e. text.df is not modified):
library(gsubfn)
p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))
giving:
ID text
1 1 United States
4 4 Israel
5 5 Canada
Using dplyr it could also be written as follows (using p from above):
library(dplyr)
library(gsubfn)
text.df %>%
mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
na.omit
来源:https://stackoverflow.com/questions/29196831/substring-extraction-from-vector-in-r