I am trying to extract substrings from unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I pass this vector of character values to extract exact matches from unstructured text?
text.df <- data.frame(ID = c(1:5),
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub, removing all non-matches and then dropping the rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arguments for the regular expression correct. Any assistance would be greatly appreciated!
1. stringr
We could first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matching elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1').
library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
# ID text
#1 1 United States
#4 4 Israel
#5 5 Canada
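Because 'indx' is used as a regular expression, a name containing a metacharacter (a dot or a parenthesis, say), or a name that is a prefix of a longer word, could match more than intended. Here is a minimal sketch of a stricter pattern, assuming PCRE via perl = TRUE in base R. The names 'quoted', 'pattern', 'texts' and 'found' are made up for illustration, and the sample texts are not from the question:

```r
countries <- c("United States", "Israel", "Canada")

# Wrap each name in \Q...\E so it is taken literally, then require
# word boundaries (\b) on both sides of the alternation.
quoted  <- paste0("\\Q", countries, "\\E")
pattern <- paste0("\\b(?:", paste(quoted, collapse = "|"), ")\\b")

# Illustrative inputs (assumed, not from the question)
texts <- c("United States is a match",
           "Canadarm is not a match",
           "Visited Canada.")

# regexpr() finds the first match per element (-1 when there is none)
m <- regexpr(pattern, texts, perl = TRUE)

# Fill in matches where they exist, NA elsewhere
found <- rep(NA_character_, length(texts))
found[m != -1] <- regmatches(texts, m)
found
```

The same pattern string should also work with 'str_extract', since ICU regular expressions support \Q...\E and \b as well. Note that the quoting breaks if a name itself contains "\E".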
2. base R
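Without using any external packages, we can extract the matches with 'regmatches' and 'gregexpr', using the same 'indx' pattern on the grep subset 'text.df1' from above (this assumes each remaining row contains exactly one match):
text.df1$text <- unlist(regmatches(text.df1$text,
                                   gregexpr(indx, text.df1$text)))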
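The base R snippet above reuses 'text.df1' from the stringr step. A self-contained sketch that starts from 'text.df' directly could look like this. It uses 'regexpr' (first match per row) instead of 'gregexpr', which is my substitution rather than part of the answer, and the names 'm', 'hit' and 'result' are only illustrative:

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match",
                               "Not a match", "Israel is a match",
                               "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# Position of the first match in each row; -1 where nothing matched
m <- regexpr(indx, text.df$text)
hit <- which(m != -1)

# Keep only the matching rows; regmatches() returns one string per
# matched row, in the same order, so the lengths line up.
result <- text.df[hit, ]
result$text <- regmatches(text.df$text, m)
result
```

Since 'regexpr' reports at most one match per row, the lengths stay aligned even if a text happens to mention two countries; only the first one is kept in that case.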
3. stringi
We could also use the faster stri_extract from stringi:
library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
# ID text1
#1 1 United States
#4 4 Israel
#5 5 Canada
Here's an approach with data.table:
library(data.table)
data.table(text.df)[
  sapply(countries, function(x) grep(x, text), USE.NAMES = FALSE),
  list(ID, text = countries)]
#    ID          text
# 1:  1 United States
# 2:  4        Israel
# 3:  5        Canada
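The 'sapply'/'grep' lookup above works here because every country matches exactly one row. A hedged base R sketch of the same row-lookup idea that tolerates countries with zero or several matching rows is below. The extra country "France" and the names 'hits' and 'result' are assumptions added for illustration:

```r
# "France" is added only to show a country with no matching row
countries <- c("United States", "Israel", "Canada", "France")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match",
                               "Not a match", "Israel is a match",
                               "Canada is a match"),
                      stringsAsFactors = FALSE)

# One integer vector of matching row numbers per country;
# a vector may be empty or hold more than one row.
hits <- lapply(countries, function(x) grep(x, text.df$text, fixed = TRUE))

# Repeat each country once per matching row and pair it with the row's ID
result <- data.frame(ID = text.df$ID[unlist(hits)],
                     text = rep(countries, lengths(hits)),
                     stringsAsFactors = FALSE)
result <- result[order(result$ID), ]
result
```

Using 'lapply' keeps the result a list regardless of how many rows each country matches, so the step that builds the data frame does not depend on 'sapply' simplifying to a plain vector.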
Create the pattern, p, and use strapply to extract the match from each component of text, returning NA for each unmatched component. Finally, remove the NA values using na.omit. This is non-destructive (i.e. text.df is not modified):
library(gsubfn)
p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))
giving:
ID text
1 1 United States
4 4 Israel
5 5 Canada
Using dplyr it could also be written as follows (using p from above):
library(dplyr)
library(gsubfn)
text.df %>%
  mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
  na.omit
Source: https://stackoverflow.com/questions/29196831/substring-extraction-from-vector-in-r