Substring extraction from vector in R

别说谁变了你拦得住时间么 posted on 2019-12-01 22:59:02

1. stringr

We could first subset 'text.df' with 'grep', using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern, and then use 'str_extract' to get the matched elements from the 'text' column, assigning the result to the 'text' column of the subset dataset ('text.df1').

library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
#  ID          text
#1  1 United States
#4  4        Israel
#5  5        Canada
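
The objects 'text.df' and 'countries' are not shown in this answer. As a rough sketch, the hypothetical stand-ins below (the sentences and the exact country list are assumptions, chosen only so that rows 1, 4 and 5 match) show what the collapsed pattern looks like and which rows it selects, using base R only:

```r
# Hypothetical sample data; the real 'text.df' and 'countries' are not shown above
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(
  ID = 1:5,
  text = c("Flights to the United States resumed",
           "No country is mentioned here",
           "Another sentence without a match",
           "She moved to Israel last year",
           "Canada has long winters"),
  stringsAsFactors = FALSE
)

# Collapse the vector into a single alternation pattern
indx <- paste(countries, collapse = "|")
indx
# [1] "United States|Israel|Canada"

# Row numbers whose text matches at least one country
grep(indx, text.df$text)
# [1] 1 4 5
```

Because the names are pasted into a regular expression as-is, a name containing regex metacharacters (for example a dot or parentheses) would need escaping first.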

2. base R

Without using any external packages, we can extract just the text matched by 'indx', starting from the subset 'text.df1' created above:

text.df1$text <- unlist(regmatches(text.df1$text, 
                           gregexpr(indx, text.df1$text)))
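
Note that 'gregexpr' returns every match in a row, so the 'unlist' call above only lines up with the rows when each row contains exactly one match. A minimal alternative sketch, assuming only the first match per row is wanted, uses 'regexpr' instead and does the subsetting and the extraction in one pass. The sample data here is hypothetical, since the original 'text.df' and 'countries' are not shown:

```r
# Hypothetical sample data (assumption, not from the original question)
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(
  ID = 1:5,
  text = c("Flights to the United States resumed",
           "No country is mentioned here",
           "Another sentence without a match",
           "She moved to Israel last year",
           "Canada has long winters"),
  stringsAsFactors = FALSE
)

indx <- paste(countries, collapse = "|")

# regexpr() gives the position of the first match per element, or -1 for no match
m <- regexpr(indx, text.df$text)

# Keep only the rows that matched; text.df itself is left unchanged
out <- text.df[m != -1, ]

# regmatches() returns the matched substrings for the matching elements only
out$text <- regmatches(text.df$text, m)
out
#   ID          text
# 1  1 United States
# 4  4        Israel
# 5  5        Canada
```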

3. stringi

We could also use the faster stri_extract from stringi

library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
#  ID         text1
#1  1 United States
#4  4        Israel
#5  5        Canada

Here's an approach with data.table:

library(data.table)
data.table(text.df)[
    sapply(countries, function(x) grep(x, text), USE.NAMES = FALSE),
    list(ID, text = countries)]
   ID          text
1:  1 United States
2:  4        Israel
3:  5        Canada
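
The per-country 'grep' above appears to rely on every country matching exactly one row, so that 'sapply' returns a plain vector of row numbers aligned with 'countries'. As a hedged sketch of the same idea in base R that also tolerates countries with zero or several matching rows, one could collect the hits per country and repeat each name accordingly. The data, including the unmatched "France", is hypothetical:

```r
# Hypothetical sample data; "France" is added to show a country with no match
countries <- c("United States", "Israel", "Canada", "France")
text.df <- data.frame(
  ID = 1:5,
  text = c("Flights to the United States resumed",
           "No country is mentioned here",
           "Another sentence without a match",
           "She moved to Israel last year",
           "Canada has long winters"),
  stringsAsFactors = FALSE
)

# For each country, the row numbers whose text contains it (possibly none)
hits <- lapply(countries, function(x) grep(x, text.df$text, fixed = TRUE))

# Repeat each country once per matching row and pair it with that row's ID
res <- data.frame(
  ID   = text.df$ID[unlist(hits)],
  text = rep(countries, lengths(hits)),
  stringsAsFactors = FALSE
)
res
#   ID          text
# 1  1 United States
# 2  4        Israel
# 3  5        Canada
```

Unlike the single alternation pattern, this returns rows in the order of 'countries' rather than in the order of 'text.df'.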

Create the pattern, p, and use strapply to extract the match from each component of text, returning NA for each unmatched component. Finally, remove the rows containing NA using na.omit. This is non-destructive (i.e. text.df is not modified):

library(gsubfn)

p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))

giving:

  ID          text
1  1 United States
4  4        Israel
5  5        Canada

Using dplyr it could also be written as follows (using p from above):

library(dplyr)
library(gsubfn)

text.df %>% 
  mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
  na.omit