R remove multiple text strings in data frame

Submitted by ◇◆丶佛笑我妖孽 on 2019-11-28 12:48:05
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))

#   id          text time username
# 1  1      ai and x   10       me
# 2  2 and computing    5      you
# 3  3       nothing   15 everyone
# 4  4   ibm privacy    0     know

(dat1 <- as.data.frame(sapply(dat, function(x) 
  gsub(paste(wordstoremove, collapse = '|'), '', x))))

#   id    text time username
# 1  1   and x   10       me
# 2  2    and     5      you
# 3  3 nothing   15 everyone
# 4  4            0     know

Another option using dplyr::mutate() and stringr::str_remove_all():

library(dplyr)
library(stringr)

dat <- dat %>%   
  mutate(text = str_remove_all(text, regex(str_c("\\b", wordstoremove, "\\b", collapse = '|'), ignore_case = TRUE)))

Because lowercase 'ai' could easily be part of a longer word, each word to remove is wrapped in \\b (a word boundary) so that it is not removed from the beginning, middle, or end of other words.
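To see what the word boundaries buy you, here is a small base R sketch comparing the two patterns. The sample strings are made up for illustration and are not from the question's data.

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

# Made-up sample strings: "train" and "said" both contain "ai"
x <- c("ai and x", "train the ai", "said nothing")

# Plain alternation: matches "ai" anywhere, even inside other words
plain <- paste(wordstoremove, collapse = "|")
res_plain <- gsub(plain, "", x)
res_plain
# [1] " and x"     "trn the "   "sd nothing"

# Each word wrapped in \\b: only whole-word matches are removed
bounded <- paste0("\\b", wordstoremove, "\\b", collapse = "|")
res_bounded <- gsub(bounded, "", x)
res_bounded
# [1] " and x"       "train the "   "said nothing"
```

Note that removing a word leaves the surrounding spaces in place, so you may still want to trim the result.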

The search pattern is also wrapped in regex(pattern, ignore_case = TRUE) so that words capitalized in the text string are matched as well.

str_replace_all() could be used if you wanted to replace the words with something else rather than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
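A quick sketch of the difference, assuming stringr is installed. The sample text and the "REDACTED" placeholder are only illustrative.

```r
library(stringr)

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
pattern <- regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"),
                 ignore_case = TRUE)

# Made-up sample text; "AI" is capitalized to exercise ignore_case
text <- c("AI and x", "ibm privacy")

# Substitute a placeholder instead of deleting the match
replaced <- str_replace_all(text, pattern, "REDACTED")
replaced
# [1] "REDACTED and x"    "REDACTED REDACTED"

# str_remove_all() gives the same result as replacing with ""
removed <- str_remove_all(text, pattern)
removed
# [1] " and x" " "
```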

rawr's answer could be updated to:

dat1 <- as.data.frame(sapply(dat, function(x) 
  gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = TRUE)))
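
One side effect of the sapply() version is that it runs gsub() over every column and returns a character matrix, so id and time come back as character rather than numeric. If that matters, a possible variant (a sketch, not part of the original answers) is to touch only the text column and tidy up the leftover whitespace afterwards. The name dat2 and the whitespace cleanup step are my own additions.

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")

# Same data as in the question, built directly so the example is self-contained
dat <- data.frame(
  id = 1:4,
  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
  time = c(10, 5, 15, 0),
  username = c("me", "you", "everyone", "know"),
  stringsAsFactors = FALSE
)

pattern <- paste0("\\b", wordstoremove, "\\b", collapse = "|")

# Only the text column is modified; other columns keep their types
dat2 <- dat
dat2$text <- gsub(pattern, "", dat$text, ignore.case = TRUE)

# Collapse runs of whitespace left by the removals, then trim the ends
dat2$text <- trimws(gsub("\\s+", " ", dat2$text))

dat2$text
# [1] "and x"   "and"     "nothing" ""
```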