Grepl group of strings and count frequency of all using R

Submitted by 一世执手 on 2021-02-11 12:08:23

Question


I have a column named text with 50k rows of tweets, read from a CSV file (the tweets consist of sentences, phrases, etc.). I'm trying to count the frequency of several words in that column. Is there an easier way to do it than what I'm doing below?

# Reading my file
tweets <- read.csv('coffee.csv', header=TRUE)


# Doing a grepl per word (This is hard because I need to look for many words one by one)
coffee <- grepl("coffee", tweets$text, ignore.case=TRUE)
mugs   <- grepl("mugs", tweets$text, ignore.case=TRUE)


# Calculate the % of times among all tweets (This is hard because I need to calculate one by one)

sum(coffee) / nrow(tweets)
sum(mugs) / nrow(tweets)

Expected output (assuming I have more than two words above):

Word   Freq
coffee  50
mugs    40
cup     64
pen     12

Answer 1:


You can create a vector of the words that you want to count frequency/percentage for and use sapply to calculate them.

words <- c('coffee', 'mugs')

data.frame(words, t(sapply(paste0('\\b', words, '\\b'), function(x) {
  tmp <- grepl(x, tweets$text)
  c(perc = mean(tmp) * 100, 
    Freq = sum(tmp))
})), row.names = NULL) -> result
result

#   words     perc Freq
#1 coffee 33.33333    1
#2   mugs 66.66667    2

sapply works like a for loop: it iterates over each word in words. For each word, grepl returns a TRUE/FALSE vector, stored in tmp, indicating whether the word is present in each element of tweets$text. sum of that vector gives the frequency (the number of tweets that contain the word), and mean gives the proportion, which is multiplied by 100 to get the percentage. The word boundaries (\\b) added around each word make it match only as a whole word, so 'coffee' does not match 'coffees'.
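
To illustrate the effect of the word boundary, here is a minimal sketch on a small made-up data frame (the three sample tweets are hypothetical, not from the question). It also passes ignore.case = TRUE, as the question's own grepl calls do, which is an addition to the answer's code rather than part of it.

```r
# Hypothetical sample data, invented for this illustration
tweets <- data.frame(text = c("Coffee time",
                              "Two coffees please",
                              "my coffee mug"))

# Plain pattern: matches "coffee" anywhere, including inside "coffees"
loose <- grepl("coffee", tweets$text, ignore.case = TRUE)

# Word-boundary pattern: matches only the whole word "coffee"
strict <- grepl("\\bcoffee\\b", tweets$text, ignore.case = TRUE)

sum(loose)          # number of tweets matched without word boundaries
sum(strict)         # number of tweets matched as a whole word
mean(strict) * 100  # percentage of tweets containing the whole word
```

Here the plain pattern matches all three tweets, while the word-boundary pattern skips the one that only contains 'coffees'. Whether case-insensitive matching is wanted depends on the data, so treat that flag as an assumption.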

data

tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs', 
                              'This has only mugs', 
                              'This has nothing'))


Source: https://stackoverflow.com/questions/66073470/grepl-group-of-strings-and-count-frequency-of-all-using-r
