Find the most frequently occurring words in a text in R

刺人心 2020-12-14 13:33

Can someone help me find the most frequently used two- and three-word phrases in a text using R?

My text is...

text <- c(\"Th         


        
5 Answers
  •  没有蜡笔的小新
    2020-12-14 14:06

    The tidytext package makes this sort of thing pretty simple:

    library(tidytext)
    library(dplyr)
    
    tibble(text = text) %>%    # data_frame() is deprecated; dplyr re-exports tibble()
        unnest_tokens(word, text) %>%    # split into one word per row
        anti_join(stop_words, by = "word") %>%    # take out "a", "an", "the", etc.
        count(word, sort = TRUE)    # count occurrences
    
    # Source: local data frame [73 x 2]
    # 
    #           word     n
    #          (chr) (int)
    # 1       phrase     8
    # 2     sentence     6
    # 3        words     4
    # 4       called     3
    # 5       common     3
    # 6  grammatical     3
    # 7      meaning     3
    # 8         alex     2
    # 9         bird     2
    # 10    complete     2
    # ..         ...   ...
    

    If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:

    library(tokenizers)
    library(dplyr)
    
    tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%    # tokenize bigrams and trigrams
        as_tibble() %>%    # convert to a tibble with a `value` column (as_data_frame() is deprecated)
        count(value, sort = TRUE)    # count
    
    # Source: local data frame [531 x 2]
    # 
    #           value     n
    #          (fctr) (int)
    # 1        of the     5
    # 2      a phrase     4
    # 3  the sentence     4
    # 4          as a     3
    # 5        in the     3
    # 6        may be     3
    # 7    a complete     2
    # 8   a phrase is     2
    # 9    a sentence     2
    # 10      a white     2
    # ..          ...   ...
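
To see roughly what the n-gram tokenizer does under the hood, here is a minimal base-R sketch that needs no packages. It is an illustration, not how `tokenizers` is implemented: `sample_text` and the helper `ngrams()` are hypothetical names made up for this example, since the question's own `text` is truncated, and the word-splitting rule is a simplifying assumption.

```r
# Hypothetical sample text; the `text` from the question is not reproduced here.
sample_text <- "A phrase is a group of words. A phrase is not a sentence."

# Lowercase, then split on any run of characters that is not a letter or apostrophe.
words <- strsplit(tolower(sample_text), "[^a-z']+")[[1]]
words <- words[nzchar(words)]    # drop empty strings, if any

# Hypothetical helper: build every run of n consecutive words.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count bigrams and trigrams together, most frequent first.
counts <- sort(table(c(ngrams(words, 2), ngrams(words, 3))), decreasing = TRUE)
head(counts, 5)
```

Note that this sketch, like the pipeline above, does not stop at sentence boundaries, so an n-gram such as "words a" can span two sentences. Split the text into sentences first if that matters for your use case.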
    
