Finding 2 & 3 word Phrases Using R TM Package

死守一世寂寞 2020-11-28 04:26

I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for it th

7 answers
  •  Happy的楠姐
    2020-11-28 04:57

    Try this code.

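    library(tm)         # corpus handling and TermDocumentMatrix
    library(SnowballC)  # stemmer backend used by stemDocument
    library(wordcloud)  # word cloud plot
    
    keywords <- read.csv(file.choose(), header = TRUE, na.strings = c("NA", "-", "?"))
    # VCorpus rather than Corpus: in recent tm versions Corpus() on a VectorSource returns a
    # SimpleCorpus, for which TermDocumentMatrix ignores a custom tokenizer.
    keywords_doc <- VCorpus(VectorSource(keywords$"use your column that you need"))
    keywords_doc <- tm_map(keywords_doc, removeNumbers)
    # Wrap base functions in content_transformer() so the documents stay tm text documents
    keywords_doc <- tm_map(keywords_doc, content_transformer(tolower))
    keywords_doc <- tm_map(keywords_doc, removePunctuation)
    keywords_doc <- tm_map(keywords_doc, stripWhitespace)
    keywords_doc <- tm_map(keywords_doc, stemDocument)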
    

    This is the bigram/trigram section. It builds bigrams; change the 2 in ngrams() to 3 to get three-word phrases.

    BigramTokenizer <- function(x) {
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
    }
    # Build the term-document matrix using the bigram tokenizer
    keywords_matrix <- TermDocumentMatrix(keywords_doc, control = list(tokenize = BigramTokenizer))
    
    # Drop sparse terms: phrases absent from roughly 95% or more of the documents
    keywords_naremoval <- removeSparseTerms(keywords_matrix, 0.95)
    
    # Total frequency of each phrase across all documents
    keyword.freq <- rowSums(as.matrix(keywords_naremoval))
    # Keep only phrases that occur at least 20 times
    subsetkeyword.freq <- subset(keyword.freq, keyword.freq >= 20)
    frequentKeywordSubsetDF <- data.frame(term = names(subsetkeyword.freq), freq = subsetkeyword.freq)
    
    # Sort both data frames by descending frequency
    frequentKeywordDF <- data.frame(term = names(keyword.freq), freq = keyword.freq)
    frequentKeywordSubsetDF <- frequentKeywordSubsetDF[order(-frequentKeywordSubsetDF$freq), ]
    frequentKeywordDF <- frequentKeywordDF[order(-frequentKeywordDF$freq), ]
    
    # Plot the phrases that occur at least 30 times as a word cloud
    wordcloud(frequentKeywordDF$term, freq = frequentKeywordDF$freq, random.order = FALSE, rot.per = 0.35, scale = c(5, 0.5), min.freq = 30, colors = brewer.pal(8, "Dark2"))
    

    Hope this helps. This is the complete code that you could use.
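
To see what an n-gram tokenizer produces without loading a CSV, here is a minimal base-R sketch that counts two- and three-word phrases in a small character vector. It is an illustration only and does not use tm. The helper names `make_ngrams` and `count_ngrams` and the sample `docs` vector are made up for this example, and the tokenization (lowercase, split on anything that is not a-z) is a simplifying assumption that is cruder than tm's preprocessing.

```r
# Hypothetical helper (not part of tm): all n-word phrases from one string.
make_ngrams <- function(text, n) {
  # Assumption: lowercase the text and split on any run of non-letters
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]          # drop empty tokens
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: phrase counts over all documents, most frequent first.
count_ngrams <- function(docs, n) {
  phrases <- unlist(lapply(docs, make_ngrams, n = n), use.names = FALSE)
  sort(table(phrases), decreasing = TRUE)
}

# Made-up sample data for illustration
docs <- c("the quick brown fox", "the quick brown dog", "a quick brown fox")

bigrams  <- count_ngrams(docs, 2)   # two-word phrases
trigrams <- count_ngrams(docs, 3)   # three-word phrases

print(head(bigrams, 3))
print(head(trigrams, 3))
```

The same idea carries over to the tm code above: only the `n` passed to the tokenizer changes between bigrams and trigrams, while the frequency counting and sorting stay the same.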
