R Text Mining: Counting the number of times a specific word appears in a corpus?

悲&欢浪女 2021-01-03 05:57

I have seen this question answered in other languages but not in R.

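[Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus. Now I would like to search for the number of times these phrases have appeared in another corpus.

Is there a way to do this in TM package? (Or another related package)

For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.

As always, I appreciate all your help!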

3 Answers
  •  春和景丽
    2021-01-03 06:20

    Ain't perfect but this should get you started.

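    # User-defined function: lowercase the text and strip punctuation
    # (optionally digits and apostrophes as well)
    strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
        strp <- function(x, digit.remove, apostrophe.remove){
            # keep apostrophes and non-punctuation characters, drop other punctuation
            x2 <- trimws(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
            x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
            if(digit.remove) gsub("[[:digit:]]", "", x2) else x2
        }
        # trimws() is base R (>= 3.2.0) and removes leading/trailing whitespace
        unlist(lapply(x, function(x) trimws(strp(x = x, digit.remove = digit.remove, 
            apostrophe.remove = apostrophe.remove))))
    }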
    #==================================================================
    #Create 2 'corpus' documents (you'd have to actually do all this in tm)
    corpus1 <- 'I have seen this question answered in other languages but not in R.
    [Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus. 
    Now I would like to search for the number of times these phrases have appeared in another corpus.
    Is there a way to do this in TM package? (Or another related package)
    For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of 
    couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.
    As always, I appreciate all your help!'
    
    corpus2 <- "What have you tried? If you have seen it answered in another language, why don't you try translating that 
    language into R? – Eric Strom 2 hours ago
    I am not a coder, otherwise would do. I just do not know a way to do this. – appletree 1 hour ago
    Could you provide some example? or show what you have in mind for input and output? or a pseudo code? 
    As it is I find the question a bit too general. As it sounds I think you could use regular expressions 
    with grep to find your 'tags'. – AndresT 15 mins ago"
    #=======================================================
    #Clean up the text
    corpus1 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus1))
    corpus2 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus2))
    
    #split on runs of whitespace so stripped punctuation/digits don't leave empty tokens
    corpus1.wrds <- as.vector(unlist(strsplit(strip(corpus1), "\\s+")))
    corpus2.wrds <- as.vector(unlist(strsplit(strip(corpus2), "\\s+")))
    
    #create frequency tables for each corpus
    corpus1.Freq <- data.frame(table(corpus1.wrds))
    corpus1.Freq$corpus1.wrds  <- as.character(corpus1.Freq$corpus1.wrds)
    corpus1.Freq <- corpus1.Freq[order(-corpus1.Freq$Freq), ]
    rownames(corpus1.Freq) <- 1:nrow(corpus1.Freq)
    key.terms <- corpus1.Freq[corpus1.Freq$Freq>2, 'corpus1.wrds'] #key words to match on corpus 2
    
    corpus2.Freq <- data.frame(table(corpus2.wrds))
    corpus2.Freq$corpus2.wrds  <- as.character(corpus2.Freq$corpus2.wrds)
    corpus2.Freq <- corpus2.Freq[order(-corpus2.Freq$Freq), ]
    rownames(corpus2.Freq) <- 1:nrow(corpus2.Freq)
    
    #Match key words to the words in corpus 2
    corpus2.Freq[corpus2.Freq$corpus2.wrds %in% key.terms, ]
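
    The code above only matches single words. If the "tags" are multi-word phrases, one rough way to count them in base R is a fixed-string search with gregexpr(). This is only a sketch: count_phrases, tags and corpusB are made-up names for illustration, the match is case-insensitive and non-overlapping, and it is a plain substring match (so "in R" would also hit "in red").

    ```r
    # Hypothetical helper (an illustration, not part of tm):
    # count occurrences of each phrase across a character vector of texts.
    count_phrases <- function(phrases, texts) {
        texts <- tolower(texts)
        vapply(phrases, function(p) {
            # fixed = TRUE treats the phrase as a literal string, not a regex
            hits <- gregexpr(tolower(p), texts, fixed = TRUE)
            # gregexpr() returns -1 for a text with no match, so count positions > 0
            sum(vapply(hits, function(h) sum(h > 0), integer(1)))
        }, integer(1))
    }

    # Toy data standing in for the phrases and the second corpus
    tags <- c("text mining", "have seen", "in R")
    corpusB <- c("I have seen text mining done in R.",
                 "Text mining in R: have you tried it?")

    count_phrases(tags, corpusB)
    # text mining: 2, have seen: 1, in R: 2
    ```

    The result is a named integer vector, one count per phrase, summed over all texts. For a real tm corpus you would first have to pull the document text out into a character vector.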
    
