R Text Mining: Counting the number of times a specific word appears in a corpus?

悲&欢浪女 2021-01-03 05:57

I have seen this question answered in other languages but not in R.

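[Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to count how many times these phrases appear in another corpus. Is there a way to do this with the tm package (or another related package)?

For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB.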

3 answers
  •  刺人心 (original poster)
    2021-01-03 06:07

    This is how I'd approach the problem now:

    library(tm)
    library(qdap)
    
    ## Create a MWE like you should have done:
    corpus1 <- 'I have seen this question answered in other languages but not in R.
    [Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus. 
    Now I would like to search for the number of times these phrases have appeared in another corpus.
    Is there a way to do this in TM package? (Or another related package)
    For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of 
    couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.
    As always, I appreciate all your help!'
    
    corpus2 <- "What have you tried? If you have seen it answered in another language, why don't you try translating that 
    language into R? – Eric Strom 2 hours ago
    I am not a coder, otherwise would do. I just do not know a way to do this. – appletree 1 hour ago
    Could you provide some example? or show what you have in mind for input and output? or a pseudo code? 
    As it is I find the question a bit too general. As it sounds I think you could use regular expressions 
    with grep to find your 'tags'. – AndresT 15 mins ago"
    

    ## Now the code:

    ## Create the corpus and extract the 7 most frequent terms;
    ## tied frequencies are kept, so 8 rows come back here
    corp1 <- Corpus(VectorSource(corpus1))
    (terms <- apply_as_df(corp1, freq_terms, top=7, stopwords=tm::stopwords("en")))
    
    ##   WORD     FREQ
    ## 1 corpus      3
    ## 2 phrases     3
    ## 3 another     2
    ## 4 appeared    2
    ## 5 corpusb     2
    ## 6 obtained    2
    ## 7 tags        2
    ## 8 times       2
    
    ## Use termco to count these top terms in a new corpus
    corp2 <- Corpus(VectorSource(corpus2))
    apply_as_df(corp2, termco, match.list=terms[, 1])
    
    ##   docs word.count corpus phrases  another appeared corpusb obtained     tags times
    ## 1    1         96      0       0 1(1.04%)        0       0        0 1(1.04%)     0
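
If installing qdap is not an option, the same kind of count can be sketched in base R. This is only a minimal sketch under a few assumptions: `count_phrases` is a hypothetical helper (it is not part of tm or qdap), the sample `tags` and `corpus_b` vectors are made up for illustration, matching is case-insensitive and literal, and a phrase is counted even when it sits inside a longer word (for example "tags" inside "hashtags").

```r
# Hypothetical helper (not part of tm or qdap): count non-overlapping,
# case-insensitive occurrences of each phrase in each text.
count_phrases <- function(texts, phrases) {
  texts <- tolower(texts)
  phrases <- tolower(phrases)
  counts <- sapply(phrases, function(p) {
    # Match the phrase literally; gregexpr() returns -1 when nothing matches
    hits <- gregexpr(p, texts, fixed = TRUE)
    vapply(hits, function(h) sum(h > 0), integer(1))
  })
  # sapply() drops to a vector for a single text, so force a matrix
  matrix(counts, nrow = length(texts),
         dimnames = list(paste0("doc", seq_along(texts)), phrases))
}

# Made-up example data: "tags" from one corpus, texts from another
tags <- c("frequent phrases", "corpus", "text mining")
corpus_b <- c("Text mining a corpus starts with the corpus itself.",
              "Frequent phrases are easy to count.")

result <- count_phrases(corpus_b, tags)
print(result)            # one row per document, one column per phrase
print(colSums(result))   # total count of each phrase across all documents
```

For a tm corpus, the plain text would first have to be pulled out as a character vector before being passed in. Literal matching keeps the sketch short; if whole-word matches are needed, a regular expression with word boundaries would be the usual alternative, at the cost of escaping any special characters in the phrases.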
    
