R Text Mining: Counting the number of times a specific word appears in a corpus?

悲&欢浪女 2021-01-03 05:57

I have seen this question answered in other languages but not in R.

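[Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to count how many times these phrases appear in another corpus.

Is there a way to do this in the tm package (or another related package)?

For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB.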

3 answers

刺人心 (original poster)

2021-01-03 06:07

This is how I'd approach the problem now:

library(tm)
library(qdap)

## Create a minimal working example (MWE), like you should have done:
corpus1 <- 'I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus. 
Now I would like to search for the number of times these phrases have appeared in another corpus.
Is there a way to do this in TM package? (Or another related package)
For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of 
couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.
As always, I appreciate all your help!'

corpus2 <- "What have you tried? If you have seen it answered in another language, why don't you try translating that 
language into R? – Eric Strom 2 hours ago
I am not a coder, otherwise would do. I just do not know a way to do this. – appletree 1 hour ago
Could you provide some example? or show what you have in mind for input and output? or a pseudo code? 
As it is I find the question a bit too general. As it sounds I think you could use regular expressions 
with grep to find your 'tags'. – AndresT 15 mins ago"

## Now the code:

## Create the corpus and extract the most frequent terms
## (top = 7; words tied at the cutoff frequency are kept, hence 8 rows below)
corp1 <- Corpus(VectorSource(corpus1))
(terms <- apply_as_df(corp1, freq_terms, top=7, stopwords=tm::stopwords("en")))

##   WORD     FREQ
## 1 corpus      3
## 2 phrases     3
## 3 another     2
## 4 appeared    2
## 5 corpusb     2
## 6 obtained    2
## 7 tags        2
## 8 times       2

## Use termco to count how often each of these frequent terms occurs in a new corpus
corp2 <- Corpus(VectorSource(corpus2))
apply_as_df(corp2, termco, match.list=terms[, 1])

##   docs word.count corpus phrases  another appeared corpusb obtained     tags times
## 1    1         96      0       0 1(1.04%)        0       0        0 1(1.04%)     0
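
If you would rather not depend on qdap, the same kind of count can be sketched in base R. This is only a minimal sketch, not how termco works internally: `count_phrases` is a hypothetical helper name, the two documents and the `tags` vector are made-up sample data, and matching is literal and case-insensitive on substrings (so "in r" would also match inside a longer word).

```r
# Hypothetical helper (not part of tm or qdap): count literal,
# case-insensitive occurrences of each phrase in each document.
count_phrases <- function(docs, phrases) {
  lower_docs <- tolower(docs)
  counts <- sapply(tolower(phrases), function(p) {
    hits <- gregexpr(p, lower_docs, fixed = TRUE)
    # gregexpr() returns -1 for a document with no match
    vapply(hits, function(h) sum(h > 0), integer(1))
  })
  # Force a documents x phrases matrix, even for a single document
  matrix(counts, nrow = length(docs),
         dimnames = list(names(docs), phrases))
}

# Made-up sample data standing in for CorpusB and the tags from CorpusA
docs <- c(doc1 = "Text mining in R. Text mining is fun.",
          doc2 = "No match here.")
tags <- c("text mining", "in R", "corpus")

result <- count_phrases(docs, tags)
print(result)
```

With a tm corpus you would first pull the text out as a character vector before passing it in. If you need whole-word matches only, a regular expression with word boundaries would be the usual replacement for `fixed = TRUE`.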
