Finding ngrams in R and comparing ngrams across corpora

有些话、适合烂在心里 提交于 2019-11-30 06:56:52

I could not reproduce your problem, are you using the latest versions of R, tm, RWeka, etc.?

require(tm)
a <- Corpus(DirSource("C:\\Downloads\\Only1965\\Only1965"))
summary(a)  
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) 
# a <- tm_map(a, stemDocument, language = "english") 
# I also got it to work with stemming, but it takes so long...
adtm <-DocumentTermMatrix(a) 
adtm <- removeSparseTerms(adtm, 0.75)

inspect(adtm) 

findFreqTerms(adtm, lowfreq=10) # find terms with a frequency higher than 10
findAssocs(adtm, "usa",.5) # just looking for some associations  
findAssocs(adtm, "china",.5)

# Trigrams
require(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.75)
inspect(tdm[1:5,1:5])

And here's what I get

A term-document matrix (5 terms, 5 documents)

Non-/sparse entries: 11/14
Sparsity           : 56%
Maximal term length: 28 
Weighting          : term frequency (tf)

                                   Docs
Terms                               PR1965-01.txt PR1965-02.txt PR1965-03.txt
  †chinese press                              0             0             0
  †renmin ribao                               0             1             1
  — renmin ribao                              2             5             2
  “ chinese people                            0             0             0
  “renmin ribaoâ€\u009d editorial             0             1             0
  etc. 

Regarding your step two, here are some pointers to useful starts:

http://quantifyingmemory.blogspot.com/2013/02/mapping-significant-textual-differences.html http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ and here's his code https://dl.dropboxusercontent.com/u/4713959/Neuchatel/NassrProgram.R

cenau

Regarding Step 1, Brian.keng gives a one liner workaround here https://stackoverflow.com/a/20251039/3107920 that solves this issue on Mac OSX - it seems to be related to parallelisation rather than ( the minor nightmare that is ) java setup on mac.

Ashish M

You may want to explicitly access the functions like this

BigramTokenizer  <- function(x) {
    RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))
}

myTdmBi.d <- TermDocumentMatrix(
    myCorpus.d,
    control = list(tokenize = BigramTokenizer, weighting = weightTfIdf)
)

Also, some other things that randomly came up.

myCorpus.d <- tm_map(myCorpus.d, tolower)  # This does not work anymore 

Try this instead

 myCorpus.d <- tm_map(myCorpus.d, content_transformer(tolower))  # Make lowercase

In the RTextTools package,

create_matrix(as.vector(C$V2), ngramLength=3) # ngramLength throws an error message.

Further to Ben's answer - I couldn't reproduce this either, but in the past I've had trouble with the plyr package and conflicting dependencies. In my case there was a conflict between Hmisc and ddply. You could try adding this line just prior to the offending line of code:

tryCatch(detach("package:Hmisc"), error = function(e) NULL)

Apologies if this is completely tangental to your problem!

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!