text-mining

Bigrams instead of single words in term-document matrix using R and RWeka

牧云@^-^@ submitted on 2019-11-26 10:59:14
Question: I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R. The idea goes something like this:

    library(tm)
    library(RWeka)
    data(crude)

    # Tokenizer for n-grams, passed on to the term-document matrix constructor
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However
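For readers without Java/RWeka installed, the same idea (pairing each token with its successor) can be sketched in base R. This is only an illustration, not the RWeka NGramTokenizer used in the question:

```r
# Minimal bigram tokenizer in base R: split on whitespace, then pair
# each token with the one that follows it.
bigrams <- function(x) {
  toks <- unlist(strsplit(tolower(x), "\\s+"))
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}

bigrams("Crude oil prices rose sharply")
# "crude oil" "oil prices" "prices rose" "rose sharply"
```

A function with this shape can be passed as the `tokenize` control option of `TermDocumentMatrix`, just as `BigramTokenizer` is above.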

R tm package invalid input in 'utf8towcs'

安稳与你 submitted on 2019-11-26 09:25:52
Question: I'm trying to use the tm package in R to perform some text analysis. I tried the following:

    require(tm)
    dataSet <- Corpus(DirSource('tmp/'))
    dataSet <- tm_map(dataSet, tolower)

    Error in FUN(X[[6L]], ...) :
      invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'

The problem is that some characters are not valid. I'd like to exclude the invalid characters from analysis either from
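One common workaround for this class of error is to re-encode the text with base R's `iconv` before running `tm_map`, substituting (or dropping) bytes that are not valid UTF-8. The source encoding of the files is an assumption here; the helper name `clean_utf8` is my own:

```r
# Replace bytes that are invalid UTF-8 with a hex escape such as <e9>
# (sub = "byte"); use sub = "" instead to drop them silently.
clean_utf8 <- function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

clean_utf8("oilspill bp")   # valid input passes through unchanged

# With tm, this could be applied to every document in the corpus, e.g.:
# dataSet <- tm_map(dataSet, content_transformer(clean_utf8))
```

After cleaning, transformations like `tolower` no longer hit `utf8towcs` on the invalid bytes.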

What is “entropy and information gain”?

时间秒杀一切 submitted on 2019-11-26 06:09:18
Question: I am reading this book (NLTK) and it is confusing. Entropy is defined as:

    Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple (visual) example?

Answer 1: I assume entropy was mentioned in the context of building decision trees. To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names
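The quoted definition can be computed directly (with the conventional minus sign, so entropy comes out non-negative), and information gain for a decision-tree split follows from it. A minimal sketch in base R; the function names are my own:

```r
# Shannon entropy H(p) = -sum(p * log2(p)), in bits.
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log(0) as 0 by dropping zero-probability labels
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))     # a 50/50 label split (maximally uncertain): 1 bit
entropy(c(1))            # a single certain label: 0 bits

# Information gain of a split = parent entropy minus the weighted
# average entropy of the child nodes.
info_gain <- function(parent, children, weights)
  entropy(parent) - sum(weights * vapply(children, entropy, numeric(1)))
```

In the male/female name example from the answer, a feature like "last letter of the name" is a good split exactly when it yields child nodes with low entropy, i.e. high information gain.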