text-mining

Bigrams instead of single words in term-document matrix using R and RWeka

牧云@^-^@ submitted on 2019-11-26 10:59:14
Question: I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R. The idea goes something like this:

    library(tm)
    library(RWeka)
    data(crude)

    # Tokenizer for n-grams, passed on to the term-document matrix constructor
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However
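For readers without Java/RWeka installed, the same idea (pairing each token with its successor) can be sketched in base R. This is only an illustration, not the RWeka NGramTokenizer used in the question:

```r
# Minimal bigram tokenizer in base R: split on whitespace, then pair
# each token with the one that follows it.
bigrams <- function(x) {
  toks <- unlist(strsplit(tolower(x), "\\s+"))
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}

bigrams("Crude oil prices rose sharply")
# "crude oil" "oil prices" "prices rose" "rose sharply"
```

A function with this shape can be passed as the `tokenize` control option of `TermDocumentMatrix`, just as `BigramTokenizer` is above.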

R tm package invalid input in 'utf8towcs'

安稳与你 submitted on 2019-11-26 09:25:52
Question: I'm trying to use the tm package in R to perform some text analysis. I tried the following:

    require(tm)
    dataSet <- Corpus(DirSource('tmp/'))
    dataSet <- tm_map(dataSet, tolower)

    Error in FUN(X[[6L]], ...) :
      invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'

The problem is that some characters are not valid. I'd like to exclude the invalid characters from analysis either from
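One common workaround for this class of error is to re-encode the text with base R's `iconv` before running `tm_map`, substituting (or dropping) bytes that are not valid UTF-8. The source encoding of the files is an assumption here; the helper name `clean_utf8` is my own:

```r
# Replace bytes that are invalid UTF-8 with a hex escape such as <e9>
# (sub = "byte"); use sub = "" instead to drop them silently.
clean_utf8 <- function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")

clean_utf8("oilspill bp")   # valid input passes through unchanged

# With tm, this could be applied to every document in the corpus, e.g.:
# dataSet <- tm_map(dataSet, content_transformer(clean_utf8))
```

After cleaning, transformations like `tolower` no longer hit `utf8towcs` on the invalid bytes.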

What is “entropy and information gain”?

时间秒杀一切 submitted on 2019-11-26 06:09:18
Question: I am reading this book (NLTK) and it is confusing. Entropy is defined as:

    Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply entropy and maximum entropy in terms of text mining? Can someone give me an easy, simple (visual) example?

Answer 1: I assume entropy was mentioned in the context of building decision trees. To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names
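The quoted definition can be computed directly (with the conventional minus sign, so entropy comes out non-negative), and information gain for a decision-tree split follows from it. A minimal sketch in base R; the function names are my own:

```r
# Shannon entropy H(p) = -sum(p * log2(p)), in bits.
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log(0) as 0 by dropping zero-probability labels
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))     # a 50/50 label split (maximally uncertain): 1 bit
entropy(c(1))            # a single certain label: 0 bits

# Information gain of a split = parent entropy minus the weighted
# average entropy of the child nodes.
info_gain <- function(parent, children, weights)
  entropy(parent) - sum(weights * vapply(children, entropy, numeric(1)))
```

In the male/female name example from the answer, a feature like "last letter of the name" is a good split exactly when it yields child nodes with low entropy, i.e. high information gain.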