text-mining

Better text document clustering than tf/idf and cosine similarity?

Submitted by 血红的双手 on 2019-12-04 07:53:15
Question: I'm trying to cluster the Twitter stream. I want to assign each tweet to a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results were quite bad. The main disadvantage of tf/idf is that it clusters documents that are keyword-similar, so it is only good for identifying near-identical documents. For example, consider the following sentences: 1- The website Stackoverflow is a nice place
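The tf/idf-plus-cosine pipeline the question describes can be sketched in plain Python. The two sentences and the smoothed idf formula below are illustrative assumptions, not the asker's data; the point is that the score rewards shared keywords, which is exactly the weakness described:

```python
import math
from collections import Counter

# Two illustrative near-identical sentences (assumed, not from the stream)
docs = [
    "the website stackoverflow is a nice place",
    "stackoverflow is a nice website",
]

# Raw term frequencies per document
tfs = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*tfs))
n = len(docs)

# Smoothed inverse document frequency (an illustrative variant)
idf = {t: math.log(n / sum(1 for tf in tfs if t in tf)) + 1.0 for t in vocab}

def tfidf_vec(tf):
    return [tf[t] * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

sim = cosine(tfidf_vec(tfs[0]), tfidf_vec(tfs[1]))
```

Two tweets on the same topic that share no vocabulary would score 0 under this measure, which is why purely lexical tf/idf struggles with topical clustering.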

Big Text Corpus breaks tm_map

Submitted by 怎甘沉沦 on 2019-12-04 06:46:52
I have been breaking my head over this one for the last few days. I searched all the SO archives and tried the suggested solutions, but just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99 etc., and want to run some basic text-mining operations such as creating a document-term matrix and a term-document matrix and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus, it fails. I have pasted in the code for one such folder operation. library(tm) # Framework for text
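The asker's R code is cut off, but the underlying task, building a document-term matrix one document at a time so a large corpus need not sit in memory at once, can be sketched with the Python standard library (the three short documents below are placeholders):

```python
from collections import Counter

# Placeholder documents standing in for the txt files in each folder
docs = ["text mining with tm", "big text corpus", "corpus operations"]

# Count one document at a time, so only per-document token counts
# (plus the running vocabulary) are ever held at once
counts = [Counter(d.lower().split()) for d in docs]
vocab = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term
dtm = [[c[t] for t in vocab] for c in counts]
```

For a corpus that genuinely exceeds RAM, the same incremental idea applies: stream each file, keep sparse per-document counts, and only materialise rows on demand.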

Memory error in python using numpy array

Submitted by 夙愿已清 on 2019-12-04 06:08:08
Question: I am getting the following error for this code: model = lda.LDA(n_topics=15, n_iter=50, random_state=1) model.fit(X) topic_word = model.topic_word_ print("type(topic_word): {}".format(type(topic_word))) print("shape: {}".format(topic_word.shape)) print ("\n") n = 15 doc_topic=model.doc_topic_ for i in range(15): print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax())) topic_csharp=np.zeros(shape=[1,n]) np.copyto(topic_csharp,doc_topic[0][i]) for i, topic_dist in enumerate(topic
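Aside from the memory error, the loop in the excerpt indexes doc_topic[0][i], i.e. always row 0 of the document-topic matrix; the top topic per document comes from the argmax over each document's own row. A minimal pure-Python sketch with a hypothetical doc_topic matrix:

```python
# Hypothetical doc_topic matrix: one row per document, one column per topic
doc_topic = [
    [0.1, 0.7, 0.2],   # document 0 is mostly topic 1
    [0.6, 0.3, 0.1],   # document 1 is mostly topic 0
]
titles = ["doc_a", "doc_b"]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

# Index the matrix by document (row i), not doc_topic[0][i]
top_topics = [argmax(doc_topic[i]) for i in range(len(titles))]
print(top_topics)  # prints [1, 0]
```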

Information Gain Calculation for a text file?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-04 05:27:36
Question: I'm working on "text categorization using Information Gain, PCA and Genetic Algorithm". But after performing preprocessing (stemming, stopword removal, TFIDF) on the documents, I'm confused about how to move ahead to the information-gain part. My output file contains each word and its TFIDF value, like: WORD - TFIDF VALUE together(word) - 0.235(tfidf value) come(word) - 0.2548(tfidf value) When using Weka for information gain ("InfoGainAttributeEval.java"), it requires the .arff file format as input. Is there any to
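Information gain for a term can also be computed directly, without Weka's .arff input, as IG(class, word) = H(class) - H(class | word). Note that this needs class labels and per-document presence/absence of the word, not just TF-IDF scores. A sketch over a tiny hypothetical labeled set:

```python
import math
from collections import Counter

# Hypothetical labeled documents: (set of tokens, class label)
data = [
    ({"come", "together"}, "pos"),
    ({"come"}, "pos"),
    ({"together"}, "neg"),
    ({"go"}, "neg"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(word):
    labels = [y for _, y in data]
    with_w = [y for toks, y in data if word in toks]
    without = [y for toks, y in data if word not in toks]
    # Conditional entropy H(class | word), weighted by split sizes
    cond = (len(with_w) * entropy(with_w) +
            len(without) * entropy(without)) / len(data)
    return entropy(labels) - cond
```

Here info_gain("come") is 1.0 because "come" perfectly separates the two classes, while info_gain("together") is 0 because it appears equally in both.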

Calculate word co-occurrence matrix in R

Submitted by 不羁岁月 on 2019-12-04 04:51:49
Question: I would like to calculate a word co-occurrence matrix in R. I have the following data frame of sentences: dat <- as.data.frame("The boy is tall.", header = F, stringsAsFactors = F) dat[2,1] <- c("The girl is short.") dat[3,1] <- c("The tall boy and the short girl are friends.") which gives me: The boy is tall. The girl is short. The tall boy and the short girl are friends. What I want to do is, firstly, make a list of all of the unique words across all three sentences, namely The boy is tall
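The sentence-level co-occurrence count the question asks for can be sketched in Python over the question's own three sentences (in R one would typically compute crossprod() of a document-term matrix; this is the same idea, counted pairwise):

```python
from collections import Counter
from itertools import combinations

sentences = [
    "The boy is tall.",
    "The girl is short.",
    "The tall boy and the short girl are friends.",
]

def tokens(s):
    return [w.strip(".").lower() for w in s.split()]

# For each alphabetically ordered pair of distinct words, count
# how many sentences contain both
cooc = Counter()
for s in sentences:
    for a, b in combinations(sorted(set(tokens(s))), 2):
        cooc[(a, b)] += 1

print(cooc[("boy", "tall")])  # prints 2
```

"boy" and "tall" co-occur in sentences 1 and 3, hence the count of 2; a full symmetric matrix is just this Counter laid out over the sorted vocabulary.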

data frame of tfidf with python

Submitted by 痞子三分冷 on 2019-12-04 02:53:23
I have to classify some sentiments. My data frame looks like this: Phrase Sentiment is it good movie positive wooow is it very goode positive bad movie negative I did some preprocessing (tokenisation, stop-word removal, stemming, etc.) and I get: Phrase Sentiment [good, movie] positive [wooow, is, it, very, good] positive [bad, movie] negative Finally, I need to get a dataframe whose rows are the texts, whose values are the tf-idf scores, and whose columns are the words, like this: good movie wooow very bad Sentiment tf_idf tf_idf tf_idf tf_idf tf_idf positive (same thing for the 2 remaining lines) MaxU I'd
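The target table, one row per phrase, one tf-idf column per word, plus a Sentiment column, can be sketched with the standard library (the usual route would be sklearn's TfidfVectorizer plus a pandas DataFrame; the smoothed idf below is an illustrative choice):

```python
import math
from collections import Counter

# The question's three preprocessed phrases with their labels
phrases = [
    (["good", "movie"], "positive"),
    (["wooow", "is", "it", "very", "good"], "positive"),
    (["bad", "movie"], "negative"),
]

n = len(phrases)
df = Counter()                      # document frequency per word
for toks, _ in phrases:
    df.update(set(toks))
vocab = sorted(df)

def tfidf_row(toks):
    tf = Counter(toks)
    # Smoothed idf: log(n / df) + 1 (an illustrative variant)
    return {w: tf[w] * (math.log(n / df[w]) + 1.0) for w in vocab}

# One dict per phrase: a tf-idf value per word plus the Sentiment column
table = [dict(tfidf_row(toks), Sentiment=label) for toks, label in phrases]
```

Words absent from a phrase get 0.0 in that row, so every row has the same columns and the list of dicts converts directly into a dataframe.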

Sentence to Word Table with R

Submitted by 南楼画角 on 2019-12-03 21:41:08
I have some sentences, and from each sentence I want to separate the words to get one row vector per sentence. But the words repeat so as to match the longest sentence's row vector, which I do not want: no matter how long a sentence is, its row vector should contain each of its words only once. sentence <- c("case sweden", "meeting minutes ht board meeting st march now also attachment added agenda today s board meeting", "draft meeting minutes board meeting final meeting minutes ht board meeting rd april") sentence <- cbind(sentence) word_table <- do.call(rbind, strsplit(as
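The behaviour the asker wants, each sentence's words listed once and short rows padded rather than recycled, is what Python's dict.fromkeys gives for free. A sketch using the question's own three sentences:

```python
sentences = [
    "case sweden",
    "meeting minutes ht board meeting st march now also attachment "
    "added agenda today s board meeting",
    "draft meeting minutes board meeting final meeting minutes ht "
    "board meeting rd april",
]

# dict.fromkeys keeps first occurrences in order and drops repeats
rows = [list(dict.fromkeys(s.split())) for s in sentences]

# Pad with empty strings instead of recycling words (R's rbind recycles)
width = max(len(r) for r in rows)
word_table = [r + [""] * (width - len(r)) for r in rows]
```

In R the analogous move is to wrap each split in unique() and pad with NA before rbind, since rbind on unequal lengths silently recycles.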

Support Vector Machine works on Training-set but not on Test-set in R (using e1071)

Submitted by 我只是一个虾纸丫 on 2019-12-03 21:27:32
I'm using a support vector machine for my document classification task! It classifies all the articles in my training set, but fails to classify the ones in my test set! trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here's my (not so beautiful) code: # create data.frame with labelled sentences labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T)) # create training set and test set traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")]) testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")]) # Vector,
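A frequent cause of this symptom is building the test DTM with its own vocabulary, so its columns no longer line up with the columns the model was trained on. A Python sketch of the fix, fixing the vocabulary on the training set and projecting the test set onto it (the documents here are hypothetical stand-ins for the spreadsheet data):

```python
from collections import Counter

# Hypothetical labelled articles
train_docs = ["economy stocks market rally", "football match goal score"]
test_docs = ["stocks rally continues", "late goal wins the match"]

# Fix the vocabulary on the TRAINING set only
vocab = sorted(set(w for d in train_docs for w in d.split()))

def dtm(docs):
    # Project every document onto the training vocabulary; words the
    # model never saw are dropped rather than becoming new columns
    rows = []
    for d in docs:
        tf = Counter(d.split())
        rows.append([tf[w] for w in vocab])
    return rows

train_dtm = dtm(train_docs)
test_dtm = dtm(test_docs)
```

In tm the equivalent is building the test DTM with control = list(dictionary = Terms(trainDTM)), so both matrices share identical columns.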

How to write custom removePunctuation() function to better deal with Unicode chars?

Submitted by 有些话、适合烂在心里 on 2019-12-03 20:02:57
Question: In the source code of the tm text-mining R package, in file transform.R, there is the removePunctuation() function, currently defined as: function(x, preserve_intra_word_dashes = FALSE) { if (!preserve_intra_word_dashes) gsub("[[:punct:]]+", "", x) else { # Assume there are no ASCII 1 characters. x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x) x <- gsub("[[:punct:]]+", "", x) gsub("\1", "-", x, fixed = TRUE) } } I need to parse and mine some abstracts from a science conference (fetched from their
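A Unicode-friendly variant of tm's placeholder trick is easy to sketch in Python, where re treats \w as Unicode-aware by default: protect intra-word dashes with an ASCII 0x01 placeholder, strip anything that is neither a word character nor whitespace, then restore the dashes:

```python
import re

def remove_punctuation(text, preserve_intra_word_dashes=False):
    if not preserve_intra_word_dashes:
        # Keep Unicode letters/digits; drop everything else but whitespace
        return re.sub(r"[^\w\s]", "", text)
    # Assume no ASCII 0x01 characters occur in the input (as tm does)
    text = re.sub(r"(\w)-(\w)", "\\1\x01\\2", text)
    text = re.sub(r"[^\w\s\x01]", "", text)
    return text.replace("\x01", "-")

print(remove_punctuation("state-of-the-art (naïve) models!", True))
# prints: state-of-the-art naïve models
```

The negated class [^\w\s] avoids [[:punct:]]'s ASCII bias: accented characters such as ï match \w and survive, while any Unicode punctuation falls through and is removed.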

R tm package: create matrix of N most frequent terms

Submitted by 蹲街弑〆低调 on 2019-12-03 13:43:43
I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms. When I try to convert it to a matrix I get this error: > ap.m <- as.matrix(mydata.dtm) Error: cannot allocate vector of size 2.0 Gb So I tried converting to a sparse matrix using the Matrix package: > A <- as(mydata.dtm, "sparseMatrix") Error in as(from, "CsparseMatrix") : no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix" > B <- Matrix(mydata.dtm, sparse = TRUE) Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as
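If only the top-N term totals are needed, the dense conversion can be skipped entirely by accumulating per-term counts as the documents stream by. A Python sketch with a toy corpus (in R, summing rows of the sparse TermDocumentMatrix, e.g. via the slam package it is built on, achieves the same without as.matrix()):

```python
from collections import Counter

# Toy corpus standing in for the documents behind the TermDocumentMatrix
docs = ["apple banana apple", "banana cherry", "apple cherry cherry cherry"]

# Accumulate per-term totals; no dense term-document matrix is ever built
totals = Counter()
for d in docs:
    totals.update(d.split())

top_terms = totals.most_common(2)   # most_common(50) for the real corpus
print(top_terms)  # prints [('cherry', 4), ('apple', 3)]
```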