text-mining

Better text document clustering than tf/idf and cosine similarity?

Submitted by 血红的双手 on 2019-12-04 07:53:15
Question: I'm trying to cluster the Twitter stream. I want to assign each tweet to a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results were quite bad. The main disadvantage of tf/idf is that it clusters documents that are keyword-similar, so it is only good for identifying near-identical documents. For example, consider the following sentences: 1- The website Stackoverflow is a nice place
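The tf/idf-plus-cosine pipeline the question describes can be sketched in plain Python. The two sentences and the smoothed idf formula below are illustrative assumptions, not the asker's data; the point is that the score rewards shared keywords, which is exactly the weakness described:

```python
import math
from collections import Counter

# Two illustrative near-identical sentences (assumed, not from the stream)
docs = [
    "the website stackoverflow is a nice place",
    "stackoverflow is a nice website",
]

# Raw term frequencies per document
tfs = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*tfs))
n = len(docs)

# Smoothed inverse document frequency (an illustrative variant)
idf = {t: math.log(n / sum(1 for tf in tfs if t in tf)) + 1.0 for t in vocab}

def tfidf_vec(tf):
    return [tf[t] * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

sim = cosine(tfidf_vec(tfs[0]), tfidf_vec(tfs[1]))
```

Two tweets on the same topic that share no vocabulary would score 0 under this measure, which is why purely lexical tf/idf struggles with topical clustering.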

Big Text Corpus breaks tm_map

Submitted by 怎甘沉沦 on 2019-12-04 06:46:52
I have been breaking my head over this one for the last few days. I searched all the SO archives and tried the suggested solutions, but just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99 etc., and want to run some basic text-mining operations such as creating a document-term matrix and a term-document matrix and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus, it fails. I have pasted in the code for one such folder operation. library(tm) # Framework for text
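The asker's R code is cut off, but the underlying task, building a document-term matrix one document at a time so a large corpus need not sit in memory at once, can be sketched with the Python standard library (the three short documents below are placeholders):

```python
from collections import Counter

# Placeholder documents standing in for the txt files in each folder
docs = ["text mining with tm", "big text corpus", "corpus operations"]

# Count one document at a time, so only per-document token counts
# (plus the running vocabulary) are ever held at once
counts = [Counter(d.lower().split()) for d in docs]
vocab = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per term
dtm = [[c[t] for t in vocab] for c in counts]
```

For a corpus that genuinely exceeds RAM, the same incremental idea applies: stream each file, keep sparse per-document counts, and only materialise rows on demand.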

Memory error in python using numpy array

Submitted by 夙愿已清 on 2019-12-04 06:08:08
Question: I am getting the following error for this code: model = lda.LDA(n_topics=15, n_iter=50, random_state=1) model.fit(X) topic_word = model.topic_word_ print("type(topic_word): {}".format(type(topic_word))) print("shape: {}".format(topic_word.shape)) print ("\n") n = 15 doc_topic=model.doc_topic_ for i in range(15): print("{} (top topic: {})".format(titles[i], doc_topic[0][i].argmax())) topic_csharp=np.zeros(shape=[1,n]) np.copyto(topic_csharp,doc_topic[0][i]) for i, topic_dist in enumerate(topic
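Aside from the memory error, the loop in the excerpt indexes doc_topic[0][i], i.e. always row 0 of the document-topic matrix; the top topic per document comes from the argmax over each document's own row. A minimal pure-Python sketch with a hypothetical doc_topic matrix:

```python
# Hypothetical doc_topic matrix: one row per document, one column per topic
doc_topic = [
    [0.1, 0.7, 0.2],   # document 0 is mostly topic 1
    [0.6, 0.3, 0.1],   # document 1 is mostly topic 0
]
titles = ["doc_a", "doc_b"]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

# Index the matrix by document (row i), not doc_topic[0][i]
top_topics = [argmax(doc_topic[i]) for i in range(len(titles))]
print(top_topics)  # prints [1, 0]
```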

Information Gain Calculation for a text file?

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-04 05:27:36
Question: I'm working on "text categorization using Information Gain, PCA and Genetic Algorithm". But after performing preprocessing (stemming, stopword removal, TFIDF) on the documents, I'm confused about how to move ahead to the information-gain part. My output file contains each word and its TFIDF value, like: WORD - TFIDF VALUE together(word) - 0.235(tfidf value) come(word) - 0.2548(tfidf value) When using Weka for information gain ("InfoGainAttributeEval.java"), it requires the .arff file format as input. Is there any to
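Information gain for a term can also be computed directly, without Weka's .arff input, as IG(class, word) = H(class) - H(class | word). Note that this needs class labels and per-document presence/absence of the word, not just TF-IDF scores. A sketch over a tiny hypothetical labeled set:

```python
import math
from collections import Counter

# Hypothetical labeled documents: (set of tokens, class label)
data = [
    ({"come", "together"}, "pos"),
    ({"come"}, "pos"),
    ({"together"}, "neg"),
    ({"go"}, "neg"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(word):
    labels = [y for _, y in data]
    with_w = [y for toks, y in data if word in toks]
    without = [y for toks, y in data if word not in toks]
    # Conditional entropy H(class | word), weighted by split sizes
    cond = (len(with_w) * entropy(with_w) +
            len(without) * entropy(without)) / len(data)
    return entropy(labels) - cond
```

Here info_gain("come") is 1.0 because "come" perfectly separates the two classes, while info_gain("together") is 0 because it appears equally in both.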

Calculate word co-occurrence matrix in R

Submitted by 不羁岁月 on 2019-12-04 04:51:49
Question: I would like to calculate a word co-occurrence matrix in R. I have the following data frame of sentences: dat <- as.data.frame("The boy is tall.", header = F, stringsAsFactors = F) dat[2,1] <- c("The girl is short.") dat[3,1] <- c("The tall boy and the short girl are friends.") which gives me: The boy is tall. The girl is short. The tall boy and the short girl are friends. What I want to do is, firstly, make a list of all of the unique words across all three sentences, namely The boy is tall
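The sentence-level co-occurrence count the question asks for can be sketched in Python over the question's own three sentences (in R one would typically compute crossprod() of a document-term matrix; this is the same idea, counted pairwise):

```python
from collections import Counter
from itertools import combinations

sentences = [
    "The boy is tall.",
    "The girl is short.",
    "The tall boy and the short girl are friends.",
]

def tokens(s):
    return [w.strip(".").lower() for w in s.split()]

# For each alphabetically ordered pair of distinct words, count
# how many sentences contain both
cooc = Counter()
for s in sentences:
    for a, b in combinations(sorted(set(tokens(s))), 2):
        cooc[(a, b)] += 1

print(cooc[("boy", "tall")])  # prints 2
```

"boy" and "tall" co-occur in sentences 1 and 3, hence the count of 2; a full symmetric matrix is just this Counter laid out over the sorted vocabulary.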

data frame of tfidf with python

Submitted by 痞子三分冷 on 2019-12-04 02:53:23
I have to classify some sentiments. My data frame looks like this: Phrase Sentiment is it good movie positive wooow is it very goode positive bad movie negative I did some preprocessing (tokenisation, stop-word removal, stemming, etc.) and I get: Phrase Sentiment [good, movie] positive [wooow, is, it, very, good] positive [bad, movie] negative Finally, I need to get a dataframe whose rows are the texts, whose values are the tf-idf scores, and whose columns are the words, like this: good movie wooow very bad Sentiment tf_idf tf_idf tf_idf tf_idf tf_idf positive (same thing for the 2 remaining lines) MaxU I'd
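The target table, one row per phrase, one tf-idf column per word, plus a Sentiment column, can be sketched with the standard library (the usual route would be sklearn's TfidfVectorizer plus a pandas DataFrame; the smoothed idf below is an illustrative choice):

```python
import math
from collections import Counter

# The question's three preprocessed phrases with their labels
phrases = [
    (["good", "movie"], "positive"),
    (["wooow", "is", "it", "very", "good"], "positive"),
    (["bad", "movie"], "negative"),
]

n = len(phrases)
df = Counter()                      # document frequency per word
for toks, _ in phrases:
    df.update(set(toks))
vocab = sorted(df)

def tfidf_row(toks):
    tf = Counter(toks)
    # Smoothed idf: log(n / df) + 1 (an illustrative variant)
    return {w: tf[w] * (math.log(n / df[w]) + 1.0) for w in vocab}

# One dict per phrase: a tf-idf value per word plus the Sentiment column
table = [dict(tfidf_row(toks), Sentiment=label) for toks, label in phrases]
```

Words absent from a phrase get 0.0 in that row, so every row has the same columns and the list of dicts converts directly into a dataframe.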

Sentence to Word Table with R

Submitted by 南楼画角 on 2019-12-03 21:41:08
I have some sentences, and from each sentence I want to separate the words to get one row vector per sentence. But the words repeat so as to match the longest sentence's row vector, which I do not want: no matter how long a sentence is, its row vector should contain each of its words only once. sentence <- c("case sweden", "meeting minutes ht board meeting st march now also attachment added agenda today s board meeting", "draft meeting minutes board meeting final meeting minutes ht board meeting rd april") sentence <- cbind(sentence) word_table <- do.call(rbind, strsplit(as
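The behaviour the asker wants, each sentence's words listed once and short rows padded rather than recycled, is what Python's dict.fromkeys gives for free. A sketch using the question's own three sentences:

```python
sentences = [
    "case sweden",
    "meeting minutes ht board meeting st march now also attachment "
    "added agenda today s board meeting",
    "draft meeting minutes board meeting final meeting minutes ht "
    "board meeting rd april",
]

# dict.fromkeys keeps first occurrences in order and drops repeats
rows = [list(dict.fromkeys(s.split())) for s in sentences]

# Pad with empty strings instead of recycling words (R's rbind recycles)
width = max(len(r) for r in rows)
word_table = [r + [""] * (width - len(r)) for r in rows]
```

In R the analogous move is to wrap each split in unique() and pad with NA before rbind, since rbind on unequal lengths silently recycles.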

Support Vector Machine works on Training-set but not on Test-set in R (using e1071)

Submitted by 我只是一个虾纸丫 on 2019-12-03 21:27:32
I'm using a support vector machine for my document classification task! It classifies all the articles in my training set, but fails to classify the ones in my test set! trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here's my (not so beautiful) code: # create data.frame with labelled sentences labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header=T)) # create training set and test set traindata <- as.data.frame(labeled[1:700,c("ARTICLE","CLASS")]) testdata <- as.data.frame(labeled[701:1000, c("ARTICLE","CLASS")]) # Vector,
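A frequent cause of this symptom is building the test DTM with its own vocabulary, so its columns no longer line up with the columns the model was trained on. A Python sketch of the fix, fixing the vocabulary on the training set and projecting the test set onto it (the documents here are hypothetical stand-ins for the spreadsheet data):

```python
from collections import Counter

# Hypothetical labelled articles
train_docs = ["economy stocks market rally", "football match goal score"]
test_docs = ["stocks rally continues", "late goal wins the match"]

# Fix the vocabulary on the TRAINING set only
vocab = sorted(set(w for d in train_docs for w in d.split()))

def dtm(docs):
    # Project every document onto the training vocabulary; words the
    # model never saw are dropped rather than becoming new columns
    rows = []
    for d in docs:
        tf = Counter(d.split())
        rows.append([tf[w] for w in vocab])
    return rows

train_dtm = dtm(train_docs)
test_dtm = dtm(test_docs)
```

In tm the equivalent is building the test DTM with control = list(dictionary = Terms(trainDTM)), so both matrices share identical columns.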

How to write custom removePunctuation() function to better deal with Unicode chars?

Submitted by 有些话、适合烂在心里 on 2019-12-03 20:02:57
Question: In the source code of the tm text-mining R package, in file transform.R, there is the removePunctuation() function, currently defined as: function(x, preserve_intra_word_dashes = FALSE) { if (!preserve_intra_word_dashes) gsub("[[:punct:]]+", "", x) else { # Assume there are no ASCII 1 characters. x <- gsub("(\\w)-(\\w)", "\\1\1\\2", x) x <- gsub("[[:punct:]]+", "", x) gsub("\1", "-", x, fixed = TRUE) } } I need to parse and mine some abstracts from a science conference (fetched from their
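A Unicode-friendly variant of tm's placeholder trick is easy to sketch in Python, where re treats \w as Unicode-aware by default: protect intra-word dashes with an ASCII 0x01 placeholder, strip anything that is neither a word character nor whitespace, then restore the dashes:

```python
import re

def remove_punctuation(text, preserve_intra_word_dashes=False):
    if not preserve_intra_word_dashes:
        # Keep Unicode letters/digits; drop everything else but whitespace
        return re.sub(r"[^\w\s]", "", text)
    # Assume no ASCII 0x01 characters occur in the input (as tm does)
    text = re.sub(r"(\w)-(\w)", "\\1\x01\\2", text)
    text = re.sub(r"[^\w\s\x01]", "", text)
    return text.replace("\x01", "-")

print(remove_punctuation("state-of-the-art (naïve) models!", True))
# prints: state-of-the-art naïve models
```

The negated class [^\w\s] avoids [[:punct:]]'s ASCII bias: accented characters such as ï match \w and survive, while any Unicode punctuation falls through and is removed.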

R tm package: create matrix of N most frequent terms

Submitted by 蹲街弑〆低调 on 2019-12-03 13:43:43
I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms. When I try to convert it to a matrix I get this error: > ap.m <- as.matrix(mydata.dtm) Error: cannot allocate vector of size 2.0 Gb So I tried converting to a sparse matrix using the Matrix package: > A <- as(mydata.dtm, "sparseMatrix") Error in as(from, "CsparseMatrix") : no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix" > B <- Matrix(mydata.dtm, sparse = TRUE) Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as
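If only the top-N term totals are needed, the dense conversion can be skipped entirely by accumulating per-term counts as the documents stream by. A Python sketch with a toy corpus (in R, summing rows of the sparse TermDocumentMatrix, e.g. via the slam package it is built on, achieves the same without as.matrix()):

```python
from collections import Counter

# Toy corpus standing in for the documents behind the TermDocumentMatrix
docs = ["apple banana apple", "banana cherry", "apple cherry cherry cherry"]

# Accumulate per-term totals; no dense term-document matrix is ever built
totals = Counter()
for d in docs:
    totals.update(d.split())

top_terms = totals.most_common(2)   # most_common(50) for the real corpus
print(top_terms)  # prints [('cherry', 4), ('apple', 3)]
```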