document-classification

How to change attribute type to String (WEKA - CSV to ARFF)

Submitted by 空扰寡人 on 2019-12-08 23:15:31
I'm trying to build an SMS spam classifier using the WEKA library. I have a CSV file with "label" and "text" headings. When I use the code below, it creates an ARFF file with two attributes:

@attribute label {ham,spam}
@attribute text {'Go until jurong point','Ok lar...', etc.}

Currently the text attribute is formatted as a nominal attribute whose values are the individual messages. But I need text to be a String attribute, not a nominal list of the text of every instance. Having text as a String attribute will allow me to use the StringToWordVector filter for …
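Within WEKA itself the usual fix is, as far as I know, either the NominalToString filter or CSVLoader's option for forcing attributes to string type. To show the target ARFF shape the question is after, here is a minimal pure-Python sketch (the `csv_to_arff` helper and its column names are illustrative, not part of WEKA) that declares text as a string attribute rather than a nominal one:

```python
import csv
import io

def csv_to_arff(csv_text, relation="sms"):
    """Convert a label,text CSV into ARFF, declaring text as a *string*
    attribute (not nominal) so filters like StringToWordVector apply."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    labels = sorted({r["label"] for r in rows})
    lines = ["@relation " + relation,
             "@attribute label {%s}" % ",".join(labels),
             "@attribute text string",   # string, not an enumeration of messages
             "@data"]
    for r in rows:
        # ARFF string values are quoted; escape internal quotes
        text = r["text"].replace("'", "\\'")
        lines.append("%s,'%s'" % (r["label"], text))
    return "\n".join(lines)

arff = csv_to_arff("label,text\nham,Go until jurong point\nspam,WINNER!! Claim now")
print(arff)
```

The point is the header line `@attribute text string`: each instance keeps its own text value, and nothing enumerates the messages as nominal categories.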

DocumentTermMatrix fails with a strange error only when # terms > 3000

Submitted by 岁酱吖の on 2019-12-06 00:32:34
Question: My code below works fine unless I create a DocumentTermMatrix with more than 3000 terms. These lines:

movie_dict <- findFreqTerms(movie_dtm_train, 8)
movie_dtm_hiFq_train <- DocumentTermMatrix(movie_corpus_train, list(dictionary = movie_dict))

fail with:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) : all scheduled cores encountered …
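The warning points at `mclapply`, i.e. worker processes failing during parallel term counting, which is why a commonly suggested workaround (an assumption here, not stated in the excerpt) is forcing a single core in R, e.g. `options(mc.cores = 1)`. To make the error message itself less mysterious, here is a pure-Python sketch of what `simple_triplet_matrix` receives: a dictionary-restricted document-term matrix as three parallel lists `(i, j, v)`, which must all have the same length:

```python
from collections import Counter

def dtm_triplets(docs, dictionary):
    """Build a document-term matrix restricted to `dictionary`,
    in (i, j, v) triplet form like R's simple_triplet_matrix:
    row index, column index, and count for each nonzero cell."""
    index = {t: j for j, t in enumerate(dictionary)}
    i_idx, j_idx, vals = [], [], []
    for i, doc in enumerate(docs):
        counts = Counter(doc.lower().split())
        for term, n in counts.items():
            if term in index:            # terms outside the dictionary are dropped
                i_idx.append(i)
                j_idx.append(index[term])
                vals.append(n)
    # the three parallel lists must always stay equal in length --
    # the R error says exactly this invariant was violated
    assert len(i_idx) == len(j_idx) == len(vals)
    return i_idx, j_idx, vals

docs = ["great movie great cast", "bad movie"]
i, j, v = dtm_triplets(docs, ["great", "movie", "bad"])
```

When one parallel worker dies, its chunk of counts goes missing and the lists come out with different lengths, hence the error.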

Example for Stanford NLP Classifier

Submitted by 余生长醉 on 2019-12-05 20:59:17
I am trying to learn the Stanford NLP Classifier and would like to work on document classification. Can anyone suggest a place where I can find a working example? I was also looking at the OpenNLP libraries and was able to find many working examples, like http://tharindu-rusira.blogspot.com/2013/12/opennlp-text-classifier.html As we can see there, it is quite easy to figure out what's going on and create a small working prototype. However, I can't find a similarly simple example for Stanford NLP that shows me: how to specify training data for a classifier, and how to train a model.
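As I understand it, the usual entry point for document classification in the Stanford Classifier is `ColumnDataClassifier`, which reads tab-separated training data (one column holding the gold class) plus a properties file. The sketch below only generates those two input files; the file names, example rows, and exact property set are assumptions based on the library's conventional column-data example, not taken from the excerpt:

```python
# Sketch: write the tab-separated training file and the properties file
# that Stanford's ColumnDataClassifier conventionally consumes.
# File names and property values here are illustrative assumptions.
train_rows = [("spam", "WINNER!! Claim your prize now"),
              ("ham", "See you at lunch tomorrow")]

with open("train.tsv", "w") as f:
    for label, text in train_rows:
        f.write(label + "\t" + text + "\n")   # column 0 = gold class, column 1 = text

props = """\
goldAnswerColumn=0
useClassFeature=true
1.useSplitWords=true
1.splitWordsRegexp=\\\\s+
trainFile=train.tsv
"""
with open("sms.prop", "w") as f:
    f.write(props)

# Training then happens outside Python, roughly:
# java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop sms.prop
```

`goldAnswerColumn` names the label column, and the `1.*` properties tell the classifier to split column 1 into word features.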

How to implement TF_IDF feature weighting with Naive Bayes

Submitted by 随声附和 on 2019-12-04 13:00:20
I'm trying to implement a naive Bayes classifier for sentiment analysis, and I plan to use the TF-IDF weighting measure. I'm just a little stuck now: NB generally uses word (feature) frequencies to find the maximum likelihood, so how do I introduce the TF-IDF weighting measure into naive Bayes? You can visit the following blog, which shows in detail how to calculate TF-IDF. You use the TF-IDF weights as features/predictors in your statistical model. I suggest using either gensim [1] or scikit-learn [2] to compute the weights, which you then pass to your naive Bayes fitting procedure. The scikit-learn …
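The answer's suggestion amounts to replacing raw counts with TF-IDF values as the feature weights fed into the NB fitting step (this is what chaining scikit-learn's TfidfVectorizer into MultinomialNB does). A hand-rolled sketch of the weights themselves, using the simple idf = log(N / df) variant (an assumption; libraries use smoothed variants):

```python
import math
from collections import Counter

def tfidf(docs):
    """Return per-document {term: tf * idf} dicts, with idf = log(N / df)."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                       # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log(N / df[t]) for t in df}
    weighted = []
    for toks in tokenized:
        tf = Counter(toks)               # raw term frequency in this document
        weighted.append({t: tf[t] * idf[t] for t in tf})
    return weighted, idf

docs = ["good good movie", "bad movie"]
weights, idf = tfidf(docs)
# "movie" occurs in every document, so its idf (and weight) is 0;
# "good" is distinctive, so it carries positive weight.
```

These weighted values then stand in for the counts in the multinomial NB likelihood, down-weighting terms that appear everywhere.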

How to include words as numerical feature in classification

Submitted by 梦想与她 on 2019-12-03 13:29:34
Question: What's the best method for using the words themselves as features in a machine learning algorithm? The problem: I have to extract word-related features from a particular paragraph. Should I use the word's index in the dictionary as its numerical feature? If so, how would I normalize these? In general, how are words themselves used as features in NLP? Answer 1: There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix in which the rows are the individual data …
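The core of the conventional answer is that a dictionary index should pick *which* column a word sets, never act as a magnitude (index 4023 is not "bigger" than index 7 in any meaningful sense). A minimal bag-of-words sketch of that mapping (the `bag_of_words` helper is illustrative):

```python
def bag_of_words(docs):
    """One column per vocabulary word; a word's dictionary index chooses
    the column to increment, so no arithmetic on indices is implied."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    col = {w: j for j, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.lower().split():
            v[col[w]] += 1               # count occurrences in this document
        vectors.append(v)
    return vocab, vectors

vocab, X = bag_of_words(["the cat sat", "the dog sat sat"])
```

Normalization then operates on these count columns (e.g. dividing by document length or applying TF-IDF), not on the indices.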

Scalable or online out-of-core multi-label classifiers

Submitted by 时光怂恿深爱的人放手 on 2019-12-02 23:31:23
I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels. I have around 4.5 million text documents as training data and around 1 million as test data, with around 35K labels. I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not scalable enough given the number of documents that I have.

vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words= …
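What makes HashingVectorizer stream over millions of documents is that it keeps no vocabulary at all: each token is hashed straight to a fixed column. A pure-Python sketch of that hashing trick (using md5 for a stable hash, since Python's built-in `hash` is randomized per process; `n_features` here is tiny for illustration, real settings use 2^18 or more):

```python
import hashlib

def hashed_vector(doc, n_features=16):
    """Hashing trick: token -> column via a stable hash; no vocabulary is
    stored, so the vector width is fixed regardless of corpus size,
    at the cost of occasional hash collisions between tokens."""
    v = [0] * n_features
    for tok in doc.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        v[h % n_features] += 1
    return v

a = hashed_vector("spam spam ham")
b = hashed_vector("spam spam ham")
```

Because the transform is stateless, it can be applied document by document (or batch by batch) without ever fitting on the full 4.5M-document corpus, which pairs naturally with out-of-core learners trained via `partial_fit`.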

How to calculate TF*IDF for a single new document to be classified?

Submitted by 折月煮酒 on 2019-12-02 18:13:11
I am using document-term vectors to represent a collection of documents, with TF*IDF as the term weight in each document vector. I can then use this matrix to train a model for document classification, but later I will need to classify new documents. To classify a new document, I first have to turn it into a document-term vector, and that vector should be composed of TF*IDF values too. My question is: how can I calculate TF*IDF for just a single document? As far as I understand, TF can be calculated based on the single document itself, but the IDF can only …
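The standard resolution of this question is that IDF is a property of the *training* corpus: you compute it once at training time, keep it, and reuse it when vectorizing any new document (this is exactly the fit/transform split in libraries like scikit-learn). A sketch of that split, with terms unseen in training simply dropped (one common policy; smoothing is another):

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """idf = log(N / df), computed once, on the training corpus only."""
    N = len(train_docs)
    df = Counter()
    for d in train_docs:
        df.update(set(d.lower().split()))
    return {t: math.log(N / df[t]) for t in df}

def transform(doc, idf):
    """TF comes from the single new document; IDF is reused from training.
    Terms never seen in training are dropped, since no idf exists for them."""
    tf = Counter(doc.lower().split())
    return {t: tf[t] * idf[t] for t in tf if t in idf}

idf = fit_idf(["win cash now", "meeting at noon", "cash prize now"])
vec = transform("win big cash", idf)   # "big" is dropped: unseen in training
```

Dropping unseen terms is also what the trained model requires: it has no learned weight for a column that did not exist at training time.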

Text classification/categorization algorithm [closed]

Submitted by 江枫思渺然 on 2019-12-02 17:40:45
My objective is to [semi-]automatically assign texts to different categories. There is a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm, and perhaps a .NET library that implements it? Ralph M. Rickenbach: Doing this is not trivial. Obviously you can build a dictionary that maps certain keywords to categories; just finding a keyword would suggest a certain category. Yet, in natural-language text, the keywords would …
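The answer's keyword-dictionary idea makes a useful baseline before reaching for a learned classifier. A minimal sketch (in Python rather than .NET, purely for illustration; the `keyword_classify` helper and its rules are invented):

```python
def keyword_classify(text, keyword_map):
    """Naive keyword-dictionary baseline: score each category by how many
    of its keywords occur in the text. A learned model (e.g. Naive Bayes
    trained on the human-labeled texts) would replace this scoring."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in keyword_map.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None: no keyword matched

rules = {"sports": {"match", "goal", "team"},
         "finance": {"stock", "market", "shares"}}
label = keyword_classify("the team scored a late goal", rules)
```

Its weaknesses (inflection, synonyms, keywords shared across categories) are exactly what the answer goes on to warn about, and why a trained statistical classifier usually wins.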

Suppressing the output in libsvm (python)

Submitted by 浪子不回头ぞ on 2019-12-01 06:49:21
I am using libsvm (svmutil) from Python for a classification task. The classifier works correctly; however, I am getting output like this:

* optimization finished, #iter = 75
nu = 0.000021
obj = -0.024330, rho = 0.563710
nSV = 26, nBSV = 0
Total nSV = 26
* optimization finished, #iter = 66
nu = 0.000030
obj = -0.035536, rho = -0.500676
nSV = 21, nBSV = 0
Total nSV = 21
* optimization finished, #iter = 78
nu = 0.000029
obj = -0.033921, rho = -0.543311
nSV = 23, nBSV = 0
Total nSV = 23
* optimization finished, #iter = 90
nu = 0.000030
obj = -0.035333, rho = -0.634721
nSV = 23, nBSV = 0
Total nSV = 23
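That chatter is printed by libsvm's C code, so redirecting `sys.stdout` at the Python level will not catch it. Two common fixes (both assumptions about the usual remedies, since the excerpt cuts off before any answer): pass libsvm's quiet flag in the parameter string if the bindings support it (e.g. `svm_train(y, x, '-q')`), or silence file descriptor 1 directly. A self-contained sketch of the fd-level approach:

```python
import os
import sys
from contextlib import contextmanager

@contextmanager
def silence_stdout():
    """Temporarily point fd 1 at /dev/null, muting output even from
    C extensions (like libsvm's training messages) that bypass sys.stdout."""
    sys.stdout.flush()
    saved = os.dup(1)                        # keep the real stdout
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, 1)                  # fd 1 -> /dev/null
        yield
    finally:
        sys.stdout.flush()                   # drop anything still buffered
        os.dup2(saved, 1)                    # restore the real stdout
        os.close(saved)
        os.close(devnull)

with silence_stdout():
    print("this line is swallowed")          # would also swallow libsvm's output
print("this line appears")
```

Usage would be wrapping the noisy call, e.g. `with silence_stdout(): model = svm_train(problem, param)`.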
