document-classification

How to implement TF_IDF feature weighting with Naive Bayes

Submitted by 有些话、适合烂在心里 on 2020-01-01 17:28:08
Question: I'm trying to implement a naive Bayes classifier for sentiment analysis and plan to use the TF-IDF weighting measure, but I'm a little stuck: NB generally uses word (feature) frequencies to estimate the maximum-likelihood parameters, so how do I introduce TF-IDF weighting into naive Bayes? Answer 1: You can visit the following blog, which shows in detail how to calculate TF-IDF. Answer 2: You use the TF-IDF weights as features/predictors in your statistical model. I suggest using either gensim [1] or
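One standard trick (the same idea behind feeding TF-IDF features to scikit-learn's MultinomialNB) is to treat the TF-IDF weights as fractional "counts" when estimating the class-conditional term probabilities. A minimal pure-Python sketch of this idea follows; all function names are illustrative, not from any library:

```python
import math
from collections import defaultdict

def tfidf(docs):
    # docs: list of token lists -> list of {term: tf-idf weight}
    n = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for t in set(doc):
            df[t] += 1
    weighted = []
    for doc in docs:
        tf = defaultdict(int)
        for t in doc:
            tf[t] += 1
        weighted.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weighted

def train_nb(docs, labels):
    # Accumulate per-class TF-IDF mass per term instead of raw counts.
    weights = tfidf(docs)
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    term_mass = {c: defaultdict(float) for c in classes}
    total_mass = {c: 0.0 for c in classes}
    vocab = set()
    for w, y in zip(weights, labels):
        for t, v in w.items():
            term_mass[y][t] += v
            total_mass[y] += v
            vocab.add(t)
    return prior, term_mass, total_mass, vocab

def predict(doc, model):
    prior, term_mass, total_mass, vocab = model
    best, best_lp = None, -math.inf
    for c in prior:
        lp = math.log(prior[c])
        for t in doc:
            # Laplace smoothing over the shared vocabulary.
            lp += math.log((term_mass[c][t] + 1.0) / (total_mass[c] + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

With raw counts in place of TF-IDF weights this reduces to ordinary multinomial NB; with TF-IDF, frequent-but-uninformative terms contribute less likelihood mass to any class.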

How to change attribute type to String (WEKA - CSV to ARFF)

Submitted by 可紊 on 2019-12-23 02:57:19
Question: I'm trying to build an SMS spam classifier using the WEKA library. I have a CSV file with "label" and "text" headings. When I use the code below, it creates an ARFF file with two attributes: @attribute label {ham,spam} @attribute text {'Go until jurong point','Ok lar...', etc.} Currently the text attribute is formatted as a nominal attribute with each message's text as a value, but I need the text attribute to be a String attribute, not a list of all of the text from all
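Two common fixes, assuming a recent Weka version: pass the string-attributes option to CSVLoader (the `-S` flag / `setStringAttributes`), or apply the NominalToString filter after loading. Alternatively the ARFF file can be written directly; a stdlib-only Python sketch (relation name and escaping are illustrative, not a full ARFF writer):

```python
import csv  # would be used to read the source CSV in practice

def csv_to_arff(rows, relation="sms"):
    # rows: list of (label, text) pairs; returns ARFF text with the
    # text column declared as `string` rather than nominal.
    labels = sorted({label for label, _ in rows})
    lines = [f"@relation {relation}", "",
             "@attribute label {" + ",".join(labels) + "}",
             "@attribute text string", "", "@data"]
    for label, text in rows:
        escaped = text.replace("'", r"\'")
        lines.append(f"{label},'{escaped}'")
    return "\n".join(lines)
```

The key line is `@attribute text string`; Weka's StringToWordVector filter can then tokenize that attribute for classification.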

Scalable or online out-of-core multi-label classifiers

Submitted by 醉酒当歌 on 2019-12-20 10:49:23
Question: I have been racking my brains over this problem for the past 2-3 weeks. I have a multi-label (not multi-class) problem where each sample can belong to several labels. I have around 4.5 million text documents as training data and around 1 million as test data, with around 35K labels. I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but not that scalable given the number
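HashingVectorizer scales because it replaces the fitted vocabulary with a hash function, so it needs no per-corpus state and pairs naturally with classifiers that support `partial_fit` (e.g. SGDClassifier) for out-of-core training. The core hashing trick, sketched in pure Python with an illustrative hash choice:

```python
import zlib

def hash_vector(tokens, n_features=2**20):
    # Feature hashing: map each token to a fixed-size index, so the
    # vectorizer needs no in-memory vocabulary (the idea behind
    # scikit-learn's HashingVectorizer). Returns a sparse dict.
    vec = {}
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))
        idx = h % n_features
        # Signed hashing: use one hash bit as the sign so collisions
        # tend to cancel rather than accumulate.
        sign = 1 if (h >> 31) & 1 == 0 else -1
        vec[idx] = vec.get(idx, 0) + sign
    return vec
```

Because the index space is fixed up front, documents can be streamed through vectorization and training in mini-batches without ever holding the corpus in memory.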

Text classification/categorization algorithm [closed]

Submitted by 时间秒杀一切 on 2019-12-20 09:19:40
Question: [Closed as off-topic 3 years ago.] My objective is to [semi]automatically assign texts to different categories. There is a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an

How much text can Weka handle?

Submitted by 风流意气都作罢 on 2019-12-12 04:58:20
Question: I have a sentiment analysis task and I need to specify how much data (in my case, text) Weka can handle. I have a corpus of 2,500 opinions, already tagged. I know it's a small corpus, but my thesis advisor is asking me to argue specifically about how much data Weka can handle. Answer 1: Your limitation with Weka will be whatever learning algorithm you use and how much memory you have available for training. Most classifiers require the whole set to be loaded into memory for training, but there are
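A concrete way to argue the limit is to bound the memory a dense document-term matrix would need, since most Weka classifiers keep the whole training set in RAM (sparse instances and a larger Java heap via `-Xmx` push the limit further). A back-of-the-envelope sketch, with a hypothetical vocabulary size:

```python
def dense_matrix_mb(n_docs, vocab_size, bytes_per_cell=8):
    # Rough upper bound: a dense representation stores every cell
    # as a double (8 bytes), whether or not the term occurs.
    return n_docs * vocab_size * bytes_per_cell / 2**20

# 2,500 opinions with an assumed 10,000-term vocabulary comes to
# roughly 191 MB dense -- comfortably within a default heap, which
# supports the claim that this corpus is small for Weka.
```

Real text data is overwhelmingly sparse, so Weka's SparseInstance representation typically needs orders of magnitude less than this bound.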

Libsvm model file format No model number

Submitted by 痴心易碎 on 2019-12-12 01:51:22
Question: I am using libsvm for document classification. I use svm.cc and svm.h in my project, call svm_train, and save the model to a file using svm_save_model. I have three categories. The svm model file is: svm_type c_svc kernel_type rbf gamma 0.001002 nr_class 3 total_sv 9 rho -0.000766337 0.00314423 0.00387654 label 0 1 2 nr_sv 3 3 3 SV 1 1 1:0.001 2:0.001 3:0.012521912 5:0.001 15:0.012521912 17:0.012521912 23:0.001 1 1 1:0.001 2:0.014176543 4:0.093235799 6:0.001 7:0.0058630699 9:0.040529628
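For reference, that header reads as follows: `nr_class 3` means three classes trained one-vs-one, so `rho` lists k(k-1)/2 = 3 bias terms, `label` gives the class labels, `nr_sv` the support vectors per class, and everything after `SV` is the support vectors themselves in index:value format. A small pure-Python sketch that parses such a header (the helper name is illustrative):

```python
def parse_libsvm_header(text):
    # Parse the key/value header of a libsvm model file: everything
    # before the "SV" line that starts the support-vector section.
    params = {}
    for line in text.splitlines():
        if line.strip() == "SV":
            break
        key, *values = line.split()
        # Single values stay scalar; multi-value fields (rho, label,
        # nr_sv) become lists of strings.
        params[key] = values[0] if len(values) == 1 else values
    return params
```

This is handy for sanity-checking a saved model, e.g. confirming that the number of classes and per-class support-vector counts match what training was supposed to produce.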

R: building text Classifier

Submitted by 拈花ヽ惹草 on 2019-12-10 11:47:37
Question: I have a content set that has to be classified based on a few rules. Sample data: 1 chin jeffrey hong kong wednesday october global business reporting cc subramanian raghuveer kumar m santhosh antoo ramesh subject request obtain global icis data dear team appreciate can distribute monthly basis latest global icis data ramesh antoo upon availability regards jeffrey chin associate business risk strategy efficiency brse asia international institutional banking australia new zealand banking group
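A rule-based first pass can be as simple as keyword sets per category. The category names and keywords below are made up for illustration, and the logic is sketched in Python; in R the same shape maps onto `grepl` over a vector of patterns:

```python
def classify_by_rules(text, rules):
    # rules: {category: set of keywords}; returns the first category
    # (in dict order) whose keywords intersect the text, else None.
    tokens = set(text.lower().split())
    for category, keywords in rules.items():
        if tokens & set(keywords):
            return category
    return None
```

Texts the rules fail to match can then be routed to a trained classifier or to manual review, which is the usual shape of a rules-plus-learning pipeline.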

Example for Stanford NLP Classifier

Submitted by 会有一股神秘感。 on 2019-12-10 09:47:28
Question: I am trying to learn the Stanford NLP Classifier and would like to work on the problem of document classification. Can anyone suggest a place where I can find a working example? I was also looking at the OpenNLP libraries and was able to find many working examples, like http://tharindu-rusira.blogspot.com/2013/12/opennlp-text-classifier.html As we can see there, it is quite easy to figure out what's going on and create a small working prototype. However, I can't find a simple example for
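For the Stanford Classifier, the usual entry point is `edu.stanford.nlp.classify.ColumnDataClassifier`, driven by a properties file over tab-separated data. The fragment below is modeled on the `cheese2007.prop` example shipped with the distribution; the file names are placeholders, and exact property support should be checked against your version:

```
# Column 0 holds the gold label, column 1 the document text.
trainFile=train.tsv
testFile=test.tsv
goldAnswerColumn=0
useClassFeature=true
# Character n-gram features over column 1.
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
```

It is then run from the command line, along the lines of `java -cp stanford-classifier.jar edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop`.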

Part of Speech (POS) tag Feature Selection for Text Classification

Submitted by 我怕爱的太早我们不能终老 on 2019-12-09 07:01:42
Question: I have POS-tagged sentences obtained using the Stanford POS tagger, e.g.: The/DT island/NN was/VBD very/RB beautiful/JJ ./. I/PRP love/VBP it/PRP ./. (XML format is also available.) Can anyone explain how to perform feature selection on these POS-tagged sentences and convert them into feature vectors for text classification using machine-learning methods? Answer 1: A simple way to start out would be something like the following (assuming word order is not important for your classification algorithm). First you
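One common starting point is to turn each tagged sentence into a bag of word/tag features plus tag-only features, so the classifier can back off from sparse word features to POS distributions. A pure-Python sketch, with illustrative feature naming:

```python
from collections import Counter

def pos_features(tagged_sentence):
    # tagged_sentence: "The/DT island/NN was/VBD ..." in the Stanford
    # word/TAG output format. Returns a sparse feature-count vector.
    feats = Counter()
    for pair in tagged_sentence.split():
        # rpartition handles tokens like "./." where the word itself
        # contains no letters.
        word, _, tag = pair.rpartition("/")
        feats[f"{word.lower()}/{tag}"] += 1  # lexicalized feature
        feats[f"POS={tag}"] += 1             # tag-only backoff feature
    return feats
```

Feature selection can then be applied on top of these counts, e.g. keeping only features above a document-frequency threshold or ranking them by chi-squared score against the class labels.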