text-classification

Lexicon dictionary for synonym words

萝らか妹 submitted on 2019-12-06 04:28:09
Question: There are a few dictionaries available for natural language processing, such as dictionaries of positive and negative words. Is there a dictionary that lists the synonyms of every dictionary word? For example, synonyms for "nice": enjoyable, pleasant, pleasurable, agreeable, delightful, satisfying, gratifying, acceptable, to one's liking, entertaining, amusing, diverting, marvellous, good.
Answer 1: Although WordNet is a good resource to start with for finding synonyms, one must note its limitations,

text classifier with bag of words and additional sentiment feature in sklearn

。_饼干妹妹 submitted on 2019-12-06 02:16:57
I am trying to build a classifier that, in addition to bag of words, uses features such as the sentiment or a topic (LDA result). I have a pandas DataFrame with the text and the label, and I would like to add a sentiment value (a number between -5 and 5) and the result of the LDA analysis (a string with the topic of the sentence). I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with multinomial naive Bayes (MultinomialNB).

    df = pd.DataFrame.from_records(data=data, columns=names)
    train, test = train_test_split(df, train_size=train_ratio, random_state=1337)
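
One way to combine these feature types is scikit-learn's ColumnTransformer. The sketch below is only an illustration under stated assumptions: the toy data and the column names "text", "sentiment", "topic" and "label" are hypothetical and do not come from the question. Because MultinomialNB rejects negative feature values, the sentiment score is rescaled to the range 0 to 1 first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "text": ["great movie", "awful plot", "great acting", "awful and boring"],
    "sentiment": [4, -3, 5, -5],
    "topic": ["film", "story", "film", "story"],
    "label": ["pos", "neg", "pos", "neg"],
})

features = ColumnTransformer([
    # A scalar column name hands CountVectorizer the 1-D text column it expects.
    ("bow", CountVectorizer(), "text"),
    # MultinomialNB needs non-negative inputs, so map the score into [0, 1];
    # clip=True keeps unseen scores outside the training range non-negative.
    ("sentiment", MinMaxScaler(clip=True), ["sentiment"]),
    # The LDA topic is a string, so encode it as one-hot indicator columns.
    ("topic", OneHotEncoder(handle_unknown="ignore"), ["topic"]),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])

feature_columns = ["text", "sentiment", "topic"]
model.fit(df[feature_columns], df["label"])
predictions = model.predict(df[feature_columns])
```

The same pipeline object can be fitted on the train split and evaluated on the test split, so the vocabulary and the scaling are learned from training rows only.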

SMOTE oversampling and cross-validation

旧巷老猫 submitted on 2019-12-05 09:04:47
I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories, and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results, with F1 around 90%. Is this due to oversampling? Is it bad practice to perform cross-validation on data to which SMOTE has been applied? Are there any ways to solve this problem? I think you should split
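
Oversampling before cross-validation lets synthetic points derived from held-out samples leak into the training folds, which tends to inflate the scores. The sketch below shows the ordering that avoids this: split first, then oversample only the training part of each fold. It is plain Python rather than Weka, the helper names stratified_folds and smote_like are hypothetical, and the interpolation is a simplification of SMOTE (it pairs random minority points instead of nearest neighbours).

```python
import random


def stratified_folds(y, k, rng):
    """Split indices into k folds while keeping the class ratio in each fold."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def smote_like(minority, n_new, rng):
    """Create n_new synthetic points on segments between random minority pairs.

    Simplification: real SMOTE interpolates toward one of the k nearest
    minority neighbours instead of a randomly chosen minority point.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([p + t * (q - p) for p, q in zip(a, b)])
    return synthetic


rng = random.Random(0)
# Toy data with the 90% / 10% imbalance described in the question.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

fold_summaries = []
for test_idx in stratified_folds(y, 5, rng):
    held_out = set(test_idx)
    train_X = [X[i] for i in range(len(X)) if i not in held_out]
    train_y = [y[i] for i in range(len(y)) if i not in held_out]

    # Oversample the training part only; the test fold keeps its real ratio.
    minority = [x for x, label in zip(train_X, train_y) if label == 1]
    n_new = train_y.count(0) - len(minority)
    train_X += smote_like(minority, n_new, rng)
    train_y += [1] * n_new

    test_y = [y[i] for i in test_idx]
    # A classifier would be trained on train_X / train_y and scored on the test fold here.
    fold_summaries.append(
        (train_y.count(0), train_y.count(1), test_y.count(0), test_y.count(1))
    )
```

Each training fold ends up balanced while every test fold keeps the original 9:1 ratio, so the reported F1 reflects performance on data that was never used to generate synthetic samples.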

Simple text classification using naive bayes (weka) in java

穿精又带淫゛_ submitted on 2019-12-05 07:32:22
I am trying to do text classification with naive Bayes using the Weka library in my Java code, but I think the classification result is not correct, and I don't know what the problem is. I use ARFF files for the input. This is my training data:

    @relation hamspam
    @attribute text string
    @attribute class {spam,ham}
    @data
    'good',ham
    'good',ham
    'very good',ham
    'bad',spam
    'very bad',spam
    'very bad, very bad',spam
    'good good bad',ham

This is my testing data:

    @relation test
    @attribute text string
    @attribute class {spam,ham}
    @data
    'good bad very bad',?
    'good bad very bad',?
    'good',?
    'good very good',?
    'bad',?
    'very good'

FastText using pre-trained word vector for text classification

只谈情不闲聊 submitted on 2019-12-05 03:08:51
I am working on a text classification problem: given some text, I need to assign certain given labels to it. I have tried using the fastText library by Facebook, which has two utilities of interest to me: A) word vectors with pre-trained models, and B) text classification utilities. However, these seem to be completely independent tools, as I have been unable to find any tutorials that merge the two. What I want is to classify some text by taking advantage of the pre-trained word-vector models. Is there any way to do this? FastText's native classification

How to use spark Naive Bayes classifier for text classification with IDF?

喜夏-厌秋 submitted on 2019-12-04 23:48:29
Question: I want to convert text documents into feature vectors using tf-idf, and then train a naive Bayes algorithm to classify them. I can easily load my text files without the labels, use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that, I lose the labels, and it seems to be impossible to recombine the labels with the vectors even though the order is the same. On the other hand, I can call HashingTF() on each

Why do Tensorflow tf.learn classification results vary a lot?

痞子三分冷 submitted on 2019-12-04 20:49:36
I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (I actually need multi-label classification, but at the moment I check every label separately). My code is very similar to the tf.learn tutorial:

    classifier = tf.contrib.learn.DNNClassifier(
        hidden_units=[10],
        n_classes=2,
        dropout=0.1,
        feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
    classifier.fit(x=training_set.data, y=training_set.target, steps=100)
    val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation

How to show topics of reuters dataset in Keras?

被刻印的时光 ゝ submitted on 2019-12-04 18:55:10
Question: I use the Reuters dataset in Keras, and I want to know the names of its 46 topics. How can I show the topics of the Reuters dataset in Keras? https://keras.io/datasets/#reuters-newswire-topics-classification
Answer 1: The mapping between the topic labels of the original Reuters dataset and the topic indexes in the Keras version is:

    ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
     'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
     'cpi','money-fx','interest'

Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

你。 submitted on 2019-12-04 16:31:47
I am implementing different classifiers using different machine learning algorithms. I am classifying text files, and I do it as follows:

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('TFIDF', TfidfTransformer()),
        ('clf', OneVsRestClassifier(GaussianNB()))])
    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)

When I use the GaussianNB algorithm, the following error occurs:

    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I saw the following post here. In this post a class is created to perform the
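
A common workaround, which is not necessarily the one used in the post the question refers to, is to add a pipeline step that converts the sparse tf-idf matrix to a dense array before GaussianNB sees it. The sketch below assumes scikit-learn's FunctionTransformer; the helper to_dense and the toy documents and labels are hypothetical. Dense conversion can use a lot of memory with a large vocabulary, so this suits small data sets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer


def to_dense(matrix):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return matrix.toarray()


# Hypothetical toy documents, each with one or more labels.
X_train = [
    "cheap pills online",
    "meeting agenda attached",
    "cheap meeting room",
    "pills and vitamins online",
]
labels = [["spam"], ["work"], ["spam", "work"], ["spam"]]
X_test = ["cheap pills", "agenda for the meeting"]

# Encode the label sets as a binary indicator matrix for multilabel training.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("TFIDF", TfidfTransformer()),
    # accept_sparse=True lets the sparse matrix reach to_dense unchanged.
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("clf", OneVsRestClassifier(GaussianNB())),
])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
```

Calling binarizer.inverse_transform(predicted) turns the indicator rows back into label tuples. If dense data is not a hard requirement, swapping GaussianNB for MultinomialNB avoids the conversion, since MultinomialNB accepts sparse input.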

Naive-bayes multinomial text classifier using Data frame in Scala Spark

让人想犯罪 __ submitted on 2019-12-04 12:29:55
I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (multinomial label):

    |label|             feature|
    +-----+--------------------+
    |    1|combusting prepar...|
    |    1|adhesives for ind...|
    |    1|                    |
    |    1| salt for preserving|
    |    1|auxiliary fluids ...|

I have used the following transformations for tokenization, stop-word removal, n-grams, and HashingTF:

    val selectedData = df.select("label", "feature")
    // Tokenize RDD
    val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
    val regexTokenizer = new RegexTokenizer()