text-classification | 易学教程

Lexicon dictionary for synonym words

Question: There are a few dictionaries available for natural language processing, such as dictionaries of positive and negative words. Is there a dictionary that contains a list of synonyms for every dictionary word? For example, synonyms for "nice": enjoyable, pleasant, pleasurable, agreeable, delightful, satisfying, gratifying, acceptable, to one's liking, entertaining, amusing, diverting, marvellous, good.

Answer 1: Although WordNet is a good resource to start with when looking for synonyms, one must note its limitations,

text classifier with bag of words and additional sentiment feature in sklearn

I am trying to build a classifier that, in addition to bag of words, uses features such as the sentiment or a topic (an LDA result). I have a pandas DataFrame with the text and the label, and I would like to add a sentiment value (a number between -5 and 5) and the result of the LDA analysis (a string with the topic of the sentence). I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with multinomial naive Bayes (MultinomialNB).

    df = pd.DataFrame.from_records(data=data, columns=names)
    train, test = train_test_split(
        df,
        train_size=train_ratio,
        random_state=1337
    )
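One possible way to combine the three kinds of features is scikit-learn's ColumnTransformer, sketched below. The toy data, the column names (`text`, `sentiment`, `topic`, `label`) and the helper `shift_sentiment` are assumptions made for illustration and do not come from the question. Because MultinomialNB rejects negative feature values, the sketch shifts the sentiment score from the stated range [-5, 5] to [0, 10].

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def shift_sentiment(values):
    # Hypothetical helper: MultinomialNB needs non-negative inputs,
    # so map the assumed range [-5, 5] onto [0, 10].
    return values + 5


# Toy data with assumed column names.
df = pd.DataFrame({
    "text": ["great movie", "awful plot", "great cast", "awful acting"],
    "sentiment": [4, -3, 5, -4],
    "topic": ["film", "film", "actors", "actors"],
    "label": ["pos", "neg", "pos", "neg"],
})

features = ColumnTransformer([
    # A single column name (not a list) hands CountVectorizer a 1-D series of strings.
    ("bow", CountVectorizer(), "text"),
    # A list of column names hands the transformer a 2-D frame.
    ("sentiment", FunctionTransformer(shift_sentiment), ["sentiment"]),
    ("topic", OneHotEncoder(handle_unknown="ignore"), ["topic"]),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(df[["text", "sentiment", "topic"]], df["label"])

new_rows = pd.DataFrame(
    {"text": ["great acting"], "sentiment": [3], "topic": ["actors"]}
)
prediction = model.predict(new_rows)
```

Note that a multinomial model treats the shifted sentiment value like a word count, so it can outweigh the actual words. Whether that is acceptable, or whether a different classifier or scaling suits the data better, is a judgment call this sketch does not settle.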

SMOTE oversampling and cross-validation

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories, and then performed 10-fold cross-validation on the resulting data. I found (overly?) optimistic results, with F1 around 90%. Is this due to oversampling? Is it bad practice to perform cross-validation on data to which SMOTE has been applied? Are there any ways to solve this problem?

I think you should split
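Oversampling before splitting lets synthetic points derived from a test instance leak into the training folds, which can inflate the scores. A common remedy is to oversample inside each fold, on the training part only. The sketch below illustrates that idea in plain Python rather than Weka. The names `smote_like` and `cross_validation_splits` are hypothetical, the toy data is invented, and the interpolation is a simplified SMOTE-style procedure, not Weka's filter.

```python
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


def stratified_folds(y, n_folds, rng):
    """Deal the indices of each class round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def cross_validation_splits(X, y, n_folds=5, minority_label=1, seed=0):
    """Yield (train_X, train_y, test_X, test_y); only the training part is
    oversampled, so every test fold contains original points only."""
    rng = random.Random(seed)
    for test_idx in stratified_folds(y, n_folds, rng):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        minority = [x for x, lab in zip(train_X, train_y) if lab == minority_label]
        # Number of synthetic points needed to bring both classes to parity.
        n_new = max(len(train_X) - 2 * len(minority), 0)
        new_points = smote_like(minority, n_new, rng=rng)
        yield (
            train_X + new_points,
            train_y + [minority_label] * n_new,
            [X[i] for i in test_idx],
            [y[i] for i in test_idx],
        )


# Invented toy data with the 90% / 10% imbalance described in the question.
X = [(float(i), float(i % 7)) for i in range(100)]
y = [1 if i % 10 == 0 else 0 for i in range(100)]
splits = list(cross_validation_splits(X, y))
```

Each split can then be used to train on the balanced training part and score on the untouched test part. Metrics computed this way reflect the original class distribution.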

Simple text classification using naive bayes (weka) in java

I am trying to do text classification with naive Bayes using the Weka library in my Java code, but I think the classification result is not correct and I don't know what the problem is. I use an ARFF file as input. This is my training data:

    @relation hamspam
    @attribute text string
    @attribute class {spam,ham}
    @data
    'good',ham
    'good',ham
    'very good',ham
    'bad',spam
    'very bad',spam
    'very bad, very bad',spam
    'good good bad',ham

This is my testing data:

    @relation test
    @attribute text string
    @attribute class {spam,ham}
    @data
    'good bad very bad',?
    'good bad very bad',?
    'good',?
    'good very good',?
    'bad',?
    'very good'
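To judge whether a classifier's output on this data is plausible, it can help to work out by hand what a textbook multinomial naive Bayes with add-one (Laplace) smoothing would predict. The sketch below does that in plain Python for the training rows and the visible test strings. It is an independent reference calculation with hypothetical function names, not Weka's implementation, so Weka's NaiveBayes (which depends on how the string attribute is converted to features) may legitimately give different results.

```python
import math
import re
from collections import Counter, defaultdict

# Training rows copied from the ARFF data in the question.
TRAIN = [
    ("good", "ham"), ("good", "ham"), ("very good", "ham"),
    ("bad", "spam"), ("very bad", "spam"), ("very bad, very bad", "spam"),
    ("good good bad", "ham"),
]


def tokenize(text):
    # Split on word characters, so the comma in 'very bad, very bad' is ignored.
    return re.findall(r"\w+", text.lower())


def train(examples):
    doc_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothed log likelihood of the word given the class.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
tests = ["good bad very bad", "good", "good very good", "bad", "very good"]
predictions = {text: classify(text, model) for text in tests}
# predictions: 'good bad very bad' -> spam, 'good' -> ham,
# 'good very good' -> ham, 'bad' -> spam, 'very good' -> ham
```

With only three distinct words, the smoothed likelihoods are easy to verify: P(good|ham) = 6/10 and P(bad|spam) = 5/10, with priors of 4/7 for ham and 3/7 for spam.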

FastText using pre-trained word vector for text classification

I am working on a text classification problem: given some text, I need to assign certain given labels to it. I have tried the fastText library by Facebook, which has two utilities of interest to me:

A) Word vectors with pre-trained models
B) Text classification utilities

However, these seem to be completely independent tools, as I have been unable to find any tutorials that combine the two. What I want is to classify some text while taking advantage of the pre-trained word-vector models. Is there any way to do this?

FastText's native classification

How to use spark Naive Bayes classifier for text classification with IDF?

Question: I want to convert text documents into feature vectors using tf-idf, and then train a naive Bayes algorithm to classify them. I can easily load my text files without the labels, use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that, I lose the labels, and it seems to be impossible to recombine the labels with the vectors even though the order is the same. On the other hand, I can call HashingTF() on each

Why do Tensorflow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (I actually need multi-label classification, but at the moment I check every label separately). My code is very similar to the tf.learn tutorial:

    classifier = tf.contrib.learn.DNNClassifier(
        hidden_units=[10],
        n_classes=2,
        dropout=0.1,
        feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
    classifier.fit(x=training_set.data, y=training_set.target, steps=100)
    val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation

How to show topics of reuters dataset in Keras?

Question: I use the Reuters dataset in Keras and I want to know the names of the 46 topics. How can I show the topics of the Reuters dataset in Keras? https://keras.io/datasets/#reuters-newswire-topics-classification

Answer 1: The mapping between the topic labels of the original Reuters dataset and the topic indexes in the Keras version is:

    ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
     'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
     'cpi','money-fx','interest'

Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

I am implementing different classifiers using different machine learning algorithms. I am classifying text files, and do the following:

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('TFIDF', TfidfTransformer()),
        ('clf', OneVsRestClassifier(GaussianNB()))])
    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)

When I use GaussianNB, the following error occurs:

    TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I saw the following post here. In this post a class is created to perform the
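The error arises because the vectorizer steps emit a SciPy sparse matrix while GaussianNB expects a dense array. One possible workaround, sketched below, inserts a densifying step built with scikit-learn's FunctionTransformer. This is an alternative illustration and not necessarily the class the referenced post defines. The step name `to_dense`, the helper function, and the toy documents and labels are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer


def to_dense(matrix):
    # Hypothetical helper: convert the sparse tf-idf matrix to a dense array.
    return matrix.toarray()


# Invented toy documents with multiple labels per document.
X_train = [
    "cheap flights to paris",
    "paris hotel deals",
    "python list sorting",
    "sorting arrays in python",
]
labels = [["travel"], ["travel", "deals"], ["code"], ["code"]]
X_test = ["python sorting tips"]

# Turn the label lists into a binary indicator matrix, one column per label.
Y = MultiLabelBinarizer().fit_transform(labels)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("clf", OneVsRestClassifier(GaussianNB())),
])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)  # one row per document, one column per label
```

Densifying a large vocabulary can use a lot of memory. For word-count or tf-idf features, MultinomialNB accepts sparse input directly and may be the more practical choice.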

Naive-bayes multinomial text classifier using Data frame in Scala Spark

I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (multinomial label):

    +-----+--------------------+
    |label|             feature|
    +-----+--------------------+
    |    1|combusting prepar...|
    |    1|adhesives for ind...|
    |    1|                    |
    |    1| salt for preserving|
    |    1|auxiliary fluids ...|

I have used the following transformations for tokenization, stop-word removal, n-grams, and HashingTF:

    val selectedData = df.select("label", "feature")
    // Tokenize RDD
    val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
    val regexTokenizer = new RegexTokenizer()