text-classification

Lexicon dictionary for synonym words

别说谁变了你拦得住时间么 · submitted on 2019-12-04 09:10:12
There are a few dictionaries available for natural language processing, like dictionaries of positive and negative words. Is there any dictionary that contains a list of synonyms for every dictionary word? For example, synonyms for nice: enjoyable, pleasant, pleasurable, agreeable, delightful, satisfying, gratifying, acceptable, to one's liking, entertaining, amusing, diverting, marvellous, good. alvas: Although WordNet is a good resource to start with for finding synonyms, one must note its limitations; here's an example with the Python API in the NLTK library. Firstly, words have multiple meanings (i.e. senses)
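The sense distinction the answer raises can be made concrete with a toy lexicon that maps a word to its senses, each carrying its own synonym list. The data below is hypothetical; a real system would query WordNet through `nltk.corpus.wordnet` instead.

```python
# Toy synonym lexicon illustrating why sense matters: each entry maps a
# word to its senses, and each sense carries its own synonym list.
# (Hypothetical data -- a real lookup would use nltk.corpus.wordnet.)
LEXICON = {
    "nice": {
        "pleasant": ["enjoyable", "pleasant", "agreeable", "delightful"],
        "precise": ["fine", "subtle", "exact"],  # as in "a nice distinction"
    },
}

def synonyms(word, sense=None):
    """Return synonyms for one sense, or the union over all senses."""
    senses = LEXICON.get(word, {})
    if sense is not None:
        return senses.get(sense, [])
    return sorted({syn for syns in senses.values() for syn in syns})

print(synonyms("nice", "precise"))  # ['fine', 'subtle', 'exact']
```

Conflating senses (the union above) is exactly the failure mode the answer warns about: "fine" is not a synonym of "nice" in the "pleasant" sense.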

UserWarning: Label not :NUMBER: is present in all training examples

偶尔善良 · submitted on 2019-12-04 03:58:34
I am doing multilabel classification, where I try to predict the correct labels for each document, and here is my code:

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, stop_words='english', max_df=0.8, min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
predicted = cross_val_predict(classifier, X, y)

When running my code I get multiple warnings: UserWarning: Label not :NUMBER: is present in all training examples. When I print out
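The warning appears when a label is so rare that some cross-validation folds contain no positive example of it. One pragmatic remedy (an assumption, not the only possible fix) is to drop tags below a frequency threshold before binarizing:

```python
from collections import Counter

# Drop tags that occur fewer than MIN_COUNT times, so every label kept
# has a chance of appearing in each CV fold. Data is illustrative.
tags = [["python", "ml"], ["python"], ["rare-tag", "ml"], ["python", "ml"]]

counts = Counter(t for doc_tags in tags for t in doc_tags)
MIN_COUNT = 2  # threshold chosen for illustration
filtered = [[t for t in doc_tags if counts[t] >= MIN_COUNT]
            for doc_tags in tags]

print(filtered)  # 'rare-tag' is gone; MultiLabelBinarizer can now fit safely
```

The filtered tag lists are then passed to `MultiLabelBinarizer().fit_transform` as in the question's code.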

How to use spark Naive Bayes classifier for text classification with IDF?

谁说胖子不能爱 · submitted on 2019-12-03 15:35:34
I want to convert text documents into feature vectors using tf-idf, and then train a Naive Bayes algorithm to classify them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that I get rid of the labels, and it seems to be impossible to recombine the labels with the vectors even though the order is the same. On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on it since it requires the whole
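The key observation is that HashingTF and IDF preserve row order, so the labels can simply be zipped back on afterwards (in PySpark this is the `labels_rdd.zip(tfidf_rdd)` pattern for building label/vector pairs). The sketch below mimics that idea in plain Python with a toy hashed term-frequency function; it is an illustration of the order-preserving trick, not actual Spark code.

```python
# Order-preserving transforms mean labels can be zipped back afterwards.
labels = [1, 0, 1]
texts = ["spark naive bayes", "unrelated text", "spark classifier"]

def hashing_tf(tokens, n_features=16):
    """Toy hashed term-frequency vector (stand-in for Spark's HashingTF)."""
    vec = [0] * n_features
    for tok in tokens:
        vec[hash(tok) % n_features] += 1
    return vec

vectors = [hashing_tf(t.split()) for t in texts]  # same order as `texts`
labeled = list(zip(labels, vectors))              # labels re-attached

print(len(labeled))  # 3 label/vector pairs, alignment intact
```

In Spark the same alignment guarantee lets you strip labels, run the whole-corpus IDF pass, and then zip the label RDD back to the tf-idf RDD.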

How to show topics of reuters dataset in Keras?

℡╲_俬逩灬. · submitted on 2019-12-03 12:37:32
I use the Reuters dataset in Keras, and I want to know the names of the 46 topics. How can I show the topics of the Reuters dataset in Keras? https://keras.io/datasets/#reuters-newswire-topics-classification The mapping of topic labels from the original Reuters dataset to the topic indexes in the Keras version is: ['cocoa', 'grain', 'veg-oil', 'earn', 'acq', 'wheat', 'copper', 'housing', 'money-supply', 'coffee', 'sugar', 'trade', 'reserves', 'ship', 'cotton', 'carcass', 'crude', 'nat-gas', 'cpi', 'money-fx', 'interest', 'gnp', 'meal-feed', 'alum', 'oilseed', 'gold', 'tin', 'strategic-metal', 'livestock', 'retail', 'ipi', 'iron-steel',
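With that list in hand, turning a Keras label index back into a topic name is a plain list lookup. Only the first entries quoted above are reproduced here:

```python
# Mapping from Keras label index to topic name (first entries reproduced
# from the list quoted in the answer above; the full list has 46 topics).
TOPICS = ['cocoa', 'grain', 'veg-oil', 'earn', 'acq', 'wheat', 'copper',
          'housing', 'money-supply', 'coffee', 'sugar', 'trade']

def topic_name(label_index):
    return TOPICS[label_index]

print(topic_name(3))  # 'earn'
```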

How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?

一曲冷凌霜 · submitted on 2019-12-03 10:32:14
I am familiar with using BOW features for text classification: we first find the size of the vocabulary for the corpus, which becomes the size of our feature vector; then, for each sentence/document and for each of its constituent words, we put 0/1 depending on the absence/presence of that word in that sentence/document. However, now that I am trying to use a vector representation of each word, is creating a global vocabulary still essential? Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing the traditional BOW is just replacing the 0 bit
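A common baseline that avoids a vocabulary-sized feature vector entirely is to average the vectors of a document's words, giving a fixed N-dimensional feature for any classifier. The embeddings below are random stand-ins for Word2Vec output:

```python
import numpy as np

# Average the word vectors of a document to get one fixed-size feature
# vector; unseen words are skipped, and an empty document maps to zeros.
rng = np.random.default_rng(0)
N = 50  # embedding dimension, illustrative
embeddings = {w: rng.standard_normal(N) for w in ["good", "movie", "bad"]}

def doc_vector(tokens, embeddings, dim):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

features = doc_vector("good movie".split(), embeddings, N)
print(features.shape)  # (50,)
```

Averaging discards word order, just as BOW does, but keeps the semantic similarity structure of the embedding space.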

python textblob and text classification

醉酒当歌 · submitted on 2019-12-03 09:00:38
I'm trying to build a text classification model with Python and TextBlob. The script is running on my server, and in the future the idea is that users will be able to submit their text and it will be classified. I'm loading the training set from CSV:

# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = open('yyyyyyyyy.txt', "w")
from nltk.tokenize import word_tokenize
from textblob.classifiers import NaiveBayesClassifier

with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")
print(cl.classify("some text"))

The CSV is about 500 lines long (with
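TextBlob's `NaiveBayesClassifier(fp, format="csv")` expects rows of (text, label). A quick way to sanity-check the file before handing it to the classifier is to parse it with the standard-library csv module first (the sample rows here are made up):

```python
import csv
import io

# Parse (text, label) rows the way TextBlob's CSV format expects them.
# io.StringIO stands in for an open file handle.
raw = io.StringIO("I love this product,pos\nthis is terrible,neg\n")

train = [(text, label) for text, label in csv.reader(raw)]
print(train[0])  # ('I love this product', 'pos')
```

If this unpacking fails on some row, that row has the wrong number of columns and will also confuse the classifier.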

Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?

£可爱£侵袭症+ · submitted on 2019-12-03 08:40:45
I am doing multi-label classification where I am trying to predict the correct tags for questions (X = questions, y = list of tags for each question from X). I am wondering which decision_function_shape for sklearn.svm.SVC should be used with OneVsRestClassifier. From the docs we can read that decision_function_shape can have two values, 'ovo' and 'ovr':

decision_function_shape : 'ovo', 'ovr' or None, default=None
Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one ('ovo') decision function of libsvm which
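Note that under OneVsRestClassifier each wrapped SVC only ever sees a binary problem, so the 'ovo'/'ovr' choice has no practical effect there; the wrapper itself produces one score column per class. A small sketch (synthetic data, parameter values illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Each inner SVC is binary, so decision_function_shape is moot; the
# wrapper's decision_function is (n_samples, n_classes) regardless.
X, y = make_classification(n_samples=60, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
clf = OneVsRestClassifier(SVC(decision_function_shape="ovr")).fit(X, y)
scores = clf.decision_function(X)
print(scores.shape)  # (60, 3): one column per class
```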

How do I properly combine numerical features with text (bag of words) in scikit-learn?

让人想犯罪 __ · submitted on 2019-12-03 06:52:58
I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1]
]
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
bag_of_words
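The usual answer is to keep the bag-of-words matrix sparse and glue the numerical columns onto it with scipy.sparse.hstack, so both feature groups feed one classifier. Continuing from the data above:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the corpus, then append the numerical columns to the sparse
# bag-of-words matrix column-wise.
numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
bag_of_words = CountVectorizer().fit_transform(corpus)
X = hstack([bag_of_words, numerical_features])
print(X.shape)  # 4 rows, vocabulary size + 2 columns
```

For a pipeline-friendly version, scikit-learn's FeatureUnion (or, in newer releases, ColumnTransformer) achieves the same combination.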

Multilabel Text Classification using TensorFlow

痞子三分冷 · submitted on 2019-12-03 00:53:41
Question: The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ...., 0]; the i-th element indicates the frequency of the i-th word in a text. The ground-truth label data is also represented as a vector, with 4,000 elements, like [0, 0, 1, 0, 1, ...., 0]; the i-th element indicates whether the i-th label is positive for a text. The number of labels for a text differs depending on the text. I have code for single-label text classification. How can I edit the following code for
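Since the single-label code is not shown, the usual change is an assumption: replace softmax plus categorical cross-entropy with an independent sigmoid per label and a summed binary cross-entropy, the quantity TensorFlow computes in tf.nn.sigmoid_cross_entropy_with_logits. The loss is sketched here in NumPy:

```python
import numpy as np

# Multi-label loss: one independent sigmoid per label, binary
# cross-entropy averaged over all label positions.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_loss(logits, targets):
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

logits = np.array([[2.0, -3.0, 0.5]])   # network scores for 3 labels
targets = np.array([[1.0, 0.0, 1.0]])   # two positive labels at once
print(multilabel_loss(logits, targets))
```

At prediction time, each output is thresholded independently (e.g. at 0.5), so any number of labels can fire per text, matching the variable label counts in the question.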