text-classification

Lexicon dictionary for synonym words

别说谁变了你拦得住时间么 · submitted on 2019-12-04 09:10:12
There are a few dictionaries available for natural language processing, like dictionaries of positive and negative words. Is there any dictionary that contains a list of synonyms for every dictionary word? For example, synonyms for nice: enjoyable, pleasant, pleasurable, agreeable, delightful, satisfying, gratifying, acceptable, to one's liking, entertaining, amusing, diverting, marvellous, good. alvas: Although WordNet is a good resource to start with for finding synonyms, one must note its limitations; here's an example with the Python API in the NLTK library. Firstly, words have multiple meanings (i.e. senses)
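The sense distinction the answer raises can be made concrete with a toy lexicon that maps a word to its senses, each carrying its own synonym list. The data below is hypothetical; a real system would query WordNet through `nltk.corpus.wordnet` instead.

```python
# Toy synonym lexicon illustrating why sense matters: each entry maps a
# word to its senses, and each sense carries its own synonym list.
# (Hypothetical data -- a real lookup would use nltk.corpus.wordnet.)
LEXICON = {
    "nice": {
        "pleasant": ["enjoyable", "pleasant", "agreeable", "delightful"],
        "precise": ["fine", "subtle", "exact"],  # as in "a nice distinction"
    },
}

def synonyms(word, sense=None):
    """Return synonyms for one sense, or the union over all senses."""
    senses = LEXICON.get(word, {})
    if sense is not None:
        return senses.get(sense, [])
    return sorted({syn for syns in senses.values() for syn in syns})

print(synonyms("nice", "precise"))  # ['fine', 'subtle', 'exact']
```

Conflating senses (the union above) is exactly the failure mode the answer warns about: "fine" is not a synonym of "nice" in the "pleasant" sense.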

UserWarning: Label not :NUMBER: is present in all training examples

偶尔善良 · submitted on 2019-12-04 03:58:34
I am doing multilabel classification, where I try to predict the correct labels for each document, and here is my code:

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, stop_words='english', max_df=0.8, min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
predicted = cross_val_predict(classifier, X, y)

When running my code I get multiple warnings: UserWarning: Label not :NUMBER: is present in all training examples. When I print out
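The warning appears when a label is so rare that some cross-validation folds contain no positive example of it. One pragmatic remedy (an assumption, not the only possible fix) is to drop tags below a frequency threshold before binarizing:

```python
from collections import Counter

# Drop tags that occur fewer than MIN_COUNT times, so every label kept
# has a chance of appearing in each CV fold. Data is illustrative.
tags = [["python", "ml"], ["python"], ["rare-tag", "ml"], ["python", "ml"]]

counts = Counter(t for doc_tags in tags for t in doc_tags)
MIN_COUNT = 2  # threshold chosen for illustration
filtered = [[t for t in doc_tags if counts[t] >= MIN_COUNT]
            for doc_tags in tags]

print(filtered)  # 'rare-tag' is gone; MultiLabelBinarizer can now fit safely
```

The filtered tag lists are then passed to `MultiLabelBinarizer().fit_transform` as in the question's code.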

How to use spark Naive Bayes classifier for text classification with IDF?

谁说胖子不能爱 · submitted on 2019-12-03 15:35:34
I want to convert text documents into feature vectors using tf-idf, and then train a Naive Bayes algorithm to classify them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to weight the words according to how important they are. But if I do that I get rid of the labels, and it seems to be impossible to recombine the labels with the vectors even though the order is the same. On the other hand, I can call HashingTF() on each individual document and keep the labels, but then I can't call IDF() on it since it requires the whole
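The key observation is that HashingTF and IDF preserve row order, so the labels can simply be zipped back on afterwards (in PySpark this is the `labels_rdd.zip(tfidf_rdd)` pattern for building label/vector pairs). The sketch below mimics that idea in plain Python with a toy hashed term-frequency function; it is an illustration of the order-preserving trick, not actual Spark code.

```python
# Order-preserving transforms mean labels can be zipped back afterwards.
labels = [1, 0, 1]
texts = ["spark naive bayes", "unrelated text", "spark classifier"]

def hashing_tf(tokens, n_features=16):
    """Toy hashed term-frequency vector (stand-in for Spark's HashingTF)."""
    vec = [0] * n_features
    for tok in tokens:
        vec[hash(tok) % n_features] += 1
    return vec

vectors = [hashing_tf(t.split()) for t in texts]  # same order as `texts`
labeled = list(zip(labels, vectors))              # labels re-attached

print(len(labeled))  # 3 label/vector pairs, alignment intact
```

In Spark the same alignment guarantee lets you strip labels, run the whole-corpus IDF pass, and then zip the label RDD back to the tf-idf RDD.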

How to show topics of reuters dataset in Keras?

℡╲_俬逩灬. · submitted on 2019-12-03 12:37:32
I use the Reuters dataset in Keras, and I want to know the names of the 46 topics. How can I show the topics of the Reuters dataset in Keras? https://keras.io/datasets/#reuters-newswire-topics-classification The mapping of topic labels from the original Reuters dataset to the topic indexes in the Keras version is: ['cocoa', 'grain', 'veg-oil', 'earn', 'acq', 'wheat', 'copper', 'housing', 'money-supply', 'coffee', 'sugar', 'trade', 'reserves', 'ship', 'cotton', 'carcass', 'crude', 'nat-gas', 'cpi', 'money-fx', 'interest', 'gnp', 'meal-feed', 'alum', 'oilseed', 'gold', 'tin', 'strategic-metal', 'livestock', 'retail', 'ipi', 'iron-steel',
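With that list in hand, turning a Keras label index back into a topic name is a plain list lookup. Only the first entries quoted above are reproduced here:

```python
# Mapping from Keras label index to topic name (first entries reproduced
# from the list quoted in the answer above; the full list has 46 topics).
TOPICS = ['cocoa', 'grain', 'veg-oil', 'earn', 'acq', 'wheat', 'copper',
          'housing', 'money-supply', 'coffee', 'sugar', 'trade']

def topic_name(label_index):
    return TOPICS[label_index]

print(topic_name(3))  # 'earn'
```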

How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?

一曲冷凌霜 · submitted on 2019-12-03 10:32:14
I am familiar with using BOW features for text classification: we first find the size of the vocabulary for the corpus, which becomes the size of our feature vector; then, for each sentence/document and for each of its constituent words, we put 0/1 depending on the absence/presence of that word in that sentence/document. However, now that I am trying to use a vector representation of each word, is creating a global vocabulary still essential? Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing the traditional BOW is just replacing the 0 bit
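A common baseline that avoids a vocabulary-sized feature vector entirely is to average the vectors of a document's words, giving a fixed N-dimensional feature for any classifier. The embeddings below are random stand-ins for Word2Vec output:

```python
import numpy as np

# Average the word vectors of a document to get one fixed-size feature
# vector; unseen words are skipped, and an empty document maps to zeros.
rng = np.random.default_rng(0)
N = 50  # embedding dimension, illustrative
embeddings = {w: rng.standard_normal(N) for w in ["good", "movie", "bad"]}

def doc_vector(tokens, embeddings, dim):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

features = doc_vector("good movie".split(), embeddings, N)
print(features.shape)  # (50,)
```

Averaging discards word order, just as BOW does, but keeps the semantic similarity structure of the embedding space.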

python textblob and text classification

醉酒当歌 · submitted on 2019-12-03 09:00:38
I'm trying to build a text classification model with Python and TextBlob. The script is running on my server, and in the future the idea is that users will be able to submit their text and it will be classified. I'm loading the training set from CSV:

# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = open('yyyyyyyyy.txt', "w")
from nltk.tokenize import word_tokenize
from textblob.classifiers import NaiveBayesClassifier

with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")
print(cl.classify("some text"))

The CSV is about 500 lines long (with
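TextBlob's `NaiveBayesClassifier(fp, format="csv")` expects rows of (text, label). A quick way to sanity-check the file before handing it to the classifier is to parse it with the standard-library csv module first (the sample rows here are made up):

```python
import csv
import io

# Parse (text, label) rows the way TextBlob's CSV format expects them.
# io.StringIO stands in for an open file handle.
raw = io.StringIO("I love this product,pos\nthis is terrible,neg\n")

train = [(text, label) for text, label in csv.reader(raw)]
print(train[0])  # ('I love this product', 'pos')
```

If this unpacking fails on some row, that row has the wrong number of columns and will also confuse the classifier.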

Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?

£可爱£侵袭症+ · submitted on 2019-12-03 08:40:45
I am doing multi-label classification where I am trying to predict the correct tags for questions (X = questions, y = list of tags for each question from X). I am wondering which decision_function_shape for sklearn.svm.SVC should be used with OneVsRestClassifier. From the docs we can read that decision_function_shape can have two values, 'ovo' and 'ovr':

decision_function_shape : 'ovo', 'ovr' or None, default=None
Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one ('ovo') decision function of libsvm which
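Note that under OneVsRestClassifier each wrapped SVC only ever sees a binary problem, so the 'ovo'/'ovr' choice has no practical effect there; the wrapper itself produces one score column per class. A small sketch (synthetic data, parameter values illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Each inner SVC is binary, so decision_function_shape is moot; the
# wrapper's decision_function is (n_samples, n_classes) regardless.
X, y = make_classification(n_samples=60, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
clf = OneVsRestClassifier(SVC(decision_function_shape="ovr")).fit(X, y)
scores = clf.decision_function(X)
print(scores.shape)  # (60, 3): one column per class
```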

How do I properly combine numerical features with text (bag of words) in scikit-learn?

让人想犯罪 __ · submitted on 2019-12-03 06:52:58
I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1]
]
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
bag_of_words
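The usual answer is to keep the bag-of-words matrix sparse and glue the numerical columns onto it with scipy.sparse.hstack, so both feature groups feed one classifier. Continuing from the data above:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the corpus, then append the numerical columns to the sparse
# bag-of-words matrix column-wise.
numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
bag_of_words = CountVectorizer().fit_transform(corpus)
X = hstack([bag_of_words, numerical_features])
print(X.shape)  # 4 rows, vocabulary size + 2 columns
```

For a pipeline-friendly version, scikit-learn's FeatureUnion (or, in newer releases, ColumnTransformer) achieves the same combination.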

Multilabel Text Classification using TensorFlow

痞子三分冷 · submitted on 2019-12-03 00:53:41
Question: The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ...., 0]; the i-th element indicates the frequency of the i-th word in a text. The ground-truth label data is also represented as a vector, with 4,000 elements, like [0, 0, 1, 0, 1, ...., 0]; the i-th element indicates whether the i-th label is positive for a text. The number of labels for a text differs depending on the text. I have code for single-label text classification. How can I edit the following code for
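Since the single-label code is not shown, the usual change is an assumption: replace softmax plus categorical cross-entropy with an independent sigmoid per label and a summed binary cross-entropy, the quantity TensorFlow computes in tf.nn.sigmoid_cross_entropy_with_logits. The loss is sketched here in NumPy:

```python
import numpy as np

# Multi-label loss: one independent sigmoid per label, binary
# cross-entropy averaged over all label positions.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_loss(logits, targets):
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

logits = np.array([[2.0, -3.0, 0.5]])   # network scores for 3 labels
targets = np.array([[1.0, 0.0, 1.0]])   # two positive labels at once
print(multilabel_loss(logits, targets))
```

At prediction time, each output is thresholded independently (e.g. at 0.5), so any number of labels can fire per text, matching the variable label counts in the question.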