text-classification

McNemar's test in Python and comparison of classification machine learning models [closed]

倾然丶 夕夏残阳落幕 submitted on 2019-12-22 03:46:06
Question: Is there a good implementation of McNemar's test in Python? I don't see one anywhere in scipy.stats or scikit-learn. I may have overlooked some other good packages; please recommend one. McNemar's test is almost THE test for comparing two classification algorithms/models given a holdout test set (not through K-fold or
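For reference, McNemar's test only needs the two discordant counts from the paired predictions: cases where model A is right and model B is wrong, and the reverse. The sketch below is a minimal exact (binomial) version using only the standard library. The function name `mcnemar_exact` and its argument names are illustrative and do not come from any package. statsmodels also ships `statsmodels.stats.contingency_tables.mcnemar`, which may be preferable in practice.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = cases model A got right and model B got wrong,
      c = cases model A got wrong and model B got right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        # The two models never disagree in correctness: no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under H0 the discordant pairs split like Binomial(n, 0.5).
    # Two-sided p-value: twice the lower tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up for illustration): A is always right, B misses five cases.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
pred_a = [1, 1, 1, 1, 1, 0, 0, 0]
pred_b = [0, 0, 0, 0, 0, 0, 0, 0]
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

With larger discordant counts, the chi-square approximation with continuity correction is the more common variant; the exact form is used here because it needs no extra dependencies.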

Naive-bayes multinomial text classifier using Data frame in Scala Spark

余生颓废 submitted on 2019-12-21 20:22:23
Question: I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (multinomial label):

    |label|             feature|
    +-----+--------------------+
    |    1|combusting prepar...|
    |    1|adhesives for ind...|
    |    1|                    |
    |    1| salt for preserving|
    |    1|auxiliary fluids ...|

I have used the following transformations for tokenization, stop-word removal, n-grams, and hashTF:

    val selectedData = df.select("label", "feature")
    // Tokenize RDD
    val tokenizer = new

CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

泄露秘密 submitted on 2019-12-21 04:07:10
Question: I have a one-dimensional array with a large string in each element, and I am trying to use a CountVectorizer to convert the text data into numerical vectors. However, I get this error:

    AttributeError: 'numpy.ndarray' object has no attribute 'lower'

mealarray contains a large string in each element, and there are 5000 such samples. I am trying to vectorize it as shown below:

    vectorizer = CountVectorizer(
        stop_words='english',
        ngram_range=(1, 1), #ngram_range=(1, 1) is the
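One common cause of this error is that the array is not actually one-dimensional: CountVectorizer iterates over its input and lowercases each item, so every item must be a string. The sketch below assumes, as a guess since the excerpt does not show it, that `mealarray` has shape (n, 1). The sample strings are made up, and only NumPy is used, so the vectorizer call itself appears as a comment.

```python
import numpy as np

# Hypothetical stand-in for `mealarray`; the (n, 1) shape is an assumption.
mealarray = np.array([["rice and beans"], ["grilled chicken salad"]])

# Iterating a 2-D array yields 1-D ndarrays, which have no .lower() method.
first_row = mealarray[0]
row_is_string = hasattr(first_row, "lower")

# Flatten to exactly one Python string per sample.
documents = mealarray.ravel().tolist()
all_strings = all(isinstance(doc, str) for doc in documents)

# `documents` is now a flat list of strings, the shape of input that
# CountVectorizer.fit_transform(documents) expects.
```

If the real array holds lists of tokens rather than whole strings, the items would need to be joined into strings instead of flattened.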

python textblob and text classification

☆樱花仙子☆ submitted on 2019-12-21 02:59:07
Question: I'm trying to build a text classification model with Python and textblob. The script is running on my server, and the idea is that in the future users will be able to submit their text and have it classified. I'm loading the training set from a CSV file:

    # -*- coding: utf-8 -*-
    import sys
    import codecs
    sys.stdout = open('yyyyyyyyy.txt',"w");
    from nltk.tokenize import word_tokenize
    from textblob.classifiers import NaiveBayesClassifier
    with open('file.csv', 'r', encoding='latin-1') as fp:
        cl =

Scalable or online out-of-core multi-label classifiers

醉酒当歌 submitted on 2019-12-20 10:49:23
Question: I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels. I have around 4.5 million text documents as training data and around 1 million as test data. There are around 35K labels. I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but not that scalable given the number
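The reason a hashing vectorizer suits out-of-core work is that it keeps no vocabulary: each token is mapped straight to a column index, so documents can be transformed in independent batches. The sketch below illustrates the idea only. It uses CRC32 from the standard library as a stand-in hash, which is not the hash scikit-learn uses, and the name `hash_features` and its whitespace tokenization are illustrative assumptions.

```python
import zlib


def hash_features(text, n_features=2 ** 20):
    """Map a document to a sparse {column_index: count} dict without a vocabulary."""
    vec = {}
    for token in text.lower().split():
        # A stateless hash decides the column, so no fitting pass over the corpus is needed.
        idx = zlib.crc32(token.encode("utf-8")) % n_features
        vec[idx] = vec.get(idx, 0) + 1
    return vec


# Each document is transformed independently, so batches can be streamed from disk.
sample = hash_features("spam spam eggs", n_features=16)
```

Distinct tokens can collide in the same column; a larger `n_features` makes that rarer at the cost of a wider feature space.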

Dimension of shape in conv1D

て烟熏妆下的殇ゞ submitted on 2019-12-17 05:40:17
Question: I have tried to build a CNN with one layer, but I have run into a problem. Keras reports:

    ValueError: Error when checking model input: expected conv1d_1_input to have 3 dimensions, but got array with shape (569, 30)

This is the code:

    import numpy
    from keras.models import Sequential
    from keras.layers.convolutional import Conv1D
    numpy.random.seed(7)
    datasetTraining = numpy.loadtxt("CancerAdapter.csv",delimiter=",")
    X = datasetTraining[:,1:31]
    Y = datasetTraining[:,0]
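The error message points at the input rank: Conv1D expects a 3-D array of shape (samples, steps, channels), while X here is 2-D. A minimal sketch of the reshape follows, assuming each of the 30 columns is treated as one step with a single channel. The zero array is a placeholder for the CSV data, which is not available here.

```python
import numpy as np

# Placeholder with the shape reported in the error message: 569 samples, 30 features.
X = np.zeros((569, 30))

# Add a trailing channel axis: (samples, steps) -> (samples, steps, 1).
X3d = X.reshape(X.shape[0], X.shape[1], 1)

# Per-sample shape the first Conv1D layer would be declared with,
# e.g. Conv1D(..., input_shape=input_shape) in the Keras version used above.
input_shape = X3d.shape[1:]
```

`np.expand_dims(X, axis=2)` gives the same result as the reshape.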

Dimension of shape in conv1D

旧城冷巷雨未停 submitted on 2019-12-17 05:40:00
Question: I have tried to build a CNN with one layer, but I have run into a problem. Keras reports:

    ValueError: Error when checking model input: expected conv1d_1_input to have 3 dimensions, but got array with shape (569, 30)

This is the code:

    import numpy
    from keras.models import Sequential
    from keras.layers.convolutional import Conv1D
    numpy.random.seed(7)
    datasetTraining = numpy.loadtxt("CancerAdapter.csv",delimiter=",")
    X = datasetTraining[:,1:31]
    Y = datasetTraining[:,0]

weka batch filtering StringToWordVector

大憨熊 submitted on 2019-12-13 17:30:15
Question: I'm trying to use Weka for text classification. I have two ARFF files. One is for the training set (example of a data row):

    "mouse",no,no,no,no,no,yes,no

and the other is for the test set (example of a data row):

    "cat",?,?,?,?,?,?,?

They have the same attribute declarations, but if I use batch filtering it tells me "Input file formats differ". Why? Here is the command that I use:

    C:\Programmi\Weka-3-6>java -cp C:\Programmi\Weka-3-6\weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b

Why Mallet text classification output the same value 1.0 for all test files?

旧巷老猫 submitted on 2019-12-13 03:38:23
Question: I am learning the Mallet text classification command lines. The output values estimated for the different classes are all the same, 1.0. I do not know where I went wrong. Can you help?

    mallet version: E:\Mallet\mallet-2.0.8RC3
    //there is a txt file about cat breed (catmaterial.txt) in cat dir.
    //command 1
    C:\Users\toshiba>mallet import-dir --input E:\Mallet\testmaterial\cat --output E:\Mallet\testmaterial\cat.mallet --remove-stopwords
    //command 1 output
    Labels = E:\Mallet\testmaterial\cat /

UserWarning: Label not :NUMBER: is present in all training examples

僤鯓⒐⒋嵵緔 submitted on 2019-12-12 08:34:40
Question: I am doing multilabel classification, where I try to predict the correct labels for each document. Here is my code:

    mlb = MultiLabelBinarizer()
    X = dataframe['body'].values
    y = mlb.fit_transform(dataframe['tag'].values)
    classifier = Pipeline([
        ('vectorizer', CountVectorizer(lowercase=True, stop_words='english', max_df = 0.8, min_df = 10)),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    predicted = cross_val_predict(classifier, X, y)

When running my code I get
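The warning in the title is typically raised by OneVsRestClassifier when, within a training split, one label column is constant, for example because a rare tag occurs only in the held-out fold. One way to check is to count how many documents carry each tag before binarizing. The sketch below uses only the standard library. The helper name `rare_labels`, the threshold, and the toy tag lists are illustrative assumptions, not part of the original code.

```python
from collections import Counter


def rare_labels(tag_lists, min_count):
    """Return the tags that occur in fewer than `min_count` documents."""
    # set(tags) ensures a tag repeated within one document is counted once.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    return sorted(tag for tag, n in counts.items() if n < min_count)


# Toy stand-in for dataframe['tag'].values (made up for illustration).
tags = [["a", "b"], ["a"], ["a", "c"]]
too_rare = rare_labels(tags, min_count=2)
```

Tags flagged this way can be dropped or merged before cross-validation, or the number of folds can be reduced so that each training split still contains every label.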