text-classification | 易学教程

McNemar's test in Python and comparison of classification machine learning models [closed]

Question (closed 2 years ago as off-topic for Stack Overflow): Is there a good implementation of McNemar's test in Python? I don't see one anywhere in scipy.stats or scikit-learn. I may have overlooked some other good packages; please recommend one. McNemar's test is almost THE test for comparing two classification algorithms/models given a holdout test set (not through K-fold or
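As a rough illustration of what the question is asking for, here is a minimal standard-library sketch of McNemar's test on a holdout set. It is not taken from the question or from any answer; the function names (mcnemar_counts, mcnemar_exact, mcnemar_chi2) are hypothetical, and the example labels are made up. The statsmodels package also ships an mcnemar function in statsmodels.stats.contingency_tables, which is likely the better choice in practice.

```python
from math import comb, erfc, sqrt


def mcnemar_counts(y_true, pred_a, pred_b):
    """Count discordant pairs: b = only model A correct, c = only model B correct."""
    b = c = 0
    for truth, a, b_pred in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (b_pred == truth)
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact binomial p-value.

    Under the null hypothesis each discordant pair is a fair coin flip,
    so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mcnemar_chi2(b, c):
    """Continuity-corrected chi-square statistic and p-value (1 degree of freedom)."""
    n = b + c
    if n == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / n
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))


# Made-up holdout labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
pred_b = [1, 1, 0, 1, 0, 0, 0, 1, 0, 1]
b, c = mcnemar_counts(y_true, pred_a, pred_b)
print(b, c, mcnemar_exact(b, c))
```

The exact variant is usually preferred when the number of discordant pairs is small; the chi-square variant is the textbook approximation for larger counts.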

Naive-bayes multinomial text classifier using Data frame in Scala Spark

Question: I am trying to build a NaiveBayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (multinomial label):

+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|

I have used the following transformations for tokenization, stop-word removal, n-grams, and hashing TF:

val selectedData = df.select("label", "feature")
// Tokenize RDD
val tokenizer = new

CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Question: I have a one-dimensional array with a large string in each element, and I am trying to use a CountVectorizer to convert the text data into numerical vectors. However, I get this error:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

mealarray contains a large string in each element; there are 5000 such samples. I am trying to vectorize it as follows:

vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  #ngram_range=(1, 1) is the

python textblob and text classification

Question: I'm trying to build a text classification model with Python and TextBlob. The script is running on my server, and the idea is that in the future users will be able to submit their text and have it classified. I'm loading the training set from a CSV file:

# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = open('yyyyyyyyy.txt', "w")
from nltk.tokenize import word_tokenize
from textblob.classifiers import NaiveBayesClassifier
with open('file.csv', 'r', encoding='latin-1') as fp:
    cl =

Scalable or online out-of-core multi-label classifiers

Question: I have been racking my brains over this problem for the past 2-3 weeks. I have a multi-label (not multi-class) problem where each sample can belong to several labels. I have around 4.5 million text documents as training data and around 1 million as test data, with around 35K labels. I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not that scalable given the number

Dimension of shape in conv1D

Question: I have tried to build a CNN with one layer, but I have a problem with it. I get the following error:

ValueError: Error when checking model input: expected conv1d_1_input to have 3 dimensions, but got array with shape (569, 30)

This is the code:

import numpy
from keras.models import Sequential
from keras.layers.convolutional import Conv1D
numpy.random.seed(7)
datasetTraining = numpy.loadtxt("CancerAdapter.csv",delimiter=",")
X = datasetTraining[:,1:31]
Y = datasetTraining[:,0]
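The error message itself suggests the likely cause: Conv1D expects three-dimensional input (samples, steps, channels), while X here has shape (569, 30). The sketch below uses plain nested lists only to show the shape change, so that it runs without any dependencies; shape_of and add_channel_axis are hypothetical helper names, and the zero-filled matrix is a stand-in for the real data. With NumPy arrays the equivalent step is typically X.reshape(X.shape[0], X.shape[1], 1), with the first layer then declared with input_shape=(30, 1).

```python
def shape_of(nested):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)


def add_channel_axis(rows):
    """Turn (samples, steps) into (samples, steps, 1) by wrapping each value in a list."""
    return [[[value] for value in row] for row in rows]


# Stand-in for the (569, 30) feature matrix from the question.
X = [[0.0] * 30 for _ in range(569)]
X3 = add_channel_axis(X)

print(shape_of(X))   # shape before adding the channel axis
print(shape_of(X3))  # shape after adding the channel axis
```

Whether a single channel is the right reading of these 30 features is a modelling decision the truncated question does not settle; the sketch only addresses the dimensionality mismatch.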

weka batch filtering StringToWordVector

Question: I'm trying to use Weka for text classification. I have two ARFF files. One is for the training set (example data row):

"mouse",no,no,no,no,no,yes,no

and the other is for the test set (example data row):

"cat",?,?,?,?,?,?,?

They have the same attribute declarations, but when I use batch filtering it tells me "Input file formats differ". Why? Here is the command that I use:

C:\Programmi\Weka-3-6>java -cp C:\Programmi\Weka-3-6\weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b

Why Mallet text classification output the same value 1.0 for all test files?

Question: I am learning the Mallet text classification command lines. The output values estimated for the different classes are all the same, 1.0. I do not know what I am doing wrong. Can you help?

mallet version: E:\Mallet\mallet-2.0.8RC3

//there is a txt file about cat breed (catmaterial.txt) in cat dir.
//command 1
C:\Users\toshiba>mallet import-dir --input E:\Mallet\testmaterial\cat --output E:\Mallet\testmaterial\cat.mallet --remove-stopwords
//command 1 output
Labels = E:\Mallet\testmaterial\cat /

UserWarning: Label not :NUMBER: is present in all training examples

Question: I am doing multilabel classification, where I try to predict the correct labels for each document. Here is my code:

mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)
classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, stop_words='english', max_df=0.8, min_df=10)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
predicted = cross_val_predict(classifier, X, y)

When running my code I get
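The warning in the title appears to be raised when a label's column is constant in the data a one-vs-rest classifier is fitted on; "Label not X is present in all training examples" would then mean that label X never occurs in that training fold. A minimal standard-library sketch for checking this is shown below. It assumes plain, unshuffled K-fold splitting; the helper names (kfold_test_ranges, labels_absent_from_training), the n_splits value, and the example tags are all hypothetical and not taken from the question.

```python
def kfold_test_ranges(n_samples, n_splits):
    """Yield (start, stop) for each contiguous held-out block, as unshuffled K-fold would."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        yield start, stop
        start = stop


def labels_absent_from_training(tags_per_doc, n_splits=3):
    """Map fold index -> labels that occur only in that fold's held-out block."""
    all_labels = set().union(*tags_per_doc) if tags_per_doc else set()
    report = {}
    ranges = kfold_test_ranges(len(tags_per_doc), n_splits)
    for fold, (start, stop) in enumerate(ranges):
        train_docs = tags_per_doc[:start] + tags_per_doc[stop:]
        seen = set().union(*train_docs) if train_docs else set()
        missing = sorted(all_labels - seen)
        if missing:
            report[fold] = missing
    return report


# Made-up tags: 'c' and 'd' each occur in a single document.
example_tags = [["a", "b"], ["a"], ["b", "c"], ["a"], ["b"], ["a", "d"]]
print(labels_absent_from_training(example_tags, n_splits=3))
```

If rare labels turn out to be the cause, common options are dropping labels below a minimum frequency or shuffling before splitting, though neither guarantees every label appears in every training fold.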