text-classification

sklearn classifier get ValueError: bad input shape

Submitted by 眉间皱痕 on 2019-12-10 12:31:53
Question: I have a CSV whose columns are CAT1,CAT2,TITLE,URL,CONTENT; the CAT1, CAT2, TITLE, and CONTENT fields are in Chinese. I want to train a LinearSVC or MultinomialNB model with X = TITLE and labels (CAT1, CAT2), but both raise this error. Below is my code (PS: I wrote it following the scikit-learn text_analytics example):

import numpy as np
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

label_list = []

def label_map_target(label):
    '''
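The "bad input shape" ValueError usually means the classifier was handed a 2-D label array, for example both columns (CAT1, CAT2) at once: scikit-learn classifiers expect y to be a 1-D array with one label per sample. A minimal sketch of the fix, with invented example titles and labels standing in for the CSV data above:

```python
# Sketch of the usual fix for "ValueError: bad input shape":
# y must be 1-D (one label per row), not a 2-D (CAT1, CAT2) array.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

# Hypothetical pre-segmented Chinese titles and a single label column.
titles = ["价格 便宜", "手机 不错", "电池 耐用", "价格 太贵"]
labels = ["shopping", "phone", "phone", "shopping"]  # 1-D, one per sample

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC()),
])
clf.fit(titles, labels)  # y is 1-D, so no "bad input shape"
print(clf.predict(["电池 不错"]))
```

To predict both CAT1 and CAT2, train one classifier per label column, or wrap the estimator in `sklearn.multioutput.MultiOutputClassifier`.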

Simple text classification using naive bayes (weka) in java

Submitted by 馋奶兔 on 2019-12-10 04:35:09
Question: I am trying to do text classification with naive Bayes using the Weka library in my Java code, but I think the classification result is not correct, and I don't know what the problem is. I use an ARFF file for the input. This is my training data:

@relation hamspam
@attribute text string
@attribute class {spam,ham}
@data
'good',ham
'good',ham
'very good',ham
'bad',spam
'very bad',spam
'very bad, very bad',spam
'good good bad',ham

This is my testing data:

@relation test
@attribute text string
@attribute class {spam
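A common pitfall in Weka is feeding the raw string attribute to NaiveBayes without first applying the StringToWordVector filter to turn it into word-count features. The same pipeline can be sketched with scikit-learn (an analogous toolkit, not Weka itself), where CountVectorizer plays the role of StringToWordVector:

```python
# Sketch of the same spam/ham task in scikit-learn instead of Weka:
# CountVectorizer converts strings to word counts (like Weka's
# StringToWordVector filter) before the naive Bayes step.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_text = ["good", "good", "very good", "bad", "very bad",
              "very bad, very bad", "good good bad"]
train_class = ["ham", "ham", "ham", "spam", "spam", "spam", "ham"]

model = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
model.fit(train_text, train_class)
print(model.predict(["very bad", "good"]))
```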

How can I use a list of lists, or a list of sets, for the TfidfVectorizer?

Submitted by 房东的猫 on 2019-12-08 03:05:54
Question: I'm using the sklearn TfidfVectorizer for text classification. I know this vectorizer wants raw text as input, but a list of strings works (see input1). However, if I try to use a list of lists (or sets) I get the following AttributeError. Does anyone know how to tackle this problem? Thanks in advance!

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]
input2 = [["This", "is", "a", "test"],
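The AttributeError arises because TfidfVectorizer's default preprocessing calls string methods (such as .lower()) on each document, and a list of tokens is not a string. Two common workarounds, sketched below: join each token list back into one string per document, or pass a callable analyzer so the vectorizer treats each list as already tokenized:

```python
# Sketch: two ways to feed pre-tokenized documents to TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]

# Option 1: join each token list into one string per document.
joined = [" ".join(tokens) for tokens in docs]
v1 = TfidfVectorizer()
X1 = v1.fit_transform(joined)

# Option 2: a callable analyzer bypasses preprocessing and tokenization,
# so each token list is used as-is.
v2 = TfidfVectorizer(analyzer=lambda tokens: tokens)
X2 = v2.fit_transform(docs)

print(X1.shape, X2.shape)
```

Note that in option 1 the default tokenizer drops single-character tokens like "a", while option 2 keeps every token verbatim (including case), so the vocabularies differ slightly.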

Error using “TermDocumentMatrix” and “Dist” functions in R

Submitted by 会有一股神秘感。 on 2019-12-08 02:11:38
Question: I have been trying to replicate the example here, but I have had some problems along the way. Everything worked fine until this step:

docsTDM <- TermDocumentMatrix(docs8)

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

So I was able to fix that error by modifying the previous step, changing this: docs8 <- tm

FastText using pre-trained word vector for text classification

Submitted by ≡放荡痞女 on 2019-12-06 22:16:33
Question: I am working on a text classification problem: given some text, I need to assign certain given labels to it. I have tried Facebook's fastText library, which has two utilities of interest to me: (A) word vectors with pre-trained models, and (B) text classification utilities. However, these seem to be completely independent tools, as I have been unable to find any tutorials that merge the two. What I want is to be able to classify some text by taking advantage of the
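The two utilities do connect: fastText's supervised mode can be seeded with pre-trained word vectors via its pretrainedVectors option (with a matching dim). The underlying idea, averaging word vectors into a document vector and classifying that, can be sketched with toy data; the vectors and labels below are invented for illustration, and real ones would be loaded from a .vec file:

```python
import numpy as np

# Toy "pretrained" word vectors (invented; real ones come from a .vec file).
vecs = {
    "good":  np.array([1.0, 0.1]),
    "great": np.array([0.9, 0.2]),
    "bad":   np.array([-1.0, 0.0]),
    "awful": np.array([-0.8, -0.1]),
}

def doc_vector(text):
    """Average the pretrained vectors of the known words in the text."""
    words = [w for w in text.split() if w in vecs]
    return np.mean([vecs[w] for w in words], axis=0)

# One averaged centroid per class; classify by nearest centroid.
train = {
    "positive": doc_vector("good great"),
    "negative": doc_vector("bad awful"),
}

def classify(text):
    d = doc_vector(text)
    return min(train, key=lambda c: np.linalg.norm(train[c] - d))

print(classify("great good good"))
```

fastText's supervised classifier learns a linear model on top of such averaged representations rather than using fixed centroids, but the flow from pre-trained vectors to labels is the same.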

Why do Tensorflow tf.learn classification results vary a lot?

Submitted by 允我心安 on 2019-12-06 15:30:55
Question: I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification, but at the moment I check every label separately). My code is very similar to the tf.learn tutorial:

classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y
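Run-to-run variation in a setup like this typically comes from random weight initialization, dropout, and data shuffling; fixing the random seed (in tf.learn, via the RunConfig passed to the estimator) makes runs reproducible. The effect of seeding can be sketched outside TensorFlow with plain NumPy:

```python
import numpy as np

# Unseeded "weight initializations" differ from run to run, so the
# trained network (and its evaluation scores) will too.
w_a = np.random.randn(10)
w_b = np.random.randn(10)
print("unseeded runs identical?", np.allclose(w_a, w_b))

# Fixing the seed makes every "run" start from identical weights.
w1 = np.random.RandomState(42).randn(10)
w2 = np.random.RandomState(42).randn(10)
print("seeded runs identical?", np.allclose(w1, w2))
```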

Error using “TermDocumentMatrix” and “Dist” functions in R

Submitted by 柔情痞子 on 2019-12-06 13:39:30
I have been trying to replicate the example here, but I have had some problems along the way. Everything worked fine until this step:

docsTDM <- TermDocumentMatrix(docs8)

Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

So I was able to fix that error by modifying the previous step, changing this:

docs8 <- tm_map(docs7, tolower)

to this:

docs8 <- tm_map(docs7, content_transformer(tolower))

But then I got in

Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

Submitted by 99封情书 on 2019-12-06 11:47:47
Question: I am implementing different classifiers using different machine learning algorithms. I'm classifying text files, and I do the following:

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

When I use the GaussianNB algorithm, the following error occurs:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a
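The error occurs because GaussianNB requires a dense array while the TF-IDF step emits a sparse matrix. A sketch of one fix, with invented example texts: insert a densifying FunctionTransformer before the classifier (alternatively, switch to MultinomialNB, which accepts sparse input and is usually the better fit for term counts anyway):

```python
# Sketch: densify the sparse TF-IDF matrix before GaussianNB.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    # Convert the sparse matrix to the dense array GaussianNB needs.
    ('to_dense', FunctionTransformer(
        lambda X: X.toarray(), accept_sparse=True)),
    ('clf', OneVsRestClassifier(GaussianNB()))])

# Invented toy data in place of the question's text files.
X_train = ["spam spam offer", "meeting tomorrow",
           "free offer now", "project meeting"]
Y = ["spam", "ham", "spam", "ham"]
classifier.fit(X_train, Y)
print(classifier.predict(["free spam offer"]))
```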

How can I use a list of lists, or a list of sets, for the TfidfVectorizer?

Submitted by 徘徊边缘 on 2019-12-06 06:26:56
I'm using the sklearn TfidfVectorizer for text classification. I know this vectorizer wants raw text as input, but a list of strings works (see input1). However, if I try to use a list of lists (or sets) I get the following AttributeError. Does anyone know how to tackle this problem? Thanks in advance!

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
input1 = ["This", "is", "a", "test"]
input2 = [["This", "is", "a", "test"], ["It", "is", "raining", "today"]]
print(vectorizer.fit_transform(input1))  # works
print(vectorizer.fit

R: problems applying LIME to quanteda text model

Submitted by ε祈祈猫儿з on 2019-12-06 05:20:15
It's a modified version of my previous question: I'm trying to run LIME on my quanteda text model that feeds off Trump & Clinton tweets data. I run it following the example given by Thomas Pedersen in his "Understanding LIME" and the useful SO answer provided by @Weihuang Wong:

library(dplyr)
library(stringr)
library(quanteda)
library(lime)

# data prep
tweet_csv <- read_csv("tweets.csv")

# creating corpus and dfm for train and test sets
get_matrix <- function(df){
  corpus <- quanteda::corpus(df)
  dfm <- quanteda::dfm(corpus, remove_url = TRUE, remove_punct = TRUE,
                       remove = stopwords("english"))
}