classification

Multiple classification models in a scikit pipeline python

百般思念 submitted on 2019-12-12 11:15:17
Question: I am solving a binary classification problem over some text documents using Python and the scikit-learn library, and I wish to try different models to compare and contrast results - mainly using a Naive Bayes classifier, SVM with K-Fold CV, and CV=5. I am having difficulty combining all of the methods into one pipeline, given that the latter two models use GridSearchCV(). I cannot have multiple Pipelines running during a single implementation due to concurrency issues,
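One way to compare several classifiers in a single search is to treat the final pipeline step itself as a tunable parameter. The following is a minimal sketch, not the asker's code: the toy corpus, the TF-IDF front end, the choice of LinearSVC as the SVM, and the small grids are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (6 documents per class so that cv=5 can stratify).
spam = ["win cash prize now", "cheap pills buy now", "free money offer today",
        "claim your free prize", "buy cheap watches online", "win a free holiday now"]
ham = ["meeting moved to monday", "please review the attached report",
       "lunch at noon tomorrow", "the project deadline is friday",
       "notes from the team meeting", "see the report before lunch"]
docs = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# The "clf" step is a placeholder that the grid search swaps out.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# One dict per model: each sets the estimator and that estimator's own grid.
param_grid = [
    {"clf": [MultinomialNB()], "clf__alpha": [0.1, 1.0]},
    {"clf": [LinearSVC()], "clf__C": [0.1, 1.0, 10.0]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(docs, labels)

best_name = type(search.best_params_["clf"]).__name__
print(best_name, search.best_score_)
```

Because every candidate runs through the same 5-fold split inside one GridSearchCV call, the scores in `search.cv_results_` are directly comparable across models.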

Plotting learning curve in keras gives KeyError: 'val_acc'

有些话、适合烂在心里 submitted on 2019-12-12 11:04:30
Question: I was trying to plot train and test learning curves in Keras, however, the following code produces a KeyError: 'val_acc' error. The official documentation <https://keras.io/callbacks/> states that in order to use 'val_acc' I need to enable validation and accuracy monitoring, which I don't understand and don't know how to use in my code. Any help would be much appreciated. Thanks. seed = 7
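A validation accuracy entry only appears in the training history when an accuracy metric was passed to compile() and validation data (or a validation split) was passed to fit(). The key name also differs between Keras releases: older ones record 'acc' and 'val_acc', newer ones 'accuracy' and 'val_accuracy'. The helper below is a hypothetical sketch that works on a plain dict shaped like `history.history`; the sample dict is invented for illustration.

```python
def accuracy_keys(history_dict):
    """Return (train_key, val_key) for accuracy in a Keras-style history dict."""
    for train_key in ("accuracy", "acc"):
        if train_key in history_dict:
            break
    else:
        raise KeyError("no accuracy metric recorded; "
                       "pass metrics=['accuracy'] to compile()")
    val_key = "val_" + train_key
    if val_key not in history_dict:
        raise KeyError("no validation metric recorded; "
                       "pass validation_split or validation_data to fit()")
    return train_key, val_key


# Invented stand-in for history.history after two epochs.
sample_history = {
    "loss": [0.69, 0.61], "accuracy": [0.55, 0.67],
    "val_loss": [0.68, 0.63], "val_accuracy": [0.52, 0.64],
}
train_key, val_key = accuracy_keys(sample_history)
print(train_key, val_key)
```

The two returned keys can then be used to index the history when plotting, instead of hard-coding 'val_acc'.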

Load pickled classifier data : Vocabulary not fitted Error

瘦欲@ submitted on 2019-12-12 10:54:59
Question: I have read all related questions here but couldn't find a working solution. My classifier creation: class StemmedTfidfVectorizer(TfidfVectorizer): def build_analyzer(self): analyzer = super(TfidfVectorizer, self).build_analyzer() return lambda doc: english_stemmer.stemWords(analyzer(doc)) tf = StemmedTfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df = 0, max_features=200000, stop_words = 'english') def create_tfidf(f): docs = [] targets = [] with open(f, "r") as sentences_file:

ValueError: The number of classes has to be greater than one (python)

若如初见. submitted on 2019-12-12 10:49:22
Question: When passing x, y to fit, I get the following error: Traceback (most recent call last): File "C:/Classify/classifier.py", line 95, in train_avg, test_avg, cms = train_model(X, y, "ceps", plot=True) File "C:/Classify/classifier.py", line 47, in train_model clf.fit(X_train, y_train) File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 676, in fit raise ValueError("The number of classes has to be greater than" ValueError: The number of classes has to be greater than one. Below
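This error is raised when the labels passed to fit contain only one distinct class, which can happen when a split leaves a single class in the training portion. A small guard run before fitting makes the cause visible. The function name and the sample labels below are hypothetical, chosen only for illustration.

```python
from collections import Counter


def check_class_count(y):
    """Count labels and fail early if fewer than two classes are present."""
    counts = Counter(y)
    if len(counts) < 2:
        raise ValueError("training labels contain only one class: %r"
                         % sorted(counts))
    return counts


# Invented label vectors: the first is usable, the second would trigger the error.
ok_counts = check_class_count(["rock", "jazz", "rock", "jazz"])
print(ok_counts)
```

Calling it on `y_train` just before `clf.fit(X_train, y_train)` shows which classes, and how many samples of each, reached the classifier.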

Does sklearn support a cost matrix?

你离开我真会死。 submitted on 2019-12-12 10:35:35
Question: Is it possible to train classifiers in sklearn with a cost matrix with different costs for different mistakes? For example, in a 2-class problem, the cost matrix would be a 2 by 2 square matrix, with A_ij = cost of classifying i as j. The main classifier I am using is a Random Forest. Thanks.
Answer 1: The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.
Answer 2: One way to circumvent this limitation is to use under- or oversampling. E.g.,
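A cost matrix can also be applied after training, at prediction time, by choosing the class with the lowest expected cost. This is a different technique from the resampling suggested in the second answer, and it assumes the classifier produces reasonably calibrated class probabilities (for example from `predict_proba`). The cost values and probabilities below are invented for illustration.

```python
def min_cost_class(probs, cost):
    """Pick the prediction j that minimises sum_i probs[i] * cost[i][j].

    probs[i] is the estimated probability that the true class is i;
    cost[i][j] is the cost of classifying a true i as j.
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)


# Hypothetical costs: missing a true class 1 is ten times worse than a false alarm.
cost = [[0, 1],
        [10, 0]]

# Expected cost of predicting 0 is 0.2 * 10 = 2.0; of predicting 1 is 0.8 * 1 = 0.8.
decision = min_cost_class([0.8, 0.2], cost)
print(decision)
```

With these costs the decision is class 1 even though class 0 is the more probable one, which is the behaviour a cost matrix is meant to produce.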

Retrieve final hidden activation layer output from sklearn's MLPClassifier

為{幸葍}努か submitted on 2019-12-12 10:06:07
Question: I would like to do some tests with neural network final hidden activation layer outputs using sklearn's MLPClassifier after fitting the data. For example, if I create a classifier, assuming data X_train with labels y_train and two hidden layers of sizes (300,100): clf = MLPClassifier(hidden_layer_sizes=(300,100)) clf.fit(X_train,y_train) I would like to be able to call a function somehow to retrieve the final hidden activation layer vector of length 100 for use in additional tests. Assuming a
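A fitted MLPClassifier exposes its weights as `coefs_` and `intercepts_`, so the hidden activations can be recomputed with a manual forward pass that stops before the output layer. The sketch below is an illustration under assumptions: the helper name, the random toy data, and the reduced layer sizes (30, 10) are not from the question, which uses (300, 100).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hidden-layer activation functions selectable via MLPClassifier(activation=...).
ACTIVATIONS = {
    "identity": lambda z: z,
    "logistic": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
}


def last_hidden_activations(clf, X):
    """Forward pass through the hidden layers only, skipping the output layer."""
    act = np.asarray(X, dtype=float)
    hidden_fn = ACTIVATIONS[clf.activation]
    for weights, bias in zip(clf.coefs_[:-1], clf.intercepts_[:-1]):
        act = hidden_fn(act @ weights + bias)
    return act


# Invented toy data standing in for X_train / y_train.
rng = np.random.RandomState(0)
X_train = rng.rand(40, 5)
y_train = (X_train[:, 0] > 0.5).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(30, 10), max_iter=50, random_state=0)
clf.fit(X_train, y_train)

hidden = last_hidden_activations(clf, X_train)
print(hidden.shape)  # one row per sample, one column per unit in the last hidden layer
```

The few training iterations may trigger a convergence warning on this toy data; that does not affect the shape or meaning of the returned activations.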

MultiClass using LIBSVM

纵饮孤独 submitted on 2019-12-12 09:24:03
Question: I have a multiclass SVM classification problem (6 classes). I would like to classify it using LIBSVM. The following are the approaches that I have tried, and I have some questions regarding them. Method 1 (one vs one): model = svmtrain(TrainLabel, TrainVec, '-c 1 -g 0.00154 -b 0.9'); [predict_label, accuracy, dec_values] = svmpredict(TestLabel, TestVec, model); Two questions about this method: 1) Is that all I need to do for a multiclass problem? 2) What value should n be in '-b n'? I'm not sure. Method 2(

Convert predicted probabilities after downsampling to actual probabilities in classification (using mlr)

ε祈祈猫儿з submitted on 2019-12-12 08:54:27
Question: If I use undersampling in case of an unbalanced binary target variable to train a model, the prediction method calculates probabilities under the assumption of a balanced data set. How can I convert these probabilities to actual probabilities for the unbalanced data? Is there a conversion argument/function implemented in the mlr package or another package? For example: a <- data.frame(y=factor(sample(0:1, prob = c(0.1,0.9), replace=T, size=100))) a$x <- as.numeric(a$y)+rnorm(n=100, sd=1) task <
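If resampling changed only the class proportions and left the distribution of the features within each class unchanged, the predicted probabilities can be mapped back with Bayes' rule by reweighting each class by the ratio of its true prior to its resampled prior. The sketch below shows that arithmetic in Python rather than R; it is not an mlr function, and the function name and numbers are hypothetical.

```python
def correct_probability(p_sampled, prior_sampled, prior_true):
    """Map P(y=1 | x) from a resampled training set back to the original prior.

    p_sampled:     probability of class 1 predicted by the model
    prior_sampled: share of class 1 in the resampled training data
    prior_true:    share of class 1 in the original data
    """
    pos = p_sampled * prior_true / prior_sampled
    neg = (1.0 - p_sampled) * (1.0 - prior_true) / (1.0 - prior_sampled)
    return pos / (pos + neg)


# Model trained on balanced data (0.5) for a class that is 10% of the original data.
corrected = correct_probability(0.9, prior_sampled=0.5, prior_true=0.1)
print(corrected)
```

A prediction of exactly the resampled prior maps back to the true prior, which is a quick sanity check for the direction of the correction.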

UserWarning: Label not :NUMBER: is present in all training examples

僤鯓⒐⒋嵵緔 submitted on 2019-12-12 08:34:40
Question: I am doing multilabel classification, where I try to predict correct labels for each document, and here is my code: mlb = MultiLabelBinarizer() X = dataframe['body'].values y = mlb.fit_transform(dataframe['tag'].values) classifier = Pipeline([ ('vectorizer', CountVectorizer(lowercase=True, stop_words='english', max_df = 0.8, min_df = 10)), ('tfidf', TfidfTransformer()), ('clf', OneVsRestClassifier(LinearSVC()))]) predicted = cross_val_predict(classifier, X, y) When running my code I get

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

ぐ巨炮叔叔 submitted on 2019-12-12 08:12:50
Question: I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like: Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which form the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.) Go through the HTML source of pages from disparate sites and "classify" whether the