text-classification

Sklearn SGDC partial_fit ValueError: classes should include all valid labels that can be in y

Submitted by 爷,独闯天下 on 2019-12-12 04:01:51
Question: I loaded an already-trained SGDClassifier model and tried to call partial_fit again with a new feature set and data, but received ValueError: classes should include all valid labels that can be in y. My class_weight is None, since I want each class to have equal weight. model_predicted_networktype = joblib.load(f) new_training_data_count_matrix = count_vect_predicted_networktype.transform(training_dataset) new_training_tf_idf = tf_idf(new_training_data_count_matrix) model_predicted_networktype.partial_fit(new
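This error typically means the new y contains a label the model has never been told about. A minimal sketch of the incremental-training pattern, on toy data (the variable names and label set are illustrative, not from the question): the first partial_fit call must receive the full set of classes up front, after which later batches may contain any subset of them.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Every label that can ever appear in y must be declared on the first call.
all_classes = np.array([0, 1, 2])

clf = SGDClassifier(random_state=0)

# First batch: only labels 0 and 1 occur, but classes= declares all three.
X1 = np.random.RandomState(0).rand(10, 4)
y1 = np.array([0, 1] * 5)
clf.partial_fit(X1, y1, classes=all_classes)

# Later batch: label 2 appears for the first time; no classes= needed now.
X2 = np.random.RandomState(1).rand(6, 4)
y2 = np.array([2, 0, 1, 2, 0, 1])
clf.partial_fit(X2, y2)

print(clf.classes_)  # [0 1 2]
```

If a model is reloaded with joblib and then fed a batch whose labels fall outside its stored `classes_`, the same ValueError is raised; the fix is to train with the complete label set from the start.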

In general, when does TF-IDF reduce accuracy?

Submitted by 狂风中的少年 on 2019-12-12 03:48:45
Question: I'm training a corpus of 200,000 reviews, split into positive and negative, with a Naive Bayes model, and I noticed that applying TF-IDF actually reduced accuracy (testing on a set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF makes any underlying assumptions about the data or the model it works with, i.e. whether there are cases where its use reduces accuracy? Answer 1: The IDF component of TF*IDF can harm your classification accuracy in some cases. Let's suppose
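Whether IDF helps is corpus-dependent, so the most reliable check is to measure both variants on your own data. A hedged sketch on a toy corpus (the texts and labels are invented for illustration): `TfidfTransformer(use_idf=False)` gives plain term-frequency weighting, so the two pipelines differ only in the IDF component.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great movie loved it", "terrible film hated it",
               "wonderful acting great plot", "awful boring terrible"]
train_labels = ["pos", "neg", "pos", "neg"]

# Identical pipelines except for the IDF term.
with_idf = make_pipeline(CountVectorizer(),
                         TfidfTransformer(use_idf=True), MultinomialNB())
without_idf = make_pipeline(CountVectorizer(),
                            TfidfTransformer(use_idf=False), MultinomialNB())

for name, model in [("tf-idf", with_idf), ("tf only", without_idf)]:
    model.fit(train_texts, train_labels)
    print(name, model.predict(["great plot", "boring film"]))
```

On a real corpus, replace the predictions with an accuracy score on a held-out test set to quantify the 2% gap the question describes.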

Predicting the “no class” / unrecognised class in Weka Machine Learning

Submitted by 白昼怎懂夜的黑 on 2019-12-12 03:27:34
Question: I am using Weka 3.7 to classify text documents based on their content. I have a set of text files in folders, and each folder belongs to a certain category. Category A: 100 txt files Category B: 100 txt files ... Category X: 100 txt files I want to predict whether a document falls into one of the categories A-X, OR whether it falls into the category UNRECOGNISED (for all other documents). I am getting the total set of Instances programmatically like this: private Instances getTotalSet(){ ArrayList<Attribute>

R - Automatic categorization of Wikipedia articles

Submitted by 青春壹個敷衍的年華 on 2019-12-12 02:38:40
Question: I have been trying to follow this example by Norbert Ryciak, whom I haven't been able to get in touch with. Since the article was written in 2014, some things in R have changed, so I was able to update parts of the code, but I got stuck on the last part. Here is my working code so far: library(tm) library(stringi) library(proxy) wiki <- "https://en.wikipedia.org/wiki/" titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative", "Limit_of_a

How to apply Information Gain in RapidMiner with a separate test set?

Submitted by 梦想的初衷 on 2019-12-12 01:43:37
Question: I am dealing with text classification in RapidMiner. I have separate test and training splits. I applied Information Gain to a dataset using n-fold cross-validation, but I am confused about how to apply it to a separate test set. An image is attached below. In the figure I have connected the word list output from the first "Process Documents From Files" (used for training) to the second "Process Documents From Files" (used for testing), but I want to apply the reduced feature set to the second "Process

Improving the prediction score by use of confidence level of classifiers on instances

Submitted by ≡放荡痞女 on 2019-12-12 00:48:09
Question: I am using three classifiers ( RandomForestClassifier , KNeighborsClassifier , and an SVM classifier ), which you can see below: >> svm_clf_sl_GS SVC(C=5, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=True, random_state=41, shrinking=True, tol=0.001, verbose=False) >> knn_clf_sl_GS KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=3,
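One standard way to use each classifier's per-instance confidence is soft voting: average the three models' predict_proba outputs so a more confident model pulls the combined prediction toward its class. A sketch on the iris dataset (the question's grid-searched hyperparameters are omitted; `random_state=41` is carried over from the reprs above, the rest are illustrative defaults):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=41)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        # probability=True is required so SVC exposes predict_proba
        ("svm", SVC(probability=True, random_state=41)),
    ],
    voting="soft",  # average predicted probabilities instead of hard labels
)
vote.fit(X, y)

proba = vote.predict_proba(X[:1])
print(proba.shape)  # (1, 3): averaged class probabilities for one instance
```

The `weights=` parameter of VotingClassifier additionally lets you up-weight whichever model is more trustworthy overall, which is a common next step when plain averaging does not improve the score.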

Not able to load keras trained model

Submitted by 余生长醉 on 2019-12-11 19:45:12
Question: I am using the following code to train a HAN network. Code Link I have trained the model successfully, but when I tried to load it using Keras load_model it gives me the following error: Unknown layer: AttentionWithContext Answer 1: Add the following function in the AttentionWithContext.py file: def create_custom_objects(): instance_holder = {"instance": None} class ClassWrapper(AttentionWithContext): def __init__(self, *args, **kwargs): instance_holder["instance"] = self super(ClassWrapper, self)._
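The answer's pattern can be sketched without Keras installed: a dummy AttentionWithContext class stands in for the real custom layer, and create_custom_objects returns the dict you would pass as `custom_objects` to `load_model` so the deserializer can resolve the unknown layer name. The `units` parameter and the dummy class body are illustrative, not from the linked code.

```python
class AttentionWithContext:
    """Stand-in for the real Keras layer, so this sketch runs anywhere."""
    def __init__(self, units=64):
        self.units = units

def create_custom_objects():
    instance_holder = {"instance": None}

    class ClassWrapper(AttentionWithContext):
        def __init__(self, *args, **kwargs):
            # Remember the layer instance built during deserialization.
            instance_holder["instance"] = self
            super().__init__(*args, **kwargs)

    # Keras looks up unknown layer names in this dict while loading,
    # so both names map to the wrapper.
    return {"ClassWrapper": ClassWrapper,
            "AttentionWithContext": ClassWrapper}

custom_objects = create_custom_objects()
layer = custom_objects["AttentionWithContext"](units=32)
print(layer.units)  # 32
```

With the real layer class imported, the call site would be roughly `load_model("han_model.h5", custom_objects=create_custom_objects())` (the file name here is hypothetical).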

How to use a bigrams + trigrams + word-marks vocabulary in CountVectorizer?

Submitted by 流过昼夜 on 2019-12-11 15:17:10
Question: I'm using text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of: bigrams + trigrams + word-marks vocabulary. By word marks he means words that are specific to a certain dialect. How can I tweak those parameters in CountVectorizer? word marks: these are examples of word marks, but they aren't what I have, because mine are Arabic, so I translated them. word_marks=['love', 'funny', 'happy',
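One hedged way to combine the two feature sources is a FeatureUnion of two CountVectorizers: one with `ngram_range=(2, 3)` for bigrams plus trigrams, and one whose `vocabulary` is restricted to the word-marks list, so only those dialect-specific words are counted. The example documents below are invented; the word_marks list reuses the question's translated examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

word_marks = ["love", "funny", "happy"]

vectorizer = FeatureUnion([
    # bigrams and trigrams only (no unigrams)
    ("ngrams", CountVectorizer(ngram_range=(2, 3))),
    # counts restricted to the fixed word-marks vocabulary
    ("marks", CountVectorizer(vocabulary=word_marks)),
])

X = vectorizer.fit_transform(["happy love story", "funny sad story"])
print(X.shape)  # (2, 9): 6 n-gram columns + 3 word-mark columns
```

The combined matrix can then be fed straight into MultinomialNB, and for Arabic text the original (untranslated) word marks would go in the `vocabulary` list.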

Effective classification of natural text in scikit-learn/Python

Submitted by 不羁的心 on 2019-12-11 10:23:31
Question: I want my classification algorithm to assign my natural-language raw data to one of a set of categories if and only if it meets a certain confidence threshold for that category (say 80%); otherwise I want the classifier to assign that raw text to an 'unclassified' category. How do I do this? My example data set:

+----------------------+------------+
| Details              | Category   |
+----------------------+------------+
| Any raw text1        | cat1       |
+-------------------
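A common way to get this behaviour is to threshold predict_proba: keep the top class only when its probability clears the cutoff, otherwise return 'unclassified'. A sketch on invented data (the texts, labels, and helper name are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["cheap flights and hotels", "football match results",
         "flight deals today", "league scores tonight"]
labels = ["travel", "sport", "travel", "sport"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def predict_with_threshold(model, texts, threshold=0.8):
    """Return the top class per text, or 'unclassified' below the threshold."""
    proba = model.predict_proba(texts)
    best = proba.argmax(axis=1)
    return [model.classes_[i] if p[i] >= threshold else "unclassified"
            for i, p in zip(best, proba)]

# Text sharing no vocabulary with the training set stays near the prior,
# so it falls below the 0.8 cutoff.
print(predict_with_threshold(model, ["random gibberish text"]))
```

Note that predict_proba scores are not calibrated accuracy; if the 80% is meant literally, CalibratedClassifierCV can be wrapped around the estimator before thresholding.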

Improve the flow of a Python classifier and combine features

Submitted by 本秂侑毒 on 2019-12-11 04:42:01
Question: I am trying to create a classifier to categorize websites. I am doing this for the very first time, so it's all quite new to me. Currently I am doing some bag-of-words on a few parts of each web page (e.g. title, text, headings). It looks like this: from sklearn.feature_extraction.text import CountVectorizer countvect_text = CountVectorizer(encoding="cp1252", stop_words="english") countvect_title = CountVectorizer(encoding="cp1252", stop_words="english") countvect_headings =
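The usual way to combine per-field bags of words is to fit one vectorizer per field and stack the resulting matrices side by side. A sketch with two of the question's fields and invented documents (the `encoding` option is dropped since the strings here are already decoded):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Cheap flights", "Football scores"]
texts = ["book flights and hotels online", "latest league match results"]

countvect_title = CountVectorizer(stop_words="english")
countvect_text = CountVectorizer(stop_words="english")

# Fit each vectorizer on its own field, then concatenate the feature
# columns so each row describes one page across all fields.
X = hstack([countvect_title.fit_transform(titles),
            countvect_text.fit_transform(texts)])
print(X.shape)  # (2, 12): 4 title columns + 8 text columns
```

The stacked matrix goes straight into any scikit-learn classifier; for a tidier pipeline, ColumnTransformer on a DataFrame with one column per field achieves the same result and also handles transform-time alignment for new pages.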