text-classification

Sklearn SGDC partial_fit ValueError: classes should include all valid labels that can be in y

Submitted by 爷,独闯天下 on 2019-12-12 04:01:51
Question: I loaded an already-trained SGDClassifier model and tried to call partial_fit again with a new feature set and data, but received ValueError: classes should include all valid labels that can be in y. My class_weight is None, since I want each class to have equal weight. model_predicted_networktype = joblib.load(f) new_training_data_count_matrix = count_vect_predicted_networktype.transform(training_dataset) new_training_tf_idf = tf_idf(new_training_data_count_matrix) model_predicted_networktype.partial_fit(new
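This error typically means the new y contains a label the model has never been told about. A minimal sketch of the incremental-training pattern, on toy data (the variable names and label set are illustrative, not from the question): the first partial_fit call must receive the full set of classes up front, after which later batches may contain any subset of them.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Every label that can ever appear in y must be declared on the first call.
all_classes = np.array([0, 1, 2])

clf = SGDClassifier(random_state=0)

# First batch: only labels 0 and 1 occur, but classes= declares all three.
X1 = np.random.RandomState(0).rand(10, 4)
y1 = np.array([0, 1] * 5)
clf.partial_fit(X1, y1, classes=all_classes)

# Later batch: label 2 appears for the first time; no classes= needed now.
X2 = np.random.RandomState(1).rand(6, 4)
y2 = np.array([2, 0, 1, 2, 0, 1])
clf.partial_fit(X2, y2)

print(clf.classes_)  # [0 1 2]
```

If a model is reloaded with joblib and then fed a batch whose labels fall outside its stored `classes_`, the same ValueError is raised; the fix is to train with the complete label set from the start.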

In general, when does TF-IDF reduce accuracy?

Submitted by 狂风中的少年 on 2019-12-12 03:48:45
Question: I'm training a corpus of 200,000 reviews, split into positive and negative, with a Naive Bayes model, and I noticed that applying TF-IDF actually reduced accuracy (testing on a set of 50,000 reviews) by about 2%. So I was wondering whether TF-IDF makes any underlying assumptions about the data or the model it works with, i.e. whether there are cases where its use reduces accuracy? Answer 1: The IDF component of TF*IDF can harm your classification accuracy in some cases. Let's suppose
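Whether IDF helps is corpus-dependent, so the most reliable check is to measure both variants on your own data. A hedged sketch on a toy corpus (the texts and labels are invented for illustration): `TfidfTransformer(use_idf=False)` gives plain term-frequency weighting, so the two pipelines differ only in the IDF component.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great movie loved it", "terrible film hated it",
               "wonderful acting great plot", "awful boring terrible"]
train_labels = ["pos", "neg", "pos", "neg"]

# Identical pipelines except for the IDF term.
with_idf = make_pipeline(CountVectorizer(),
                         TfidfTransformer(use_idf=True), MultinomialNB())
without_idf = make_pipeline(CountVectorizer(),
                            TfidfTransformer(use_idf=False), MultinomialNB())

for name, model in [("tf-idf", with_idf), ("tf only", without_idf)]:
    model.fit(train_texts, train_labels)
    print(name, model.predict(["great plot", "boring film"]))
```

On a real corpus, replace the predictions with an accuracy score on a held-out test set to quantify the 2% gap the question describes.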

Predicting the “no class” / unrecognised class in Weka Machine Learning

Submitted by 白昼怎懂夜的黑 on 2019-12-12 03:27:34
Question: I am using Weka 3.7 to classify text documents based on their content. I have a set of text files in folders, and each folder belongs to a certain category. Category A: 100 txt files Category B: 100 txt files ... Category X: 100 txt files I want to predict whether a document falls into one of the categories A-X, OR whether it falls into the category UNRECOGNISED (for all other documents). I am getting the total set of Instances programmatically like this: private Instances getTotalSet(){ ArrayList<Attribute>

R - Automatic categorization of Wikipedia articles

Submitted by 青春壹個敷衍的年華 on 2019-12-12 02:38:40
Question: I have been trying to follow this example by Norbert Ryciak, whom I haven't been able to get in touch with. Since the article was written in 2014, some things in R have changed, so I was able to update parts of the code, but I got stuck on the last part. Here is my working code so far: library(tm) library(stringi) library(proxy) wiki <- "https://en.wikipedia.org/wiki/" titles <- c("Integral", "Riemann_integral", "Riemann-Stieltjes_integral", "Derivative", "Limit_of_a

How to apply Information Gain in RapidMiner with a separate test set?

Submitted by 梦想的初衷 on 2019-12-12 01:43:37
Question: I am dealing with text classification in RapidMiner. I have separate test and training splits. I applied Information Gain to a dataset using n-fold cross-validation, but I am confused about how to apply it to a separate test set. An image is attached below. In the figure I have connected the word list output from the first "Process Documents From Files" (used for training) to the second "Process Documents From Files" (used for testing), but I want to apply the reduced feature set to the second "Process

Improving the prediction score by use of confidence level of classifiers on instances

Submitted by ≡放荡痞女 on 2019-12-12 00:48:09
Question: I am using three classifiers ( RandomForestClassifier , KNeighborsClassifier , and an SVM classifier ), which you can see below: >> svm_clf_sl_GS SVC(C=5, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=True, random_state=41, shrinking=True, tol=0.001, verbose=False) >> knn_clf_sl_GS KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=3,
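One standard way to use each classifier's per-instance confidence is soft voting: average the three models' predict_proba outputs so a more confident model pulls the combined prediction toward its class. A sketch on the iris dataset (the question's grid-searched hyperparameters are omitted; `random_state=41` is carried over from the reprs above, the rest are illustrative defaults):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=41)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        # probability=True is required so SVC exposes predict_proba
        ("svm", SVC(probability=True, random_state=41)),
    ],
    voting="soft",  # average predicted probabilities instead of hard labels
)
vote.fit(X, y)

proba = vote.predict_proba(X[:1])
print(proba.shape)  # (1, 3): averaged class probabilities for one instance
```

The `weights=` parameter of VotingClassifier additionally lets you up-weight whichever model is more trustworthy overall, which is a common next step when plain averaging does not improve the score.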

Not able to load keras trained model

Submitted by 余生长醉 on 2019-12-11 19:45:12
Question: I am using the following code to train a HAN network. Code Link I have trained the model successfully, but when I tried to load it using Keras load_model it gives me the following error: Unknown layer: AttentionWithContext Answer 1: Add the following function in the AttentionWithContext.py file: def create_custom_objects(): instance_holder = {"instance": None} class ClassWrapper(AttentionWithContext): def __init__(self, *args, **kwargs): instance_holder["instance"] = self super(ClassWrapper, self)._
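The answer's pattern can be sketched without Keras installed: a dummy AttentionWithContext class stands in for the real custom layer, and create_custom_objects returns the dict you would pass as `custom_objects` to `load_model` so the deserializer can resolve the unknown layer name. The `units` parameter and the dummy class body are illustrative, not from the linked code.

```python
class AttentionWithContext:
    """Stand-in for the real Keras layer, so this sketch runs anywhere."""
    def __init__(self, units=64):
        self.units = units

def create_custom_objects():
    instance_holder = {"instance": None}

    class ClassWrapper(AttentionWithContext):
        def __init__(self, *args, **kwargs):
            # Remember the layer instance built during deserialization.
            instance_holder["instance"] = self
            super().__init__(*args, **kwargs)

    # Keras looks up unknown layer names in this dict while loading,
    # so both names map to the wrapper.
    return {"ClassWrapper": ClassWrapper,
            "AttentionWithContext": ClassWrapper}

custom_objects = create_custom_objects()
layer = custom_objects["AttentionWithContext"](units=32)
print(layer.units)  # 32
```

With the real layer class imported, the call site would be roughly `load_model("han_model.h5", custom_objects=create_custom_objects())` (the file name here is hypothetical).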

How to use a bigrams + trigrams + word-marks vocabulary in CountVectorizer?

Submitted by 流过昼夜 on 2019-12-11 15:17:10
Question: I'm using text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of: bigrams + trigrams + word-marks vocabulary. By word marks he means words that are specific to a certain dialect. How can I tweak those parameters in CountVectorizer? word marks: these are examples of word marks, but they aren't what I have, because mine are Arabic, so I translated them. word_marks=['love', 'funny', 'happy',
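One hedged way to combine the two feature sources is a FeatureUnion of two CountVectorizers: one with `ngram_range=(2, 3)` for bigrams plus trigrams, and one whose `vocabulary` is restricted to the word-marks list, so only those dialect-specific words are counted. The example documents below are invented; the word_marks list reuses the question's translated examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

word_marks = ["love", "funny", "happy"]

vectorizer = FeatureUnion([
    # bigrams and trigrams only (no unigrams)
    ("ngrams", CountVectorizer(ngram_range=(2, 3))),
    # counts restricted to the fixed word-marks vocabulary
    ("marks", CountVectorizer(vocabulary=word_marks)),
])

X = vectorizer.fit_transform(["happy love story", "funny sad story"])
print(X.shape)  # (2, 9): 6 n-gram columns + 3 word-mark columns
```

The combined matrix can then be fed straight into MultinomialNB, and for Arabic text the original (untranslated) word marks would go in the `vocabulary` list.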

Effective classification of natural text in scikit-learn/Python

Submitted by 不羁的心 on 2019-12-11 10:23:31
Question: I want my classification algorithm to assign my natural-language raw data to one of a set of categories if and only if it meets a certain confidence threshold for that category (say 80%); otherwise I want the classifier to assign that raw text to an 'unclassified' category. How do I do this? My example data set:

+----------------------+------------+
| Details              | Category   |
+----------------------+------------+
| Any raw text1        | cat1       |
+-------------------
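A common way to get this behaviour is to threshold predict_proba: keep the top class only when its probability clears the cutoff, otherwise return 'unclassified'. A sketch on invented data (the texts, labels, and helper name are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["cheap flights and hotels", "football match results",
         "flight deals today", "league scores tonight"]
labels = ["travel", "sport", "travel", "sport"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def predict_with_threshold(model, texts, threshold=0.8):
    """Return the top class per text, or 'unclassified' below the threshold."""
    proba = model.predict_proba(texts)
    best = proba.argmax(axis=1)
    return [model.classes_[i] if p[i] >= threshold else "unclassified"
            for i, p in zip(best, proba)]

# Text sharing no vocabulary with the training set stays near the prior,
# so it falls below the 0.8 cutoff.
print(predict_with_threshold(model, ["random gibberish text"]))
```

Note that predict_proba scores are not calibrated accuracy; if the 80% is meant literally, CalibratedClassifierCV can be wrapped around the estimator before thresholding.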

Improve the flow of a Python classifier and combine features

Submitted by 本秂侑毒 on 2019-12-11 04:42:01
Question: I am trying to create a classifier to categorize websites. I am doing this for the very first time, so it's all quite new to me. Currently I am doing some bag-of-words on a few parts of each web page (e.g. title, text, headings). It looks like this: from sklearn.feature_extraction.text import CountVectorizer countvect_text = CountVectorizer(encoding="cp1252", stop_words="english") countvect_title = CountVectorizer(encoding="cp1252", stop_words="english") countvect_headings =
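The usual way to combine per-field bags of words is to fit one vectorizer per field and stack the resulting matrices side by side. A sketch with two of the question's fields and invented documents (the `encoding` option is dropped since the strings here are already decoded):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Cheap flights", "Football scores"]
texts = ["book flights and hotels online", "latest league match results"]

countvect_title = CountVectorizer(stop_words="english")
countvect_text = CountVectorizer(stop_words="english")

# Fit each vectorizer on its own field, then concatenate the feature
# columns so each row describes one page across all fields.
X = hstack([countvect_title.fit_transform(titles),
            countvect_text.fit_transform(texts)])
print(X.shape)  # (2, 12): 4 title columns + 8 text columns
```

The stacked matrix goes straight into any scikit-learn classifier; for a tidier pipeline, ColumnTransformer on a DataFrame with one column per field achieves the same result and also handles transform-time alignment for new pages.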