text-classification

NLTK accuracy: “ValueError: too many values to unpack”

Submitted by こ雲淡風輕ζ on 2019-12-01 12:03:26
Question: I'm trying to do sentiment analysis of a new movie from Twitter using the NLTK toolkit. I've followed the NLTK movie_reviews example and built my own CategorizedPlaintextCorpusReader object. The problem arises when I call nltk.classify.util.accuracy(classifier, testfeats). Here is the code:

    import os
    import glob
    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    def word_feats(words):
        return dict([(word, True) for word in words])
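
The usual cause of this ValueError: accuracy() unpacks each test item as a (featureset, label) pair, so passing bare featuresets (or raw word lists) fails at exactly that call. A minimal sketch, assuming the bundled movie_reviews corpus and the standard 750-file split rather than the asker's custom corpus reader:

    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    def word_feats(words):
        return dict([(word, True) for word in words])

    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    # Each item must be a (feature dict, label) 2-tuple --
    # the shape that accuracy() unpacks internally.
    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    trainfeats = negfeats[:750] + posfeats[:750]
    testfeats = negfeats[750:] + posfeats[750:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))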

R: LIME errors about differing feature numbers when the counts actually match

Submitted by 拟墨画扇 on 2019-12-01 10:43:24
Question: I'm building a text classifier of Clinton & Trump tweets (the data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package:

    library(dplyr)
    library(stringr)
    library(readr)      # read_csv
    library(lubridate)  # as_date, hms, hour
    library(quanteda)
    library(lime)

    # data prep
    tweet_csv <- read_csv("tweets.csv")
    tweet_data <- tweet_csv %>%
      select(author = handle, text, retweet_count, favorite_count,
             source_url, timestamp = time) %>%
      mutate(date = as_date(str_sub(timestamp, 1, 10)),
             hour = hour(hms(str_sub(timestamp, 12, 19))),
             tweet_num = row_number())
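
A common fix, sketched under assumptions: this feature-number complaint typically appears when the dfm that lime builds from its permuted texts doesn't share the training dfm's feature set. Passing a preprocess function that rebuilds the dfm and aligns it with dfm_match usually resolves it. train_dfm, train_texts, trained_model, and test_texts below are hypothetical stand-ins for the asker's objects:

    library(quanteda)
    library(lime)

    # Rebuild the exact training representation for any character vector,
    # padding/dropping columns so permuted texts match the model's features.
    get_matrix <- function(texts) {
      dfm_match(dfm(tokens(texts)), features = featnames(train_dfm))
    }

    explainer <- lime(train_texts, model = trained_model, preprocess = get_matrix)
    explanation <- explain(test_texts[1:4], explainer, n_labels = 1, n_features = 6)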

CountVectorizer deleting features that only appear once

Submitted by 风格不统一 on 2019-12-01 08:29:06
Question: I'm using the sklearn Python package, and I am having trouble creating a CountVectorizer with a pre-created dictionary, where the CountVectorizer doesn't delete features that only appear once or don't appear at all. Here is the sample code that I have:

    train_count_vect, training_matrix, train_labels = setup_data(
        train_corpus, query, vocabulary=None)
    test_count_vect, test_matrix, test_labels = setup_data(
        test_corpus, query, vocabulary=train_count_vect.get_feature_names())
    print(len(train_count_vect.get_feature_names()))
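
setup_data is the asker's own helper, so its internals are unknown; the sketch below only illustrates the relevant CountVectorizer behaviour with toy data. With the default min_df=1, nothing that appears once is dropped, and an explicit vocabulary= pins the column set, so even features that never occur survive as all-zero columns:

    from sklearn.feature_extraction.text import CountVectorizer

    train_corpus = ["apple banana", "banana cherry"]   # toy data, an assumption
    test_corpus = ["apple date"]                       # "date" never seen in training

    train_vect = CountVectorizer()                     # min_df=1 keeps singletons
    training_matrix = train_vect.fit_transform(train_corpus)

    # Reuse the training vocabulary so both matrices share the same columns.
    # (get_feature_names_out is get_feature_names in older scikit-learn.)
    test_vect = CountVectorizer(vocabulary=train_vect.get_feature_names_out())
    test_matrix = test_vect.fit_transform(test_corpus)

    print(training_matrix.shape, test_matrix.shape)    # both have 3 columns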

SVM for text classification in R

Submitted by ﹥>﹥吖頭↗ on 2019-11-30 10:33:14
I am using an SVM to classify my text, but instead of the class labels I only get numerical probabilities. Data frame (rows 1:20 are the training set, 21:50 the test set). Updated:

    ou <- structure(list(text = structure(c(1L, 6L, 1L, 1L, 8L, 13L, 24L, 5L,
        11L, 12L, 33L, 36L, 20L, 25L, 4L, 19L, 9L, 29L, 22L, 3L, 8L, 8L, 8L,
        2L, 8L, 27L, 30L, 3L, 14L, 35L, 3L, 34L, 23L, 31L, 22L, 6L, 6L, 7L,
        17L, 3L, 8L, 32L, 18L, 15L, 21L, 26L, 3L, 16L, 10L, 28L),
        .Label = c("access, access, access, access", "character(0)", "report",
        "report, access", "report, access, access",
        "report, access, access, access", "report,
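
Assuming the model comes from the e1071 package (the question doesn't say which SVM implementation is used), a minimal sketch of getting class labels rather than probabilities; dtm_train, dtm_test, and y_train are hypothetical stand-ins for a document-term matrix built from ou$text and the 1:20 / 21:50 split:

    library(e1071)

    # x must be a numeric matrix (e.g., as.matrix() on a document-term matrix).
    fit <- svm(x = dtm_train, y = y_train, probability = TRUE)

    pred <- predict(fit, dtm_test)                       # factor class labels
    pred_p <- predict(fit, dtm_test, probability = TRUE)
    attr(pred_p, "probabilities")                        # per-class probabilities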

Information Gain calculation with Scikit-learn

Submitted by 五迷三道 on 2019-11-30 04:50:59
I am using scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy. Using Weka, this can be accomplished with InfoGainAttribute, but I haven't found this measure in scikit-learn. However, it has been suggested that the formula above for Information Gain is the same measure as mutual information; this also matches the definition on Wikipedia. Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this?
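
Since IG(Class, Attribute) = H(Class) - H(Class | Attribute) is exactly the mutual information between the two, mutual_info_classif is the closest scikit-learn counterpart to Weka's InfoGainAttribute. A minimal sketch on a toy corpus (the documents and labels are illustrative); discrete_features=True tells it to treat the sparse counts as discrete instead of using the nearest-neighbour estimator:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import mutual_info_classif

    docs = ["free money now", "meeting at noon", "free offer", "project meeting"]
    y = [1, 0, 1, 0]                      # toy spam/ham labels, an assumption

    vect = CountVectorizer(binary=True)
    X = vect.fit_transform(docs)          # sparse document-term matrix

    ig = mutual_info_classif(X, y, discrete_features=True)
    for token, score in zip(vect.get_feature_names_out(), ig):
        print(f"{token}: {score:.3f}")    # information gain per term, in nats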

Testing the NLTK classifier on specific file

Submitted by 自古美人都是妖i on 2019-11-30 04:01:19
The following code runs a Naive Bayes movie-review classifier and generates a list of the most informative features. Note: the movie_reviews corpus ships with NLTK.

    import string
    from itertools import chain

    from nltk.corpus import movie_reviews, stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier

    stop = stopwords.words('english')

    documents = [([w for w in movie_reviews.words(i)
                   if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0])
                 for i in movie_reviews.fileids()]

    word_features = FreqDist(chain(*[i for i, j in documents]))
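
Once the classifier is trained (the excerpt cuts off before that step), testing it on one specific file just means building the same feature dict for that file. A minimal sketch, assuming word_features and a trained classifier exist from the rest of the asker's script; the choice of the first negative review is arbitrary:

    def document_features(words):
        words = set(words)
        return {word: (word in words) for word in word_features}

    fileid = movie_reviews.fileids('neg')[0]      # one specific file
    feats = document_features(movie_reviews.words(fileid))
    print(fileid, '->', classifier.classify(feats))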

Sklearn: ROC for multiclass classification

Submitted by 寵の児 on 2019-11-29 13:30:49
Question: I'm doing different text classification experiments, and now I need to calculate the AUC-ROC for each task. For the binary classifications, I already made it work with this code:

    from sklearn import linear_model
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    scaler = StandardScaler(with_mean=False)
    enc = LabelEncoder()
    y = enc.fit_transform(labels)
    feat_sel = SelectKBest(mutual_info_classif, k=200)
    clf = linear_model.LogisticRegression()
    pipe = Pipeline([('vectorizer', DictVectorizer()),
                     ('scaler', StandardScaler(with_mean=False)),
                     ('mutual_info', feat_sel),
                     ('logistregress', clf)])
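
For the multiclass tasks, roc_auc_score accepts the full per-class probability matrix directly (scikit-learn >= 0.22). A minimal sketch, where X_test and y_test are hypothetical stand-ins for the asker's held-out split and pipe is the pipeline above:

    from sklearn.metrics import roc_auc_score

    proba = pipe.predict_proba(X_test)      # shape (n_samples, n_classes)
    auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
    print(f"macro one-vs-rest AUC: {auc:.3f}")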
