text-classification

NLTK accuracy: “ValueError: too many values to unpack”

Submitted by こ雲淡風輕ζ on 2019-12-01 12:03:26
Question: I'm trying to do sentiment analysis of a new movie from Twitter using the NLTK toolkit. I've followed the NLTK movie_reviews example and built my own CategorizedPlaintextCorpusReader object. The problem arises when I call nltk.classify.util.accuracy(classifier, testfeats). Here is the code:

    import os
    import glob
    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    def word_feats(words):
        return dict([(word, True) for word in words])
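
The usual cause of this ValueError: accuracy() unpacks each test item as a (featureset, label) pair, so passing bare featuresets (or raw word lists) fails at exactly that call. A minimal sketch, assuming the bundled movie_reviews corpus and the standard 750-file split rather than the asker's custom corpus reader:

    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    def word_feats(words):
        return dict([(word, True) for word in words])

    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    # Each item must be a (feature dict, label) 2-tuple --
    # the shape that accuracy() unpacks internally.
    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    trainfeats = negfeats[:750] + posfeats[:750]
    testfeats = negfeats[750:] + posfeats[750:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))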

R: LIME errors about differing feature numbers when the counts actually match

Submitted by 拟墨画扇 on 2019-12-01 10:43:24
Question: I'm building a text classifier of Clinton & Trump tweets (the data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package:

    library(dplyr)
    library(stringr)
    library(readr)      # read_csv
    library(lubridate)  # as_date, hms, hour
    library(quanteda)
    library(lime)

    # data prep
    tweet_csv <- read_csv("tweets.csv")
    tweet_data <- tweet_csv %>%
      select(author = handle, text, retweet_count, favorite_count,
             source_url, timestamp = time) %>%
      mutate(date = as_date(str_sub(timestamp, 1, 10)),
             hour = hour(hms(str_sub(timestamp, 12, 19))),
             tweet_num = row_number())
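
A common fix, sketched under assumptions: this feature-number complaint typically appears when the dfm that lime builds from its permuted texts doesn't share the training dfm's feature set. Passing a preprocess function that rebuilds the dfm and aligns it with dfm_match usually resolves it. train_dfm, train_texts, trained_model, and test_texts below are hypothetical stand-ins for the asker's objects:

    library(quanteda)
    library(lime)

    # Rebuild the exact training representation for any character vector,
    # padding/dropping columns so permuted texts match the model's features.
    get_matrix <- function(texts) {
      dfm_match(dfm(tokens(texts)), features = featnames(train_dfm))
    }

    explainer <- lime(train_texts, model = trained_model, preprocess = get_matrix)
    explanation <- explain(test_texts[1:4], explainer, n_labels = 1, n_features = 6)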

CountVectorizer deleting features that only appear once

Submitted by 风格不统一 on 2019-12-01 08:29:06
Question: I'm using the sklearn Python package, and I am having trouble creating a CountVectorizer with a pre-created dictionary, where the CountVectorizer doesn't delete features that only appear once or don't appear at all. Here is the sample code that I have:

    train_count_vect, training_matrix, train_labels = setup_data(
        train_corpus, query, vocabulary=None)
    test_count_vect, test_matrix, test_labels = setup_data(
        test_corpus, query, vocabulary=train_count_vect.get_feature_names())
    print(len(train_count_vect.get_feature_names()))
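
setup_data is the asker's own helper, so its internals are unknown; the sketch below only illustrates the relevant CountVectorizer behaviour with toy data. With the default min_df=1, nothing that appears once is dropped, and an explicit vocabulary= pins the column set, so even features that never occur survive as all-zero columns:

    from sklearn.feature_extraction.text import CountVectorizer

    train_corpus = ["apple banana", "banana cherry"]   # toy data, an assumption
    test_corpus = ["apple date"]                       # "date" never seen in training

    train_vect = CountVectorizer()                     # min_df=1 keeps singletons
    training_matrix = train_vect.fit_transform(train_corpus)

    # Reuse the training vocabulary so both matrices share the same columns.
    # (get_feature_names_out is get_feature_names in older scikit-learn.)
    test_vect = CountVectorizer(vocabulary=train_vect.get_feature_names_out())
    test_matrix = test_vect.fit_transform(test_corpus)

    print(training_matrix.shape, test_matrix.shape)    # both have 3 columns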

SVM for text classification in R

Submitted by ﹥>﹥吖頭↗ on 2019-11-30 10:33:14
I am using an SVM to classify my text, but instead of the class labels I only get numerical probabilities. Data frame (rows 1:20 are the training set, 21:50 the test set). Updated:

    ou <- structure(list(text = structure(c(1L, 6L, 1L, 1L, 8L, 13L, 24L, 5L,
        11L, 12L, 33L, 36L, 20L, 25L, 4L, 19L, 9L, 29L, 22L, 3L, 8L, 8L, 8L,
        2L, 8L, 27L, 30L, 3L, 14L, 35L, 3L, 34L, 23L, 31L, 22L, 6L, 6L, 7L,
        17L, 3L, 8L, 32L, 18L, 15L, 21L, 26L, 3L, 16L, 10L, 28L),
        .Label = c("access, access, access, access", "character(0)", "report",
        "report, access", "report, access, access",
        "report, access, access, access", "report,
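
Assuming the model comes from the e1071 package (the question doesn't say which SVM implementation is used), a minimal sketch of getting class labels rather than probabilities; dtm_train, dtm_test, and y_train are hypothetical stand-ins for a document-term matrix built from ou$text and the 1:20 / 21:50 split:

    library(e1071)

    # x must be a numeric matrix (e.g., as.matrix() on a document-term matrix).
    fit <- svm(x = dtm_train, y = y_train, probability = TRUE)

    pred <- predict(fit, dtm_test)                       # factor class labels
    pred_p <- predict(fit, dtm_test, probability = TRUE)
    attr(pred_p, "probabilities")                        # per-class probabilities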

Information Gain calculation with Scikit-learn

Submitted by 五迷三道 on 2019-11-30 04:50:59
I am using scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. Information Gain is defined as H(Class) - H(Class | Attribute), where H is the entropy. Using Weka, this can be accomplished with InfoGainAttribute, but I haven't found this measure in scikit-learn. However, it has been suggested that the formula above for Information Gain is the same measure as mutual information; this also matches the definition on Wikipedia. Is it possible to use a specific setting for mutual information in scikit-learn to accomplish this?
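
Since IG(Class, Attribute) = H(Class) - H(Class | Attribute) is exactly the mutual information between the two, mutual_info_classif is the closest scikit-learn counterpart to Weka's InfoGainAttribute. A minimal sketch on a toy corpus (the documents and labels are illustrative); discrete_features=True tells it to treat the sparse counts as discrete instead of using the nearest-neighbour estimator:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import mutual_info_classif

    docs = ["free money now", "meeting at noon", "free offer", "project meeting"]
    y = [1, 0, 1, 0]                      # toy spam/ham labels, an assumption

    vect = CountVectorizer(binary=True)
    X = vect.fit_transform(docs)          # sparse document-term matrix

    ig = mutual_info_classif(X, y, discrete_features=True)
    for token, score in zip(vect.get_feature_names_out(), ig):
        print(f"{token}: {score:.3f}")    # information gain per term, in nats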

Testing the NLTK classifier on specific file

Submitted by 自古美人都是妖i on 2019-11-30 04:01:19
The following code runs a Naive Bayes movie-review classifier and generates a list of the most informative features. Note: the movie_reviews corpus ships with NLTK.

    import string
    from itertools import chain

    from nltk.corpus import movie_reviews, stopwords
    from nltk.probability import FreqDist
    from nltk.classify import NaiveBayesClassifier

    stop = stopwords.words('english')

    documents = [([w for w in movie_reviews.words(i)
                   if w.lower() not in stop and w.lower() not in string.punctuation],
                  i.split('/')[0])
                 for i in movie_reviews.fileids()]

    word_features = FreqDist(chain(*[i for i, j in documents]))
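
Once the classifier is trained (the excerpt cuts off before that step), testing it on one specific file just means building the same feature dict for that file. A minimal sketch, assuming word_features and a trained classifier exist from the rest of the asker's script; the choice of the first negative review is arbitrary:

    def document_features(words):
        words = set(words)
        return {word: (word in words) for word in word_features}

    fileid = movie_reviews.fileids('neg')[0]      # one specific file
    feats = document_features(movie_reviews.words(fileid))
    print(fileid, '->', classifier.classify(feats))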

Sklearn: ROC for multiclass classification

Submitted by 寵の児 on 2019-11-29 13:30:49
Question: I'm doing different text classification experiments, and now I need to calculate the AUC-ROC for each task. For the binary classifications, I already made it work with this code:

    from sklearn import linear_model
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    scaler = StandardScaler(with_mean=False)
    enc = LabelEncoder()
    y = enc.fit_transform(labels)
    feat_sel = SelectKBest(mutual_info_classif, k=200)
    clf = linear_model.LogisticRegression()
    pipe = Pipeline([('vectorizer', DictVectorizer()),
                     ('scaler', StandardScaler(with_mean=False)),
                     ('mutual_info', feat_sel),
                     ('logistregress', clf)])
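
For the multiclass tasks, roc_auc_score accepts the full per-class probability matrix directly (scikit-learn >= 0.22). A minimal sketch, where X_test and y_test are hypothetical stand-ins for the asker's held-out split and pipe is the pipeline above:

    from sklearn.metrics import roc_auc_score

    proba = pipe.predict_proba(X_test)      # shape (n_samples, n_classes)
    auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
    print(f"macro one-vs-rest AUC: {auc:.3f}")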
