How can I get the highest-frequency terms out of TF-IDF vectors for each file in scikit-learn?

Submitted by 房东的猫 on 2019-12-12 07:39:31

Question


I am trying to get the highest-frequency terms out of the vectors in scikit-learn. The example below shows how to do this for each category, but I want it for each file inside the categories.

https://github.com/scikit-learn/scikit-learn/blob/master/examples/document_classification_20newsgroups.py

    if opts.print_top10:
        print "top 10 keywords per class:"
        for i, category in enumerate(categories):
            top10 = np.argsort(clf.coef_[i])[-10:]
            print trim("%s: %s" % (
            category, " ".join(feature_names[top10])))

I want to do this for each file in the testing dataset instead of for each category. Where should I be looking?

Thanks

EDIT: s/discrimitive/highest frequency/g (Sorry for the confusion)


Answer 1:


You can combine the result of transform with get_feature_names to obtain the term weights for a given document (TF-IDF weights with a TfidfVectorizer, raw counts with a CountVectorizer).

import numpy as np

X = vectorizer.transform(docs)
terms = np.array(vectorizer.get_feature_names())
# pair every vocabulary term with its value in the first document's row
terms_for_first_doc = list(zip(terms, X.toarray()[0]))
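To go from those (term, value) pairs to the top terms of every document, one option is to sort each row of the sparse matrix by weight. The following is only a minimal sketch: the helper name top_terms_per_doc and the two toy documents are made up for illustration, and it assumes the matrix comes from a fitted TfidfVectorizer. Note that get_feature_names was removed in scikit-learn 1.2 in favour of get_feature_names_out, so the sketch tries the newer name first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms_per_doc(vectorizer, X, n=3):
    """Return, for each row of X, the n highest-weighted (term, weight) pairs."""
    # Newer scikit-learn exposes get_feature_names_out; older releases
    # only have get_feature_names.
    if hasattr(vectorizer, "get_feature_names_out"):
        terms = np.asarray(vectorizer.get_feature_names_out())
    else:
        terms = np.asarray(vectorizer.get_feature_names())

    X = X.tocsr()
    results = []
    for i in range(X.shape[0]):
        # In CSR format, indptr marks where row i starts and ends, so only
        # the non-zero entries of the row are touched (no dense conversion).
        start, end = X.indptr[i], X.indptr[i + 1]
        cols = X.indices[start:end]
        weights = X.data[start:end]
        order = np.argsort(weights)[::-1][:n]  # largest weights first
        results.append([(terms[cols[j]], float(weights[j])) for j in order])
    return results


# Hypothetical toy corpus, just to exercise the helper.
docs = ["apple apple apple banana", "cherry cherry date"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
top = top_terms_per_doc(vectorizer, X, n=1)

for doc_terms in top:
    print(doc_terms)
```

Working on the sparse rows directly avoids the X.toarray() call, which can become expensive when the vocabulary is large.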



Answer 2:


It seems nobody knows. I am answering here because other people face the same problem. I have found where to look, although I have not fully implemented it yet.

It lies deep inside CountVectorizer from sklearn.feature_extraction.text:

def transform(self, raw_documents):
    """Extract token counts out of raw text documents using the vocabulary
    fitted with fit or the one provided in the constructor.

    Parameters
    ----------
    raw_documents: iterable
        an iterable which yields either str, unicode or file objects

    Returns
    -------
    vectors: sparse matrix, [n_samples, n_features]
    """
    if not hasattr(self, 'vocabulary_') or len(self.vocabulary_) == 0:
        raise ValueError("Vocabulary wasn't fitted or is empty!")

    # raw_documents can be an iterable so we don't know its size in
    # advance

    # XXX @larsmans tried to parallelize the following loop with joblib.
    # The result was some 20% slower than the serial version.
    analyze = self.build_analyzer()
    term_counts_per_doc = [Counter(analyze(doc)) for doc in raw_documents]
    # added: keep a copy of the per-document counts (needs `from copy import deepcopy`)
    self.test_term_counts_per_doc = deepcopy(term_counts_per_doc)
    return self._term_count_dicts_to_matrix(term_counts_per_doc)

I added the line self.test_term_counts_per_doc = deepcopy(term_counts_per_doc), which makes the per-document term counts accessible from the vectorizer outside the class, like this:

load_files = recursive_load_files
trainer_path = os.path.realpath(trainer_path)
tester_path = os.path.realpath(tester_path)
data_train = load_files(trainer_path, load_content = True, shuffle = False)


data_test = load_files(tester_path, load_content = True, shuffle = False)
print 'data loaded'

categories = None    # for case categories == None

print "%d documents (training set)" % len(data_train.data)
print "%d documents (testing set)" % len(data_test.data)
#print "%d categories" % len(categories)
print

# split a training set and a test set

print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.7,
                             stop_words='english',charset_error="ignore")

X_train = vectorizer.fit_transform(data_train.data)


print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X_train.shape
print

print "Extracting features from the test dataset using the same vectorizer"
t0 = time()
X_test = vectorizer.transform(data_test.data)
print "Test printing terms per document"
for counter in vectorizer.test_term_counts_per_doc:
    print counter

Here is my fork; I have also submitted a pull request:

https://github.com/v3ss0n/scikit-learn

Please let me know if there is a better way to do this.
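One possible alternative that avoids patching scikit-learn is to build the same Counter objects outside the vectorizer, since build_analyzer is a public method. This is only a sketch with two made-up documents and default vectorizer settings. As in the patched transform above, the counts come straight from the analyzer, so they are not restricted to the fitted vocabulary and are not affected by max_df.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical test documents standing in for data_test.data.
docs = ["the cat sat on the mat", "dogs chase cats and cats run"]

vectorizer = TfidfVectorizer()

# build_analyzer returns the callable that does the preprocessing and
# tokenization; it is the same call the patched transform above relies on.
analyze = vectorizer.build_analyzer()
term_counts_per_doc = [Counter(analyze(doc)) for doc in docs]

for counts in term_counts_per_doc:
    # most_common(3) lists the three most frequent terms of this document
    print(counts.most_common(3))
```

Because the counting happens outside the class, nothing needs to be stored on the vectorizer and no deepcopy is required.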



Source: https://stackoverflow.com/questions/13181409/how-can-i-get-highest-frequency-terms-out-of-td-idf-vectors-for-each-files-in
