Question
I want to get the 10 highest-weighted words for each topic. After applying TfidfTransformer, I get the output shown below, and its type is scipy.sparse.csr.csr_matrix.
But I don't know how to get the top ten from each list. In the data, (0, ****) refers to list (row) 0, and so on up to (5170, *****), which refers to list (row) 5170.
I've tried to convert it to a NumPy array, but that fails.
(0, 19016) 0.024214182003181053
(0, 28002) 0.03661443306612277
(0, 6710) 0.02292100371816788
(0, 27683) 0.013973969726506812
(0, 27104) 0.02236713272585597
(0, 6889) 0.0403281034949193
.
.
.
(5169, 3236) 0.014432449220428715
(5169, 19134) 0.014346823328868169
(5169, 32915) 0.002047199186262409
(5170, 35899) 0.49931779368675605
(5170, 36444) 0.3479717717856863
(5170, 15014) 0.5608169649159123
Answer 1:
You can use TfidfVectorizer instead, which exposes the get_feature_names method (named get_feature_names_out in scikit-learn 1.0 and later). TfidfTransformer doesn't have this method, but the docs clearly state that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. If you don't want to use it, then I think you're going to be stuck building a lookup before you vectorize.
TfidfVectorizer in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
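As a rough illustration of the equivalence mentioned above, the sketch below fits a TfidfVectorizer on a tiny made-up corpus (the `docs` list is purely hypothetical), reads the column-to-word lookup from it, and compares the result with the two-step CountVectorizer + TfidfTransformer pipeline. It assumes default parameters for all three classes, and it falls back to the older get_feature_names name when get_feature_names_out is not available.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, standing in for the asker's documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# One step: tokenize, count, and apply TF-IDF weighting.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Column index -> word lookup. Newer scikit-learn releases use
# get_feature_names_out; older ones only have get_feature_names.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = np.asarray(vectorizer.get_feature_names_out())
else:
    feature_names = np.asarray(vectorizer.get_feature_names())

# Two steps with default settings, for comparison.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# True when both routes produce the same weights.
same = np.allclose(tfidf.toarray(), two_step.toarray())
```

Here column j of `tfidf` corresponds to `feature_names[j]`, which is the lookup needed to turn the column indices in the printed (row, column) pairs back into words.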
Edit: to sort and slice the output of fit_transform from TfidfVectorizer, normal sparse matrix operations should work.
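One possible way to do that sorting and slicing, sketched under a few assumptions: the matrix is (or can be converted to) CSR format, and a sequence of feature names aligned with its columns is available. The helper name `top_terms_per_row` and the toy matrix and vocabulary are made up for this example and are not part of scikit-learn. It walks the CSR arrays row by row, so the full matrix is never converted to a dense array.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_terms_per_row(matrix, feature_names, n=10):
    """Return, for each row, up to n (term, weight) pairs, highest weight first."""
    matrix = csr_matrix(matrix)  # make sure indptr delimits rows
    feature_names = np.asarray(feature_names)
    result = []
    for i in range(matrix.shape[0]):
        # Stored entries of row i live in this slice of indices/data.
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        cols = matrix.indices[start:end]
        weights = matrix.data[start:end]
        # Positions of the n largest weights, in descending order.
        order = np.argsort(weights)[::-1][:n]
        result.append(
            [(str(feature_names[c]), float(w))
             for c, w in zip(cols[order], weights[order])]
        )
    return result


# Hypothetical 2 x 4 matrix and vocabulary, just to show the shape of the output.
toy = csr_matrix(np.array([[0.1, 0.0, 0.7, 0.3],
                           [0.0, 0.5, 0.0, 0.2]]))
names = ["apple", "banana", "cherry", "date"]
top2 = top_terms_per_row(toy, names, n=2)
```

With a fitted TfidfVectorizer, the same helper could be called on the output of fit_transform together with the vectorizer's feature names and n=10. Rows with fewer than n stored entries simply return fewer pairs.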
Source: https://stackoverflow.com/questions/53193422/sklearn-how-to-get-the-10-words-from-each-topic