How should I vectorize the following list of lists with scikit learn?

Submitted by 橙三吉。 on 2020-01-02 01:25:07

Question


I would like to vectorize a list of lists with scikit-learn. I go to the directory that holds my training texts, read them, and end up with something like this:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(analyzer='word')
vect_representation = vect.fit_transform(corpus)
print(vect_representation.toarray())

And I get the following:

return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

The labels at the end of each document are also a problem: how should I treat them in order to classify correctly?


Answer 1:


For anyone reading this in the future, this solved my problem:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]

from sklearn.feature_extraction.text import CountVectorizer
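# Note: splited_labels_from_corpus is not defined in this answer. The identity
# tokenizer treats each document as an already-tokenized list.
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(splited_labels_from_corpus)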

And this is the output when I call .toarray() on the result:

[[0 0 1]
 [1 0 0]
 [0 1 0]]

Thanks guys
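
The answer above never defines its input variable, so here is a minimal sketch of the same identity-tokenizer idea under one assumption: every document is already a list of tokens, and each list holds a single token (the whole string), as in the question's corpus. The names docs and identity are illustrative, not from the original.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: each document is a list of tokens. Here every document holds
# exactly one token (the whole string), mirroring the corpus in the question.
docs = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]


def identity(doc):
    """Return the document unchanged, since it is already tokenized."""
    return doc


# lowercase=False matters: the default preprocessing calls .lower() on each
# document, which is what raises AttributeError for a list.
vect = CountVectorizer(tokenizer=identity, lowercase=False)
bag_of_words = vect.fit_transform(docs)

# Columns follow the sorted vocabulary, one column per distinct token.
print(bag_of_words.toarray())
```

Because each whole string is a single token here, the matrix only records which document is which. It does not count individual words.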




Answer 2:


First, you should separate the labels from the texts. If you want to use CountVectorizer, you have to transform your texts one by one:

corpus = [["this is spam, 'SPAM'"],["this is ham, 'HAM'"],["this is nothing, 'NOTHING'"]]
from sklearn.feature_extraction.text import CountVectorizer
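# ... split labels from texts
vect = CountVectorizer(analyzer='word')
# Each inner list is passed to fit_transform on its own, so the vocabulary is refit per document
vect_representation = list(map(vect.fit_transform, corpus))
# ...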
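
The "split labels from texts" step is left open above. One possible sketch follows, assuming the label is always the quoted word after the last comma of each string (an assumption based only on the three sample documents). It fits a single vectorizer on all texts so that every row shares the same columns. The names texts, labels and X are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

texts, labels = [], []
for (doc,) in corpus:                  # each inner list holds exactly one string
    text, label = doc.rsplit(",", 1)   # assumption: the label follows the last comma
    texts.append(text.strip())
    labels.append(label.strip(" '"))   # drop surrounding spaces and quotes

vect = CountVectorizer(analyzer="word")
X = vect.fit_transform(texts)          # one shared vocabulary for all documents

# Feature names in column order, read from the fitted vocabulary mapping.
features = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(labels)
print(features)
print(X.toarray())
```

X and labels can then be passed together to a classifier's fit method. Fitting once on the flat list of strings keeps the columns comparable across documents, which calling fit_transform per document does not.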

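As another option, you can use TfidfVectorizer. It shares CountVectorizer's preprocessing, so each document still has to be a plain string (or you have to supply a custom tokenizer) rather than a nested list.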
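
A short sketch of the TfidfVectorizer option, assuming the inner lists are first flattened into plain strings, since TfidfVectorizer applies the same string preprocessing as CountVectorizer. The names flat_texts and X_tfidf are illustrative. The labels are still embedded in the text here and would normally be split off first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]

# Flatten: join the strings of each inner list into one document string.
flat_texts = [" ".join(doc) for doc in corpus]

tfidf = TfidfVectorizer(analyzer="word")
X_tfidf = tfidf.fit_transform(flat_texts)

# One row per document, one column per distinct lowercased word.
print(X_tfidf.shape)

# With the default norm="l2", every row has unit Euclidean length.
row_norms = np.linalg.norm(X_tfidf.toarray(), axis=1)
print(row_norms)
```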



Source: https://stackoverflow.com/questions/27673527/how-should-i-vectorize-the-following-list-of-lists-with-scikit-learn
