Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

99封情书 提交于 2019-12-06 11:47:47

问题


I am implementing different classifiers using different machine learning algorithms.

I'm sorting text files, and do as follows:

classifier = Pipeline([
('vectorizer', CountVectorizer ()),
('TFIDF', TfidfTransformer ()),
('clf', OneVsRestClassifier (GaussianNB()))])
classifier.fit(X_train,Y)
predicted = classifier.predict(X_test)

When I use the algorithm GaussianNB the following error occurs:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray () to convert to a dense numpy array.

I saw the following post here

In this post a class is created to perform the transformation of the data. It is possible to adapt my code with TfidfTransformer. How I can fix this?


回答1:


You can do the following:

class DenseTransformer(TransformerMixin):
    def transform(self, X, y=None, **fit_params):
        return X.todense()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

classifier = Pipeline([
('vectorizer', CountVectorizer ()),
('TFIDF', TfidfTransformer ()),
('to_dense', DenseTransformer()), 
('clf', OneVsRestClassifier (GaussianNB()))])
classifier.fit(X_train,Y)
predicted = classifier.predict(X_test)

Now, as a part of your pipeline, the data will be transform to dense representation.

BTW, I don't know your constraints, but maybe you can use another classifier, such as RandomForestClassifier or SVM that DO accept data in sparse representation.



来源:https://stackoverflow.com/questions/31228303/scikit-learns-pipeline-error-with-multilabel-classification-a-sparse-matrix-w

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!