Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

Question

I am implementing several classifiers, each using a different machine learning algorithm.

I'm classifying text files, and my code looks like this:

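from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)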

When I use the GaussianNB algorithm, the following error occurs:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I saw a related post in which a class is created to perform the transformation of the data. Is it possible to adapt my code, which uses TfidfTransformer, in the same way? How can I fix this?

Answer 1:

You can do the following:

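from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(TransformerMixin, BaseEstimator):
    """Convert a scipy sparse matrix to a dense NumPy array."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless.
        return self

    def transform(self, X):
        # toarray() returns an ndarray. todense() returns np.matrix,
        # which newer scikit-learn releases no longer accept.
        return X.toarray()

    # fit_transform() is inherited from TransformerMixin.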

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

Now, as part of your pipeline, the data will be transformed to a dense representation before it reaches the classifier.
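If you would rather not write a class, the same step can probably be expressed with scikit-learn's FunctionTransformer. This is only a minimal sketch: the helper name to_dense, the toy documents, and the labels are all made up for illustration, and the label lists are turned into an indicator matrix with MultiLabelBinarizer, which may differ from how your Y is built.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer


def to_dense(X):
    # Hypothetical helper: convert a scipy sparse matrix to a NumPy array.
    return X.toarray()


# Toy data, invented for illustration only.
train_docs = [
    "the cat chased the mouse",
    "stocks fell as the market closed",
    "the dog and the cat played",
    "the pet food market is growing",
]
train_labels = [["animals"], ["finance"], ["animals"], ["animals", "finance"]]
test_docs = ["the mouse ran from the cat", "the market opened higher"]

# Turn the label lists into a binary indicator matrix (one column per label).
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(train_labels)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # Stateless densifying step, placed right before GaussianNB.
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("clf", OneVsRestClassifier(GaussianNB())),
])
classifier.fit(train_docs, Y)
predicted = classifier.predict(test_docs)

# Map the predicted indicator rows back to tuples of label names.
print(binarizer.inverse_transform(predicted))
```

A module-level function is used instead of a lambda so the pipeline can still be pickled. Keep in mind that densifying bag-of-words features can take a lot of memory when the vocabulary is large.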

By the way, I don't know your constraints, but you may be able to use another classifier that accepts sparse input directly, such as RandomForestClassifier or an SVM. That way you can skip the dense conversion entirely.
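As a rough sketch of that alternative, here is the same kind of pipeline with LinearSVC (one of scikit-learn's SVM implementations) in place of GaussianNB and no densifying step. The documents and labels are again invented for illustration, and LinearSVC is just one possible choice, not necessarily the best one for your data.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
train_docs = [
    "the cat chased the mouse",
    "stocks fell as the market closed",
    "the dog and the cat played",
    "the pet food market is growing",
]
train_labels = [["animals"], ["finance"], ["animals"], ["animals", "finance"]]
test_docs = ["the mouse ran from the cat", "the market opened higher"]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(train_labels)

# No densifying step: LinearSVC works on scipy sparse matrices directly.
classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(train_docs, Y)
predicted = classifier.predict(test_docs)

# The features reaching the classifier are still sparse.
features = classifier[:-1].transform(test_docs)
print(sparse.issparse(features))
print(binarizer.inverse_transform(predicted))
```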

Source: https://stackoverflow.com/questions/31228303/scikit-learns-pipeline-error-with-multilabel-classification-a-sparse-matrix-w

Tags

python

scikit-learn

gaussian

text-classification