I am implementing classifiers using several different machine learning algorithms.
I'm classifying text files, and my code looks like this:
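from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)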
When I use the algorithm GaussianNB the following error occurs:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I saw the following post here
In that post, a class is created to perform the transformation of the data. Is it possible to adapt my code, which uses TfidfTransformer, in the same way? How can I fix this?
You can do the following:
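from sklearn.base import TransformerMixin

class DenseTransformer(TransformerMixin):
    # Stateless step: nothing is learned in fit.
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray, as the error message suggests;
        # todense() would return an np.matrix instead.
        return X.toarray()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)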
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
Now, as part of your pipeline, the data will be transformed to a dense representation before it reaches GaussianNB.
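For anyone who wants to try this end to end, here is a minimal sketch of the same idea on made-up data. The toy documents and labels are hypothetical and only there so the snippet runs on its own. Inheriting from BaseEstimator and checking sparse.issparse are my own additions, not something the question requires.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a SciPy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # Pass dense input through unchanged so the step is safe to reuse.
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Hypothetical toy data, for illustration only.
X_train = [
    "cheap pills buy now",
    "meeting at noon",
    "buy cheap watches",
    "lunch meeting tomorrow",
]
Y = ["spam", "ham", "spam", "ham"]
X_test = ["buy pills", "noon lunch"]

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("to_dense", DenseTransformer()),
    ("clf", OneVsRestClassifier(GaussianNB())),
])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

# Check what the dense step hands to the classifier.
counts = classifier.named_steps["vectorizer"].transform(X_test)
dense = DenseTransformer().transform(counts)
print(type(counts), type(dense))
```

Keep in mind that a dense array stores every zero explicitly, so with a large vocabulary and many documents this step can use a lot of memory.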
BTW, I don't know your constraints, but maybe you can use another classifier, such as RandomForestClassifier or an SVM, that does accept data in a sparse representation.
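As a rough sketch of that alternative, assuming LinearSVC is an acceptable stand-in for "SVM" in your case, the pipeline needs no dense step at all. The toy data is again hypothetical.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, for illustration only.
X_train = [
    "cheap pills buy now",
    "meeting at noon",
    "buy cheap watches",
    "lunch meeting tomorrow",
]
Y = ["spam", "ham", "spam", "ham"]
X_test = ["buy pills", "noon lunch"]

sparse_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),  # accepts sparse input directly
])
sparse_classifier.fit(X_train, Y)
predicted = sparse_classifier.predict(X_test)

# The features that reach the classifier are still a SciPy sparse matrix.
steps = sparse_classifier.named_steps
features = steps["tfidf"].transform(steps["vectorizer"].transform(X_test))
print(sparse.issparse(features))
```

Whether this is a better fit than GaussianNB depends on your data, so treat it as something to compare against rather than a drop-in replacement.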
Source: https://stackoverflow.com/questions/31228303/scikit-learns-pipeline-error-with-multilabel-classification-a-sparse-matrix-w