Unable to use FeatureUnion in scikit-learn due to different dimensions

Posted by 放肆的年华 on 2019-12-23 07:04:04

Problem


I'm trying to use FeatureUnion to extract different features from a data structure, but it fails because of mismatched dimensions: ValueError: blocks[0,:] has incompatible row dimensions


Implementation

My FeatureUnion is built the following way:

    features = FeatureUnion([
        ('f1', Pipeline([
            ('get', GetItemTransformer('f1')),
            ('transform', vectorizer_f1)
        ])),
        ('f2', Pipeline([
            ('get', GetItemTransformer('f2')),
            ('transform', vectorizer_f1)
        ]))
    ])

GetItemTransformer pulls different parts of the data out of the same structure. The idea is described in the scikit-learn issue tracker.

The structure itself is stored as {'f1': data_f1, 'f2': data_f2}, where data_f1 and data_f2 are lists of different lengths.
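
GetItemTransformer is not shown in the question. A minimal sketch, assuming it does nothing more than return data[key] for the key it was built with (the class body and the sample data are assumptions, not the asker's code):

```python
from sklearn.base import BaseEstimator, TransformerMixin


class GetItemTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical reconstruction: select one field from a dict-like input."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn.
        return self

    def transform(self, X):
        return X[self.key]


# Made-up data in the {'f1': ..., 'f2': ...} layout described above.
data = {'f1': ['a b', 'c d', 'e f'], 'f2': ['g', 'h', 'i']}
f1_only = GetItemTransformer('f1').fit(data).transform(data)
```

Whatever each branch does after the selection step, FeatureUnion stacks the branch outputs side by side, so every branch has to end up with the same number of rows (one per sample).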


Question

I assume the error occurs because the y vector has a different length than the data fields. How can I scale the vector so that it fits both cases?


Answer 1:


Here's what worked for me:

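    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class ArrayCaster(BaseEstimator, TransformerMixin):
        """Turn a 1-D feature of shape (n,) into a column of shape (n, 1)."""

        def fit(self, x, y=None):
            return self

        def transform(self, data):
            # Debug output: shape before and after the cast.
            print(data.shape)
            print(np.transpose(np.matrix(data)).shape)
            return np.transpose(np.matrix(data))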

    FeatureUnion([
        ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('vect', CountVectorizer(ngram_range=(1, 1), binary=True, min_df=3)),
            ('tfidf', TfidfTransformer())
        ])),
        ('other data', Pipeline([
            ('selector', ItemSelector(key='has_foriegn_char')),
            ('caster', ArrayCaster())
        ]))
    ])
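
ItemSelector is not defined in this answer. The sketch below assumes it simply returns data[key], and it swaps np.matrix for np.asarray(...).reshape(-1, 1), which gives the same (n, 1) column as a plain ndarray. The ColumnCaster name, the 'flag' key and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Assumed behaviour: pick one field out of a dict-like input."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


class ColumnCaster(BaseEstimator, TransformerMixin):
    """Hypothetical variant of ArrayCaster: (n,) -> (n, 1) without np.matrix."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X).reshape(-1, 1)


# Toy data: three samples, one text field and one per-sample flag.
data = {
    'text': ['red apple', 'green apple', 'red pear'],
    'flag': np.array([0, 1, 0]),
}

union = FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('vect', CountVectorizer()),
    ])),
    ('flag', Pipeline([
        ('selector', ItemSelector(key='flag')),
        ('caster', ColumnCaster()),
    ])),
])

# Four vocabulary columns from the text branch plus one flag column.
features = union.fit_transform(data)
print(features.shape)
```

Without the caster step, the flag branch would hand FeatureUnion a one-dimensional array, which does not line up row for row with the three-row text matrix.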



Answer 2:


I don't know if this applies to your question, but we ran into the same error in a slightly different situation and just solved it.

Our f1 entries were each lists of 15 numeric values and we needed to do tf-idf on f2. This generated the same error about incompatible row dimensions.

After running it through the debugger, we found that the shapes of our matrices were subtly different going into the hstack() call in FeatureUnion: (2659,) and (2659, 706).

Once we cast f1 to a 2D NumPy array, its shape changed to (2659, 15) and the hstack() call worked.

The cast was something like this: f1 = np.array(list(f1)).
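
To see why the cast helps, here is a small sketch with made-up data (4 samples of 3 values instead of 2659 samples of 15). It assumes f1 started out as a one-dimensional object array whose elements are lists, which is one way to end up with a shape like (2659,); the answer does not say how f1 was actually stored.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in data: 4 samples with 3 numeric values each.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]

# A 1-D object array whose elements are lists (assumed starting point).
f1 = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    f1[i] = row
print(f1.shape)        # (4,)

# The cast from the answer: unpack into a list of lists, then build a 2-D array.
f1_2d = np.array(list(f1))
print(f1_2d.shape)     # (4, 3)

# Stand-in for the tf-idf output of f2: a sparse matrix with one row per sample.
f2_tfidf = sparse.csr_matrix(np.ones((4, 5)))

# With matching row counts the horizontal stack succeeds.
stacked = sparse.hstack([f1_2d, f2_tfidf]).tocsr()
print(stacked.shape)   # (4, 8)
```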



Source: https://stackoverflow.com/questions/25795511/unable-to-use-featureunion-in-scikit-learn-due-to-different-dimensions
