TypeError: a float is required in sklearn.feature_extraction.FeatureHasher

天涯浪子 提交于 2019-12-11 09:55:15

问题


I'm using sklearn version 0.16.1. It seems that FeatureHasher doesn't support strings (as DictVectorizer does). For example:

values = [
          {'city': 'Dubai', 'temperature': 33.},
          {'city': 'London', 'temperature': 12.},
          {'city': 'San Fransisco', 'temperature': 18.}
          ]

print("Starting FeatureHasher ...")
hasher = FeatureHasher(n_features=2)
X = hasher.transform(values).toarray()
print X

But the following error is received:

    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn\feature_extraction\_hashing.c:1762)
TypeError: a float is required

I can't use DictVectorizer since my dataset is very big and the features are with high cardinality so I get a MemoryError. Any suggestions?

Update (October 2016):

As NirIzr commented, this is now supported, as sklearn dev team addressed this issue in https://github.com/scikit-learn/scikit-learn/pull/6173

FeatureHasher should properly handle string dictionary values as of version 0.18.


回答1:


Your best bet for non-numeric features is to transform the keys yourself similar to how DictVectorizer does.

values = [
      {'city_Dubai':1., 'temperature': 33.},
      {'city_London':1., 'temperature': 12.},
      {'city_San Fransisco':1., 'temperature': 18.}
      ]

You could do this with a python function.

def transform_features(orig_dict):
    transformed_dict = dict()
    for name, value in orig_dict.iteritems():
        if isinstance(value , str):
            name = "%s_%s" % (name,value)
            value = 1.
        transformed_dict[name] = value
    return transformed_dict

Example usage:

transform_features({'city_Dubai':1., 'temperature': 33.})
# Returns {'city_Dubai': 1.0, 'temperature': 33.0}



回答2:


This is now supported, as sklearn dev team addressed this issue in https://github.com/scikit-learn/scikit-learn/pull/6173

FeatureHasher should properly handle string dictionary values as of version 0.18.

Keep in mind there are still differences between FeatureHasher and DictVectorizer. Namely, DictVectorizer still handles None values (although I'm curious how), while FeatureHasher explicitly complains about it with the same error OP experienced.

If you're still experiencing the "TypeError: a float is required" with sklearn version >= 0.18, it is probably due to this issue, and you have a None value.

There's no easy way to debug this, and I ended up modifying sklearn's code to catch the TypeError exception and print the last item provided. I did that by editing the _iteritems() function at the top of sklearn/feature_extraction/hashing.py




回答3:


It is a known sklearn issue: FeatureHasher does not currently support string values for its dict input format

https://github.com/scikit-learn/scikit-learn/issues/4878



来源:https://stackoverflow.com/questions/33982717/typeerror-a-float-is-required-in-sklearn-feature-extraction-featurehasher

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!