TypeError: a float is required in sklearn.feature_extraction.FeatureHasher

Question

I'm using sklearn version 0.16.1. It seems that FeatureHasher doesn't support strings (as DictVectorizer does). For example:

from sklearn.feature_extraction import FeatureHasher

values = [
          {'city': 'Dubai', 'temperature': 33.},
          {'city': 'London', 'temperature': 12.},
          {'city': 'San Fransisco', 'temperature': 18.}
          ]

print("Starting FeatureHasher ...")
hasher = FeatureHasher(n_features=2)
X = hasher.transform(values).toarray()
print(X)

But the following error is received:

    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn\feature_extraction\_hashing.c:1762)
TypeError: a float is required

I can't use DictVectorizer because my dataset is very large and the features have high cardinality, so I get a MemoryError. Any suggestions?

Update (October 2016):

As NirIzr commented, this is now supported: the sklearn dev team addressed this issue in https://github.com/scikit-learn/scikit-learn/pull/6173

FeatureHasher should properly handle string dictionary values as of version 0.18.

Answer 1:

Your best bet for non-numeric features is to transform the keys yourself, similar to how DictVectorizer does it.

values = [
      {'city_Dubai':1., 'temperature': 33.},
      {'city_London':1., 'temperature': 12.},
      {'city_San Fransisco':1., 'temperature': 18.}
      ]

You could do this with a Python function:

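def transform_features(orig_dict):
    transformed_dict = dict()
    # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for name, value in orig_dict.items():
        if isinstance(value, str):
            # Fold the string value into the feature name and use 1.0 as the value
            name = "%s_%s" % (name, value)
            value = 1.
        transformed_dict[name] = value
    return transformed_dict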

Example usage:

transform_features({'city': 'Dubai', 'temperature': 33.})
# Returns {'city_Dubai': 1.0, 'temperature': 33.0}
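
Since the question mentions a very large dataset, it may help to apply the transformation lazily rather than building a second list in memory. The following is a minimal sketch, not a tested recipe for any particular sklearn version: iter_transformed and hash_samples are hypothetical helper names, n_features=2 ** 18 is an arbitrary choice, and it relies on FeatureHasher.transform accepting any iterable of dicts.

```python
def transform_features(orig_dict):
    """Rename string-valued features: {'city': 'Dubai'} -> {'city_Dubai': 1.0}."""
    transformed = {}
    for name, value in orig_dict.items():
        if isinstance(value, str):
            name = "%s_%s" % (name, value)
            value = 1.0
        transformed[name] = value
    return transformed


def iter_transformed(samples):
    """Yield transformed samples one at a time instead of building a list."""
    for sample in samples:
        yield transform_features(sample)


def hash_samples(samples, n_features=2 ** 18):
    """Hash the transformed samples into a sparse matrix (needs scikit-learn)."""
    # Imported here so the helpers above stay usable without scikit-learn.
    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=n_features)
    return hasher.transform(iter_transformed(samples))


values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

# Materialized here only to inspect the result; hash_samples consumes the generator directly.
transformed = list(iter_transformed(values))
```

Calling hash_samples(values) would then return a scipy sparse matrix with one row per sample, and the string values never reach the hasher.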

Answer 2:

This is now supported: the sklearn dev team addressed this issue in https://github.com/scikit-learn/scikit-learn/pull/6173

FeatureHasher should properly handle string dictionary values as of version 0.18.

Keep in mind there are still differences between FeatureHasher and DictVectorizer. Namely, DictVectorizer still handles None values (although I'm curious how), while FeatureHasher explicitly complains about it with the same error OP experienced.

If you're still getting "TypeError: a float is required" with sklearn version >= 0.18, it is probably due to this issue: one of your values is None.

There's no easy way to debug this. I ended up modifying sklearn's code to catch the TypeError exception and print the last item provided, by editing the _iteritems() function at the top of sklearn/feature_extraction/hashing.py.
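
As an alternative to patching sklearn, the input can be scanned for None values before it reaches FeatureHasher. This is a minimal sketch that assumes the samples are plain dicts; find_none_values is a hypothetical helper, not part of sklearn, and the sample data is made up for illustration.

```python
def find_none_values(samples):
    """Yield (sample_index, feature_name) for every feature whose value is None."""
    for index, sample in enumerate(samples):
        for name, value in sample.items():
            if value is None:
                yield index, name


# Illustrative data: the second sample has a missing city.
values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': None, 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

problems = list(find_none_values(values))
for index, name in problems:
    print("sample %d has None for feature %r" % (index, name))
```

Once the offending entries are located, they can be dropped or replaced with a placeholder before hashing. Which of the two is appropriate depends on the data.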

Answer 3:

This is a known sklearn issue: FeatureHasher does not currently support string values for its dict input format.

https://github.com/scikit-learn/scikit-learn/issues/4878

Source: https://stackoverflow.com/questions/33982717/typeerror-a-float-is-required-in-sklearn-feature-extraction-featurehasher

Tags

python

hash

scikit-learn

feature-extraction