Question
I'm using sklearn version 0.16.1. It seems that FeatureHasher doesn't support strings (as DictVectorizer does). For example:
from sklearn.feature_extraction import FeatureHasher

values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.}
]
print("Starting FeatureHasher ...")
hasher = FeatureHasher(n_features=2)
X = hasher.transform(values).toarray()
print(X)
But the following error is received:
_hashing.transform(raw_X, self.n_features, self.dtype)
File "_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn\feature_extraction\_hashing.c:1762)
TypeError: a float is required
I can't use DictVectorizer because my dataset is very large and the features have high cardinality, so I get a MemoryError. Any suggestions?
Update (October 2016):
As NirIzr commented, this is now supported, as sklearn dev team addressed this issue in https://github.com/scikit-learn/scikit-learn/pull/6173
FeatureHasher should properly handle string dictionary values as of version 0.18.
Answer 1:
Your best bet for non-numeric features is to transform the keys yourself, similar to what DictVectorizer does: fold each string value into the feature name and set the value to 1.
values = [
    {'city_Dubai': 1., 'temperature': 33.},
    {'city_London': 1., 'temperature': 12.},
    {'city_San Fransisco': 1., 'temperature': 18.}
]
You could do this with a Python function.
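def transform_features(orig_dict):
    transformed_dict = dict()
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for name, value in orig_dict.items():
        if isinstance(value, str):
            # Fold the string value into the feature name, e.g. 'city_Dubai'
            name = "%s_%s" % (name, value)
            value = 1.
        transformed_dict[name] = value
    return transformed_dict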
Example usage:
transform_features({'city': 'Dubai', 'temperature': 33.})
# Returns {'city_Dubai': 1.0, 'temperature': 33.0}
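To connect this back to the question, here is a minimal sketch of feeding the transformed dicts into FeatureHasher. It repeats the helper so that it runs on its own. The choice of n_features=8 is arbitrary and only for illustration; a real high-cardinality dataset would need a much larger value to keep hash collisions rare.

```python
from sklearn.feature_extraction import FeatureHasher


def transform_features(orig_dict):
    """Turn string values into one-hot style keys such as 'city_Dubai'."""
    transformed_dict = dict()
    for name, value in orig_dict.items():
        if isinstance(value, str):
            name = "%s_%s" % (name, value)
            value = 1.
        transformed_dict[name] = value
    return transformed_dict


values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

# n_features=8 is an arbitrary illustrative size, not a recommendation.
hasher = FeatureHasher(n_features=8)
X = hasher.transform([transform_features(d) for d in values])

# X is a scipy sparse matrix with one row per input dict.
print(X.shape)
```

Because every value reaching the hasher is now a float, this avoids the TypeError on versions that do not accept string values, and the result stays sparse, so it should not hit the memory problem seen with DictVectorizer.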
Answer 2:
This is now supported, as sklearn dev team addressed this issue in https://github.com/scikit-learn/scikit-learn/pull/6173
FeatureHasher should properly handle string dictionary values as of version 0.18.
Keep in mind that there are still differences between FeatureHasher and DictVectorizer. Namely, DictVectorizer still handles None values (although I'm curious how), while FeatureHasher explicitly rejects them with the same error the OP experienced.
If you still get "TypeError: a float is required" with sklearn version >= 0.18, it is probably due to this issue: one of your values is None.
There's no easy way to debug this. I ended up modifying sklearn's code to catch the TypeError exception and print the last item provided.
I did that by editing the _iteritems() function at the top of sklearn/feature_extraction/hashing.py
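A possible alternative to patching the library is to scan the input before hashing. The sketch below is not part of sklearn: find_bad_values is a hypothetical helper, and it assumes the input is an iterable of plain dicts and that only numbers and strings are acceptable values, which matches the behavior described above.

```python
import numbers


def find_bad_values(samples):
    """Yield (sample_index, key, value) for values that are neither numbers nor strings."""
    for index, sample in enumerate(samples):
        for key, value in sample.items():
            # Assumption: anything else (None in particular) would make the hasher fail.
            if not isinstance(value, (numbers.Number, str)):
                yield index, key, value


values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': None, 'temperature': 12.},
]

for index, key, value in find_bad_values(values):
    print("sample %d: key %r has unsupported value %r" % (index, key, value))
```

This costs one extra pass over the data, but it leaves the installed sklearn untouched and reports every offending entry rather than only the first one that fails.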
Answer 3:
This is a known sklearn issue: FeatureHasher does not currently support string values for its dict input format.
https://github.com/scikit-learn/scikit-learn/issues/4878
Source: https://stackoverflow.com/questions/33982717/typeerror-a-float-is-required-in-sklearn-feature-extraction-featurehasher