Question
I'm using sklearn version 0.16.1. It seems that FeatureHasher doesn't support strings (as DictVectorizer does). For example:
from sklearn.feature_extraction import FeatureHasher

values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.}
]
print("Starting FeatureHasher ...")
hasher = FeatureHasher(n_features=2)
X = hasher.transform(values).toarray()
print(X)
But the following error is received:
_hashing.transform(raw_X, self.n_features, self.dtype)
File "_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn\feature_extraction\_hashing.c:1762)
TypeError: a float is required
I can't use DictVectorizer because my dataset is very large and the features have high cardinality, so I get a MemoryError. Any suggestions?
Update (October 2016):
As NirIzr commented, this is now supported: the sklearn dev team addressed the issue in https://github.com/scikit-learn/scikit-learn/pull/6173
FeatureHasher should properly handle string dictionary values as of version 0.18.
Answer 1:
Your best bet for non-numeric features is to transform the keys yourself, similar to how DictVectorizer does.
values = [
    {'city_Dubai': 1., 'temperature': 33.},
    {'city_London': 1., 'temperature': 12.},
    {'city_San Fransisco': 1., 'temperature': 18.}
]
You could do this with a Python function.
def transform_features(orig_dict):
    transformed_dict = dict()
    # .items() works on both Python 2 and Python 3
    for name, value in orig_dict.items():
        if isinstance(value, str):
            # Fold the string value into the key and use 1.0 as the value
            name = "%s_%s" % (name, value)
            value = 1.
        transformed_dict[name] = value
    return transformed_dict
Example usage:
transform_features({'city': 'Dubai', 'temperature': 33.})
# Returns {'city_Dubai': 1.0, 'temperature': 33.0}
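Since the question mentions a dataset too large for DictVectorizer, it may help to apply the transformation lazily rather than building a second full list of dicts. The following is a minimal sketch under that assumption; `iter_transformed` is a hypothetical helper name, not part of sklearn. FeatureHasher.transform accepts any iterable of samples, so the generator could be passed to it directly.

```python
def transform_features(orig_dict):
    """Turn string values into 'name_value' keys with value 1.0."""
    transformed = {}
    for name, value in orig_dict.items():
        if isinstance(value, str):
            name = "%s_%s" % (name, value)
            value = 1.0
        transformed[name] = value
    return transformed


def iter_transformed(records):
    """Yield transformed records one at a time (hypothetical helper)."""
    for record in records:
        yield transform_features(record)


values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
]

# Materialized here only to inspect the result; for a large dataset,
# pass iter_transformed(values) straight to the hasher instead.
rows = list(iter_transformed(values))
print(rows)
```

Only one transformed dict exists at a time while the generator is being consumed, which keeps the memory overhead of the preprocessing step small.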
Answer 2:
This is now supported: the sklearn dev team addressed the issue in https://github.com/scikit-learn/scikit-learn/pull/6173
FeatureHasher should properly handle string dictionary values as of version 0.18.
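As a quick check, the example from the question can be run unchanged on sklearn 0.18 or later. This is a sketch, assuming such a version is installed; the `n_features=8` value is an arbitrary choice for illustration. With dict input, a string value such as 'Dubai' under the key 'city' is hashed as the feature 'city=Dubai' with value 1.

```python
from sklearn.feature_extraction import FeatureHasher

values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

# n_features=8 is arbitrary here; real data usually needs many more
# buckets to keep hash collisions rare.
hasher = FeatureHasher(n_features=8, input_type='dict')
X = hasher.transform(values).toarray()
print(X.shape)  # one row per sample, one column per hash bucket
```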
Keep in mind there are still differences between FeatureHasher and DictVectorizer. Namely, DictVectorizer still handles None values (although I'm curious how), while FeatureHasher explicitly complains about them with the same error the OP experienced.
If you're still experiencing the "TypeError: a float is required" with sklearn version >= 0.18, it is probably due to this issue, and you have a None value.
There's no easy way to debug this, and I ended up modifying sklearn's code to catch the TypeError exception and print the last item provided.
I did that by editing the _iteritems() function at the top of sklearn/feature_extraction/hashing.py
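An alternative that leaves sklearn untouched might be to scan the input for offending values before hashing. The sketch below assumes that numbers and strings are the only value types you expect; `find_unsupported_values` is a hypothetical helper, not an sklearn function.

```python
import numbers


def find_unsupported_values(records):
    """Yield (row_index, key, value) for values that are neither numbers nor strings."""
    for i, record in enumerate(records):
        for key, value in record.items():
            if not isinstance(value, (numbers.Number, str)):
                yield i, key, value


values = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': None, 'temperature': 12.},  # the kind of entry that triggers the TypeError
]

problems = list(find_unsupported_values(values))
print(problems)
```

Each reported tuple gives the row index and key of a suspicious value, so the record can be fixed or dropped before it reaches FeatureHasher.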
Answer 3:
It is a known sklearn issue: FeatureHasher does not currently support string values for its dict input format
https://github.com/scikit-learn/scikit-learn/issues/4878
Source: https://stackoverflow.com/questions/33982717/typeerror-a-float-is-required-in-sklearn-feature-extraction-featurehasher