Issue with OneHotEncoder for categorical features

Anonymous (unverified), submitted on 2019-12-03 02:03:01

Question:

I want to encode 3 categorical features out of the 10 features in my dataset. I use sklearn.preprocessing to do so, as follows:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

However, I cannot proceed because I am getting this error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float: PG

I am surprised that it is complaining about the string, since it is supposed to convert it! Am I missing something here?

Answer 1:

If you read the docs for OneHotEncoder you'll see that the input for fit is an "input array of type int", so you need two steps to one-hot encode your data:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    new_cat_features = enc.transform(cat_features)
    print(new_cat_features)  # [1 2 0]
    new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
    ohe = preprocessing.OneHotEncoder(sparse=False)  # Easier to read
    print(ohe.fit_transform(new_cat_features))

Output:

    [[ 0.  1.  0.]
     [ 0.  0.  1.]
     [ 1.  0.  0.]]
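
To apply the same two steps to the actual values in a DataFrame rather than to the list of column names, you would run LabelEncoder once per column before one-hot encoding. A minimal sketch, assuming a hypothetical dataset DataFrame with made-up values and an older scikit-learn where OneHotEncoder still takes the sparse flag:

    import pandas as pd
    from sklearn import preprocessing

    # Hypothetical data, for illustration only.
    dataset = pd.DataFrame({
        'color': ['Color', 'Black and White', 'Color'],
        'director_name': ['A', 'B', 'A'],
        'actor_2_name': ['X', 'Y', 'Z'],
    })
    cat_features = ['color', 'director_name', 'actor_2_name']

    # Step 1: label-encode each categorical column in place.
    for col in cat_features:
        dataset[col] = preprocessing.LabelEncoder().fit_transform(dataset[col])

    # Step 2: one-hot encode the now-integer columns.
    ohe = preprocessing.OneHotEncoder(sparse=False)
    print(ohe.fit_transform(dataset[cat_features].values))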


Answer 2:

You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

    from sklearn.preprocessing import LabelBinarizer

    cat_features = ['color', 'director_name', 'actor_2_name']
    encoder = LabelBinarizer()
    new_cat_features = encoder.fit_transform(cat_features)
    new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
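
For illustration, a short sketch of the sparse variant; the input list here is made up:

    from sklearn.preprocessing import LabelBinarizer

    # Hypothetical example data.
    colors = ['red', 'green', 'blue', 'green']

    encoder = LabelBinarizer(sparse_output=True)
    binarized = encoder.fit_transform(colors)
    print(type(binarized))      # a scipy.sparse matrix rather than a dense ndarray
    print(binarized.toarray())  # densify just to inspect the result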

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow



Answer 3:

From the documentation:

    categorical_features : "all" or array of indices or mask
        Specify what features are treated as categorical.
        all (default): All features are treated as categorical.
        array of indices: Array of categorical feature indices.
        mask: Array of length n_features and with dtype=bool.

Column names of a pandas DataFrame won't work. If your categorical features are column numbers 0, 2 and 6, use:

    from sklearn import preprocessing

    cat_features = [0, 2, 6]
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

It must also be noted that if these categorical features are not already label encoded, you need to run LabelEncoder on them before using OneHotEncoder, as sketched below.
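
Putting both points together, a minimal sketch; the DataFrame, its values, and the column positions are made up, and categorical_features assumes an older scikit-learn release where that parameter still exists:

    import pandas as pd
    from sklearn import preprocessing

    # Hypothetical dataset: columns 0 and 2 hold categorical strings.
    dataset = pd.DataFrame({
        'color': ['Color', 'Black and White', 'Color'],  # column 0
        'budget': [1.0, 2.0, 3.0],                       # column 1
        'rating': ['PG', 'R', 'PG'],                     # column 2
    })

    # Label-encode the string columns so everything is numeric.
    for col in ['color', 'rating']:
        dataset[col] = preprocessing.LabelEncoder().fit_transform(dataset[col])

    # One-hot encode only the categorical columns, selected by index.
    enc = preprocessing.OneHotEncoder(categorical_features=[0, 2])
    encoded = enc.fit_transform(dataset.values)
    print(encoded.toarray())  # fit_transform returns a sparse matrix by default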



Answer 4:

If the dataset is in a pandas DataFrame, using

pandas.get_dummies

will be more straightforward.
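
For example, with a made-up DataFrame and column names:

    import pandas as pd

    # Hypothetical DataFrame with one categorical column.
    df = pd.DataFrame({'color': ['Color', 'Black and White', 'Color'],
                       'budget': [1.0, 2.0, 3.0]})

    # get_dummies one-hot encodes the object/categorical columns
    # and leaves the numeric ones untouched.
    print(pd.get_dummies(df, columns=['color']))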




Answer 5:

@Medo,

I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is:

    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
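
A minimal sketch reproducing the failure, using a made-up DataFrame and an older scikit-learn where categorical_features is still available:

    import pandas as pd
    from sklearn import preprocessing

    # Any string that cannot be parsed as a float triggers the error,
    # even in a column that was listed in categorical_features.
    df = pd.DataFrame({'budget': [1.0, 2.0], 'rating': ['PG', 'R']})

    enc = preprocessing.OneHotEncoder(categorical_features=[1])
    enc.fit(df.values)  # ValueError: could not convert string to float: PG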

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.


