Question
I have a list of colors:
initialColors = [u'black' u'black' u'black' u'white' u'white' u'white' u'powderblue'
u'whitesmoke' u'black' u'cornflowerblue' u'powderblue' u'powderblue'
u'goldenrod' u'white' u'lavender' u'white' u'powderblue' u'powderblue'
u'powderblue' u'powderblue' u'powderblue' u'powderblue' u'powderblue'
u'powderblue' u'white' u'white' u'powderblue' u'white' u'white']
And I have labels for these colors like this:
labels_train = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0 means that a color was chosen by a female, 1 means it was chosen by a male. I want to predict gender from another array of colors.
So, for my initial colors, I turn the names into numerical feature vectors like this:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(initialColors)
features_train = le.transform(initialColors)
After that my features_train looks like:
[0 0 0 5 5 5 4 6 0 1 4 4 2 5 3 5 4 4 4 4 4 4 4 4 5 5 4 5 5]
And finally, I do:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)
But I've got an error:
/Library/Python/2.7/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
Traceback (most recent call last):
File "app.py", line 36, in <module>
clf.fit(features_train, labels_train)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 182, in fit
X, y = check_X_y(X, y)
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 531, in check_X_y
check_consistent_length(X, y)
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 70]
I did:
features_train = features_train.reshape(-1, 1)
labels_train = labels_train.reshape(-1, 1)
clf.fit(features_train, labels_train)
I've got an error:
/Library/Python/2.7/site-packages/sklearn/utils/validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
I also tried:
features_train = features_train.reshape(1, -1)
labels_train = labels_train.reshape(1, -1)
But anyway:
Traceback (most recent call last):
File "app.py", line 36, in <module>
clf.fit(features_train, labels_train)
File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 182, in fit
X, y = check_X_y(X, y)
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 526, in check_X_y
y = column_or_1d(y, warn=True)
File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 562, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1, 29)
My problem is that I don't understand the best way to reshape the data in my case. Can you please help me choose a way to reshape my data?
Answer 1:
Quick answer:
- Do features_train = features_train.reshape(-1, 1);
- Do NOT do labels_train = labels_train.reshape(-1, 1). Leave labels_train as it is.
Some details:
It seems you are confused about why estimators require a 2D data array as input. Your training data X must have shape (n_samples, n_features). So features_train.reshape(-1, 1) is correct in your case, since you have only 1 feature and want numpy to infer how many samples there are. This does solve your first error.
Your target values y must have shape (n_samples,), i.e. a 1D array. When you do labels_train = labels_train.reshape(-1, 1), you convert it to a 2D column vector. That's why you got the second message. Note that it is only a warning: fit() recognized the column vector and converted it back to 1D, so your program continues to run and the result should be correct.
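To make the two shapes concrete, here is a minimal sketch of the working combination: X reshaped to (n_samples, 1) and y left 1D. The color list and labels below are shortened, made-up sample data chosen for illustration, not the asker's full arrays.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Shortened, made-up sample data (an assumption for illustration only).
colors = np.array(["black", "black", "white", "powderblue", "white", "powderblue"])
labels_train = np.array([0, 0, 0, 1, 1, 1])  # stays 1D: shape (n_samples,)

le = LabelEncoder()
features_train = le.fit_transform(colors)        # 1D: shape (6,)
features_train = features_train.reshape(-1, 1)   # 2D: shape (6, 1), one feature per sample

clf = GaussianNB()
clf.fit(features_train, labels_train)

# Data passed to predict() needs the same (n_samples, n_features) layout.
new_features = le.transform(np.array(["black", "powderblue"])).reshape(-1, 1)
predictions = clf.predict(new_features)

print(features_train.shape, labels_train.shape, predictions.shape)
```

The same reshape(-1, 1) is applied to the array passed to predict(), because the estimator expects the same number of feature columns it was fitted with.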
When you do:
features_train = features_train.reshape(1, -1)
labels_train = labels_train.reshape(1, -1)
First, this is the wrong conversion for features_train in your case, because X.reshape(1, -1) means you have 1 sample and want numpy to infer how many features there are. That is not what you want, but fit() cannot know that and will treat the data as a single sample with 29 features.
That being said, your last error does not come from features_train = features_train.reshape(1, -1). It comes from labels_train = labels_train.reshape(1, -1). Your labels_train now has shape (1, 29), which is neither a 1D array nor a column vector. Though we might know it should be interpreted as a 1D array of target values, fit() is not that smart and doesn't know what to do with it.
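The three label shapes discussed above can be checked with numpy alone. This is a small sketch using a made-up 5-element label array; the variable names are illustrative.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1])   # shape (5,): what fit() expects for y

column = labels.reshape(-1, 1)       # shape (5, 1): column vector, triggers the warning
row = labels.reshape(1, -1)          # shape (1, 5): the "bad input shape" case

# ravel() flattens either 2D form back to the expected 1D shape.
restored_from_column = column.ravel()
restored_from_row = row.ravel()

print(labels.shape, column.shape, row.shape, restored_from_row.shape)
```

If the labels have already been reshaped somewhere upstream, calling ravel() on them before fit() restores the (n_samples,) shape.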
Source: https://stackoverflow.com/questions/44993977/reshape-a-data-for-sklearn