Question
I have a training set in a Pandas dataframe, and I pass this dataframe into model.fit() with df.values. Here is some information about the df:
df.values.shape
# (981, 5)
df.values[0]
# array([163, 0.6, 83, 0.52,
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
# 0, 0, 0, 0, 0, 0, 0])], dtype=object)
As you can see, each row in the df has 5 columns: 4 contain numerical values (either int or float), and one contains a one-hot encoded array representing some categorical data. I am creating my Keras model as shown below:
model = keras.Sequential([
    keras.layers.Dense(1024, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(512, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(256, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(128, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(64, activation=tf.nn.relu, kernel_initializer=init_orth, bias_initializer=init_0),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=True)
model.compile(optimizer=opt,
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(df.values, df_labels.values, epochs=10, batch_size=32, verbose=0)
df_labels.values is just a 1D array of 0s and 1s. So I believe I do need a Dense(1) sigmoid layer at the end, as well as 'binary_crossentropy' loss.
This model works very well if I only pass numerical data. But as soon as I introduce one-hot encodings (categorical data), I get this error:
ValueError Traceback (most recent call last)
<ipython-input-91-b5e6232b375f> in <module>
42 #trn_values = df_training_set.values[:,:,len(df_training_set.columns)]
43 #trn_cat = df_trn_wtid.values.reshape(-1, 1)
---> 44 model.fit(df_training_set.values, df_training_labels.values, epochs=10, batch_size=32, verbose=0)
45
46 #test_loss, test_acc = model.evaluate(df_test_set.values, df_test_labels.values)
~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1037 initial_epoch=initial_epoch,
1038 steps_per_epoch=steps_per_epoch,
-> 1039 validation_steps=validation_steps)
1040
1041 def evaluate(self, x=None, y=None,
~\Anaconda3\lib\site-packages\keras\engine\training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
197 ins_batch[i] = ins_batch[i].toarray()
198
--> 199 outs = f(ins_batch)
200 outs = to_list(outs)
201 for l, o in zip(out_labels, outs):
~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):
~\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in _call(self, inputs)
2653 array_vals.append(
2654 np.asarray(value,
-> 2655 dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
2656 if self.feed_dict:
2657 for key in sorted(self.feed_dict.keys()):
~\Anaconda3\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: setting an array element with a sequence.
Please do not suggest expanding each value in the one-hot arrays into its own column. This example is a trimmed-down version of my dataset, which contains 6-8 categorical columns, and some of the one-hot arrays have 5000+ elements, so that is not a feasible solution for me. I'm looking to refine my Sequential model (or overhaul the Keras model completely) so that it can process categorical data along with numerical data.
Remember, the training labels are a 1D array of 0/1 values. I need the numerical and categorical training data together to predict a single set of outcomes; I can't have one set of predictions from the numerical data and another from the categorical data.
Answer 1:
If flattening the 5000+ element one-hot encoded array is a problem, consider using an embedding as the first layer instead. You can also define the model with the functional API rather than the Sequential API, so that it takes two inputs: one for the numerical data and one for the categorical data. The categorical input goes through the embedding, and its output is joined to the numerical input by a concatenate layer. From there on, the model proceeds as it currently does (the 1024-unit layer, and so on).
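The two-input idea could look roughly like the sketch below. It is an illustration under several assumptions, not a drop-in fix: the data is synthetic, the helper names split_features and build_model are hypothetical, the sizes (64 rows, 50 categories, a 16-dimensional embedding) are arbitrary, and the layers use Keras default initializers because init_orth and init_0 are not shown in the question. It also assumes each one-hot array contains exactly one 1, so argmax recovers the category as a single integer. That integer column is what the Embedding layer consumes, so no 5000-wide expansion is needed, and the object-dtype array that triggers the ValueError is never passed to fit().

```python
import numpy as np
from tensorflow import keras


def split_features(values, n_numeric=4):
    """Split an object-dtype array into a float matrix and integer category ids.

    Assumes the column right after the numeric ones holds a one-hot array per row.
    """
    numeric = np.asarray(values[:, :n_numeric], dtype="float32")
    one_hot = np.stack(list(values[:, n_numeric]))   # shape (rows, n_categories)
    cat_index = one_hot.argmax(axis=1).astype("int32").reshape(-1, 1)
    return numeric, cat_index


def build_model(n_numeric, n_categories, embed_dim=16):
    """Functional-API model: numeric input + embedded categorical input."""
    num_in = keras.Input(shape=(n_numeric,), name="numeric")
    cat_in = keras.Input(shape=(1,), dtype="int32", name="category")

    # Embedding maps each category id to a dense vector: (batch, 1, embed_dim).
    emb = keras.layers.Embedding(input_dim=n_categories, output_dim=embed_dim)(cat_in)
    emb = keras.layers.Flatten()(emb)                # (batch, embed_dim)

    x = keras.layers.Concatenate()([num_in, emb])
    for units in (1024, 512, 256, 128, 64):          # same widths as the question
        x = keras.layers.Dense(units, activation="relu")(x)
    out = keras.layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=[num_in, cat_in], outputs=out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


# Synthetic stand-in for df.values: 4 numeric columns + 1 one-hot column.
rng = np.random.default_rng(0)
n_rows, n_categories = 64, 50
values = np.empty((n_rows, 5), dtype=object)
values[:, :4] = rng.random((n_rows, 4))
for i in range(n_rows):
    row = np.zeros(n_categories, dtype=int)
    row[rng.integers(n_categories)] = 1
    values[i, 4] = row
labels = rng.integers(0, 2, size=n_rows).astype("float32")

numeric, cat_index = split_features(values)
model = build_model(n_numeric=4, n_categories=n_categories)
model.fit([numeric, cat_index], labels, epochs=1, batch_size=32, verbose=0)
preds = model.predict([numeric, cat_index], verbose=0)
```

With 6-8 categorical columns, the same pattern would presumably be repeated: one integer input and one Embedding per column, all concatenated with the numerical input before the first Dense layer. The model still produces a single prediction per row, since every input feeds the same stack of Dense layers.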
Source: https://stackoverflow.com/questions/55250124/mixing-numerical-and-categorical-data-into-keras-sequential-model-with-dense-lay