Question
My dataset consists of time series (10080 values) and other descriptive statistics features (85) joined into one row. The DataFrame is 921 x 10166.
The data looks something like this, with the last 2 columns as Y (labels).
id x0 x1 x2 x3 x4 x5 ... x10079 mean var ... Y0 Y1
1 40 31.05 25.5 25.5 25.5 25 ... 33 24 1 1 0
2 35 35.75 36.5 26.5 36.5 36.5 ... 29 31 2 0 1
3 35 35.70 36.5 36.5 36.5 36.5 ... 29 25 1 1 0
4 40 31.50 23.5 24.5 26.5 25 ... 33 29 3 0 1
...
921 40 31.05 25.5 25.5 25.5 25 ... 23 33 2 0 1
I checked a few blogs and tutorials, which were helpful, but I am not sure how to deal with my input data, which I have divided into inputs_1 and inputs_2 as shown in the model below:
inputs_1 = keras.Input(shape=(10081,1))
layer1 = Conv1D(64,14)(inputs_1)
layer2 = layers.MaxPool1D(5)(layer1)
layer3 = Conv1D(64, 14)(layer2)
layer4 = layers.GlobalMaxPooling1D()(layer3)
inputs_2 = keras.Input(shape=(85,))
layer5 = layers.concatenate([layer4, inputs_2])
layer6 = Dense(128, activation='relu')(layer5)
layer7 = Dense(2, activation='softmax')(layer6)
model_2 = keras.models.Model(inputs = [inputs_1, inputs_2], outputs = [layer7])
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:10166], merge[['Result_cat','Result_cat1']].values, test_size=0.2)
X_train = X_train.to_numpy()
X_train = X_train.reshape([X_train.shape[0], X_train.shape[1], 1])
X_train_1 = X_train[:,0:10081,:]
X_train_2 = X_train[:,10081:10166,:].reshape(736,85)
X_test = X_test.to_numpy()
X_test = X_test.reshape([X_test.shape[0], X_test.shape[1], 1])
X_test_1 = X_test[:,0:10081,:]
X_test_2 = X_test[:,10081:10166,:].reshape(185,85)
adam = keras.optimizers.Adam(lr = 0.0005)
model_2.compile(loss = 'categorical_crossentropy', optimizer = adam, metrics = ['acc'])
history = model_2.fit([X_train_1,X_train_2], y_train, epochs = 120, batch_size = 256, validation_split = 0.2, callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)])
The reason for dividing the features into 2 parts is that inputs_1 is mainly the time series data, while inputs_2 is the descriptive statistics data. I thought it would be better to separate them given the different nature of the data. Please correct me if I'm wrong.
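The split described above can be sketched with NumPy slicing. This is only an illustration on a small made-up array: the names `n_steps`, `n_stats`, `series` and `stats` are hypothetical, and dropping the id column before the convolutional branch is an assumption, not something the original code does.

```python
import numpy as np

# Hypothetical small stand-in for the 921 x 10166 frame:
# column 0 = id, next n_steps = time series, last n_stats = descriptive statistics.
n_samples, n_steps, n_stats = 6, 10, 3
rng = np.random.default_rng(0)
data = rng.normal(size=(n_samples, 1 + n_steps + n_stats))

# Slice the 2-D array first, then add the channel axis only to the series block.
series = data[:, 1:1 + n_steps]        # drops the id column (an assumption)
stats = data[:, 1 + n_steps:]          # stays 2-D for the Dense branch

series_3d = series[..., np.newaxis]    # (n_samples, n_steps, 1) for Conv1D

print(series_3d.shape, stats.shape)
```

Slicing before reshaping avoids hard-coding row counts such as 736 or 185 in a later `reshape` call, since the statistics block never becomes 3-D.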
My question is: since my feature data is divided and treated separately in the original model, should I do the same in cross-validation (treat inputs_1 and inputs_2 separately)? In particular, for example, in Jason's model:
# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
The evaluation was done with scores = model.evaluate(X[test], Y[test], verbose=0), where X[test] and Y[test] were used. In my case, since I have inputs_1 and inputs_2 instead of X (as in the example model), should I use something like [inputs_1,inputs_2][test]?
Any advice is appreciated. Thanks
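For illustration, here is a minimal sketch of one possible approach: the `train` and `test` values yielded by `kfold.split` are integer index arrays, so each input array can be indexed with them separately and the results collected in a list. A plain Python list such as `[inputs_1, inputs_2]` cannot itself be indexed with an index array. The array names, sizes and labels below are made-up stand-ins, and the Keras calls appear only in comments so that the example needs just NumPy and scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-ins for the two inputs and the one-hot labels.
n_samples = 20
rng = np.random.default_rng(7)
X_series = rng.normal(size=(n_samples, 12, 1))   # Conv1D branch, 3-D
X_stats = rng.normal(size=(n_samples, 4))        # Dense branch, 2-D
y_class = np.array([0, 1] * (n_samples // 2))    # integer class per sample
y_onehot = np.eye(2)[y_class]                    # shape (n_samples, 2)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
fold_shapes = []
# X_stats is passed only so that split() knows the number of samples.
for train_idx, test_idx in kfold.split(X_stats, y_class):
    # Index every input with the same fold indices.
    train_inputs = [X_series[train_idx], X_stats[train_idx]]
    test_inputs = [X_series[test_idx], X_stats[test_idx]]
    # With a two-input Keras model these would be used as
    #   model.fit(train_inputs, y_onehot[train_idx], ...)
    #   model.evaluate(test_inputs, y_onehot[test_idx], ...)
    fold_shapes.append((train_inputs[0].shape, test_inputs[1].shape))

print(fold_shapes[0])
```

As in the example loop, the model would be rebuilt inside each fold so that weights from one fold do not carry over to the next.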
Update:
I tried to concatenate inputs_1 and inputs_2 with
con_x = np.concatenate((X_train_1,X_train_2), axis = 1)
and changed the first line of the cross-validation loop to
for train, test in kfold.split(con_x, Y):
but it returned
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-d53a7058d157> in <module>()
55 cvscores = []
---> 56 for train, test in kfold.split(con_x, Y):
57
58 inputs_1 = keras.Input(shape=(10080,1))
1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
537 if not allow_nd and array.ndim >= 3:
538 raise ValueError("Found array with dim %d. %s expected <= 2."
--> 539 % (array.ndim, estimator_name))
540 if force_all_finite:
541 _assert_all_finite(array,
ValueError: Found array with dim 3. Estimator expected <= 2.
But still, I am not sure whether it is valid to concatenate inputs_1 and inputs_2 like this.
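A possible way to generate the folds without building a combined array is sketched below. It relies on the scikit-learn documentation's note that `y` is sufficient for `StratifiedKFold.split`, so a placeholder such as `np.zeros(n_samples)` may stand in for `X`. It also assumes the labels are one-hot rows like the Y0/Y1 columns and collapses them to 1-D class ids with `argmax`, because stratification works on class labels rather than on an indicator matrix. The label values here are invented for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical one-hot labels shaped like the Y0/Y1 columns.
y_onehot = np.array([[1, 0], [0, 1]] * 10)
y_class = y_onehot.argmax(axis=1)      # 1-D class ids for stratification

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
placeholder = np.zeros(len(y_class))   # no feature values are needed to split

folds = list(kfold.split(placeholder, y_class))
for train_idx, test_idx in folds:
    # No sample index appears in both parts of a fold.
    assert np.intersect1d(train_idx, test_idx).size == 0

# Every sample lands in exactly one test fold.
all_test = np.sort(np.concatenate([test for _, test in folds]))
print(all_test.tolist() == list(range(len(y_class))))
```

The resulting index arrays can then be applied to each input array separately, so the 3-D time series block never has to be passed to scikit-learn.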
Source: https://stackoverflow.com/questions/59277549/how-to-do-cross-validation-with-multiple-input-data-in-cnn-model-with-keras