How to split data into 3 sets (train, validation and test)?

无人及你 2020-11-22 15:03

I have a pandas dataframe and I want to split it into 3 separate sets. I know that with train_test_split from sklearn.cross_validation, one can divide the data

7 answers
  •  遥遥无期
    2020-11-22 15:45

    In supervised learning, you usually need to split both X and y (where X is your input and y is the ground-truth output). The one thing to watch is that X and y are shuffled in the same order before splitting, so that every row of X stays paired with its own label.

    There are two cases. If X and y are in the same dataframe, shuffle it, separate X from y, and split each one (just like in the chosen answer). If X and y are in two different dataframes, shuffle X, reorder y to match the shuffled X, and split each one.

    import numpy as np

    # 1st case: df contains X and y (where y is the "target" column of df)
    df_shuffled = df.sample(frac=1)
    X_shuffled = df_shuffled.drop("target", axis=1)
    y_shuffled = df_shuffled["target"]

    # 2nd case (use instead of the 1st): X and y are separate objects sharing the same index
    X_shuffled = X.sample(frac=1)
    y_shuffled = y.loc[X_shuffled.index]  # .loc reorders y's rows by label to match X

    # We do the split as in the chosen answer: 60% train, 20% validation, 20% test
    n = len(X_shuffled)
    X_train, X_validation, X_test = np.split(X_shuffled, [int(0.6 * n), int(0.8 * n)])
    y_train, y_validation, y_test = np.split(y_shuffled, [int(0.6 * n), int(0.8 * n)])
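
For anyone who wants to try this end to end, here is a minimal sketch of the second case on toy data. The helper name `split_three_ways` and its parameters (`train_frac`, `validation_frac`, `seed`) are made up for illustration, and it assumes X and y share the same unique index. It cuts with plain `.iloc` slices instead of `np.split`, since some recent pandas versions appear to emit a deprecation warning when `np.split` is given a DataFrame.

```python
import pandas as pd


def split_three_ways(X, y, train_frac=0.6, validation_frac=0.2, seed=0):
    """Hypothetical helper: shuffle X and y together, then cut into three parts."""
    # Shuffle X reproducibly, then reorder y by index label to match it.
    X_shuffled = X.sample(frac=1, random_state=seed)
    y_shuffled = y.loc[X_shuffled.index]

    # Same cut points as in the answer: 60% and 80% of the rows by default.
    n = len(X_shuffled)
    train_end = int(train_frac * n)
    validation_end = int((train_frac + validation_frac) * n)

    def cut(obj):
        # Positional slices; the test part gets whatever rows remain.
        return (
            obj.iloc[:train_end],
            obj.iloc[train_end:validation_end],
            obj.iloc[validation_end:],
        )

    return cut(X_shuffled), cut(y_shuffled)


# Toy data: the label is derived from the feature, so alignment is easy to check.
X = pd.DataFrame({"feature": range(10)})
y = X["feature"] * 2

(X_train, X_validation, X_test), (y_train, y_validation, y_test) = split_three_ways(X, y)
print(len(X_train), len(X_validation), len(X_test))
```

Passing a fixed `seed` only makes the shuffle repeatable between runs; drop it (or pass `None`) if you want a different split each time.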
    
