NaNs suddenly appearing for sklearn KFolds

妖精的绣舞 posted on 2019-12-08 19:40:35

To solve this, use .iloc instead of .ix to index your pandas DataFrame:

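from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumes kf is a sklearn.model_selection.KFold; the older
# sklearn.cross_validation.KFold was iterated directly (for ... in kf).
for train_index, val_index in kf.split(X_train):
    # KFold yields positions, not labels, so index by position with .iloc
    cv_train_x = X_train.iloc[train_index]
    cv_val_x = X_train.iloc[val_index]
    # Assumes y_train is a pandas Series; a NumPy array takes y_train[train_index]
    cv_train_y = y_train.iloc[train_index]
    cv_val_y = y_train.iloc[val_index]
    print(cv_train_x)

    logreg = LogisticRegression(C=0.01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print(accuracy_score(cv_val_y, pred))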

Indexing with .ix is usually equivalent to using .loc, which is label based, not position based. .loc works on X, which has a clean integer index where labels and positions coincide, but after the split this no longer holds for X_train, and you get something like:

        length       tempo  variation
4   509.931973  135.999178   0.001631
2   397.500952  112.347147   0.008146
7   502.083628   99.384014   0.009262
6   763.377778  107.666016   0.002513
5   560.365714  151.999081   0.001620
3  1109.819501  172.265625   0.005367
9   269.001723  117.453835   0.000733

and now you no longer have label 0 or 1, so if you do

X_train.loc[1]

you will get an Exception

KeyError: 'the label [1] is not in the [index]'
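The difference between label and position lookups can be reproduced on a tiny frame. This is only a minimal sketch: the frame below is a made-up stand-in for X_train with a shuffled index, and the column values are illustrative, not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for X_train: labels 0 and 1 were left out by the split.
X_train = pd.DataFrame(
    {"length": [509.9, 397.5, 502.1], "tempo": [136.0, 112.3, 99.4]},
    index=[4, 2, 7],
)

# Position based: the second row, whatever its label happens to be.
row_by_position = X_train.iloc[1]

# Label based: the row whose index label is 2 (the same row in this frame).
row_by_label = X_train.loc[2]

# Label 1 does not exist, so a single-label lookup raises KeyError.
try:
    X_train.loc[1]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(row_by_position)
print(missing_label_raised)
```

The indices that KFold produces are positions, so they match the .iloc lookup here, not the .loc one.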

However, older pandas versions fail silently if you request multiple labels and at least one of them exists (pandas 1.0 and later raise a KeyError here instead). Thus if you do

X_train.loc[[1, 4]]

you will get

       length       tempo  variation
1         NaN         NaN        NaN
4  509.931973  135.999178   0.001631

As expected, label 1 returns NaNs (since it was not found), while 4 is an actual row, since it is in X_train. To solve it, just switch to .iloc or manually rebuild the index of X_train.
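Both fixes can be sketched on a small made-up frame. Two assumptions here: reindex is used as a stand-in for the NaN-filling lookup, because recent pandas versions raise a KeyError for a .loc list lookup with missing labels, and "rebuilding the index" is read as reset_index(drop=True), which is one way to do it, not necessarily the one the answer had in mind.

```python
import pandas as pd

# Hypothetical stand-in for X_train with a shuffled, gappy index.
X_train = pd.DataFrame(
    {"length": [509.9, 397.5, 502.1], "tempo": [136.0, 112.3, 99.4]},
    index=[4, 2, 7],
)
positions = [0, 2]  # the kind of positional indices KFold hands back

# Treating positions as labels: label 0 is missing and becomes a NaN row,
# label 2 exists but is the wrong row (it sits at position 1).
by_label = X_train.reindex(positions)

# Fix 1: index by position.
by_position = X_train.iloc[positions]

# Fix 2: rebuild the index as 0..n-1 so labels and positions coincide again.
X_reset = X_train.reset_index(drop=True)
by_label_after_reset = X_reset.loc[positions]

print(by_label)
print(by_position)
print(by_label_after_reset)
```

The second fix discards the original labels, so .iloc is the safer choice if those labels are needed later.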
