Slicing a Dask Dataframe

Posted by 时间秒杀一切 on 2019-12-09 13:05:03

Question


I have the following code, in which I would like to do a train/test split on a Dask dataframe:

import dask.dataframe as dd

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1",
                 names=cols, header=0, dtype='str')

But when I try to do slices like

for train, test in cv.split(X, y):
    df.fit(X[train], y[train])

it fails with the error

KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'

Any ideas?


Answer 1:


Dask.dataframe doesn't support positional row-wise slicing, that is, selecting rows by an array of integer positions. It does support the label-based loc operation if you have a sensible index.

However, in your case of train/test splitting, you will probably be better served by the random_split method.

train, test = df.random_split([0.80, 0.20])

You could also make several splits and concatenate them in different ways, for example to build cross-validation folds:

splits = df.random_split([0.20, 0.20, 0.20, 0.20, 0.20])

for i in range(5):
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0)
    test = splits[i]


Source: https://stackoverflow.com/questions/44475492/slicing-a-dask-dataframe
