Question
I have a single large file with 40,955,924 lines (>13 GB). I need to split it into individual files based on a single field. If I were using a pd.DataFrame, I would do this:
for k, v in df.groupby(['id']):
    v.to_csv(k, sep='\t', header=True, index=False)
However, running this on a Dask DataFrame raises KeyError: 'Column not found: 0'.
There is a solution to this specific error in "Iterate over GroupBy object in dask", but it requires using pandas to store a copy of the dataframe, which I cannot do. Any help splitting this file up would be greatly appreciated.
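(For context, the question does not show how the file was loaded into Dask; a minimal sketch of one way to do it, where the file name, separator, and blocksize are illustrative assumptions rather than details from the question:

import dask.dataframe as dd

# Hypothetical load: a tab-separated file read in ~256 MB chunks,
# so the 13+ GB file never has to fit in memory at once.
df = dd.read_csv('big_file.tsv', sep='\t', blocksize='256MB')
)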
Answer 1:
You want to use apply() for this:
def do_to_csv(df):
    # df.name holds the group key, so each group is written to its own file
    df.to_csv(df.name, sep='\t', header=True, index=False)
    return df

df.groupby(['id']).apply(do_to_csv, meta=df._meta).size.compute()
Note:
- the group key is stored in the dataframe's name attribute
- we return the dataframe and supply a meta; this is not strictly necessary, but you will need to compute on something, and it is convenient to know exactly what that thing is
- the final output will be the number of rows written
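Putting the pieces together, a self-contained sketch of the whole workflow (the input file name and separator are assumptions for illustration; the groupby/apply part follows the answer above):

import dask.dataframe as dd

def do_to_csv(df):
    # each group arrives as a pandas DataFrame; its .name attribute is the
    # group key, used here directly as the output file name
    df.to_csv(df.name, sep='\t', header=True, index=False)
    return df

df = dd.read_csv('big_file.tsv', sep='\t')  # assumed name and separator
# compute() forces the lazy per-group writes to actually run
df.groupby(['id']).apply(do_to_csv, meta=df._meta).size.compute()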
Source: https://stackoverflow.com/questions/51754608/export-dask-groups-to-csv