Pandas dataframes too large to append to dask dataframe?

Submitted by 不羁岁月 on 2020-08-20 02:37:24

Question


I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dask dataframe, but I keep running into memory issues. I've already increased the memory limit in Jupyter. It seems I may be missing something when creating the dask dataframe, as it appears to crash my notebook after completely filling my RAM. Any pointers?

Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'),npartitions = 8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')
  • I've tried a variety of npartitions, but no value has allowed the code to finish running.
  • All in all there is about 30GB of pickled dataframes I'd like to combine.
  • Perhaps this is not the right library, but the docs suggest dask should be able to handle this.

Answer 1:


Have you considered first converting the pickle files to parquet and then loading them with dask? Below I assume that all your data is in a folder called raw and that you want to move it to processed.

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # optionally adjust dtypes here before writing
    
    df.to_parquet(fn_out, index=False)

fldr_in = 'raw'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]
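
The glob-based variant mentioned in the comment above could look like this (a minimal sketch, assuming all the input files end in .pickle):

import glob

# collect the full paths of all pickle files in one step
fns = glob.glob(os.path.join(fldr_in, '*.pickle'))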

If you know that no more than one file fits in memory at a time, you should use a plain loop:

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If you know that several files fit in memory at once, you can parallelize the conversion with delayed:

from dask import delayed, compute

# this is lazy: it only builds the task graph, nothing runs yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now the conversions actually run
out = compute(out)
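
Note that compute runs these delayed tasks with dask's default threaded scheduler; if the conversion turns out to be GIL-bound, passing scheduler='processes' to compute is one option.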

Once the parquet files are written, you can load them with dask and do your analysis.
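
For instance, a minimal sketch of loading the converted files (assuming they were written to processed as above):

import dask.dataframe as dd

# read all converted parquet files lazily into one dask dataframe
ddf = dd.read_parquet('processed/*.parquet', engine='pyarrow')

# e.g. write everything back out as a single parquet dataset, as in the question
ddf.to_parquet('alldata.parquet', engine='pyarrow')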



Source: https://stackoverflow.com/questions/63252135/pandas-dataframes-too-large-to-append-to-dask-dataframe
