Question
I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is:
import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster, memory_limit='1GB')
# read the CSV in ~250 MB blocks and compute the new column per partition
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)
However, when I use large files (i.e. > 15 GB) I run into a memory error when saving the dataframe to CSV with:
df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
I tried setting chunksize=1000000 to see if it would help, but it didn't.
The full stack trace is:
Traceback (most recent call last):
File "/home/david/data/pointframes/examples/dask_z-order.py", line 44, in <module>
calc_zorder(fp, save_dir)
File "/home/david/data/pointframes/examples/dask_z-order.py", line 31, in calc_zorder
df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 1159, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.py", line 654, in to_csv
delayed(values).compute(scheduler=scheduler)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 398, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 76, in get
pack_exception=pack_exception, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 459, in get_async
raise_exception(exc, tb)
File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 230, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 426, in collect
res = p.get(part)
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
return self.get([keys], **kwargs)[0]
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
return self._get(keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
for chunk in raw]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 175, in deserialize
for (h, b) in zip(headers[2:], bytes[2:])]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 136, in block_from_header_bytes
copy=True).reshape(shape)
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 126, in deserialize
result = result.copy()
MemoryError
I am running dask v1.1.0 on an Ubuntu 18.04 system with Python 2.7. My computer has 32 GB of memory. The code works as expected with small files that fit into memory anyway, but fails with larger ones. Is there something I am missing here?
Answer 1:
I encourage you to try smaller chunks of data. You should control this in the read_csv part of your computation rather than the to_csv part.
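For example, a minimal sketch (reusing the filepath, save_dir, filename, and some_function placeholders from the question) that shrinks the partitions at read time, e.g. blocksize=25e6 for roughly 25 MB partitions instead of 250 MB, so each worker holds much less data during the shuffle triggered by set_index and during the final write:

import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster)

# Smaller blocks at read time -> smaller partitions everywhere downstream.
# 25e6 (~25 MB) is an illustrative value; tune it to your machine.
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e6)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)

# chunksize in to_csv only controls how many rows pandas writes per batch;
# peak memory is governed by the partition size chosen via blocksize above.
df.to_csv(os.path.join(save_dir, filename + '_*.csv'), index=False)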
Source: https://stackoverflow.com/questions/54459056/dask-memory-error-when-running-df-to-csv