Question
I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is:
import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster, memory_limit='1GB')
# read the CSV in ~250 MB blocks and compute the new column per partition
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)
However, when I use large files (i.e. > 15 GB) I run into a memory error when saving the dataframe to CSV with:
df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
I tried setting chunksize=1000000 to see if it would help, but it didn't.
The full stack trace is:
Traceback (most recent call last):
File "/home/david/data/pointframes/examples/dask_z-order.py", line 44, in <module>
calc_zorder(fp, save_dir)
File "/home/david/data/pointframes/examples/dask_z-order.py", line 31, in calc_zorder
df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 1159, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.py", line 654, in to_csv
delayed(values).compute(scheduler=scheduler)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 398, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 76, in get
pack_exception=pack_exception, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 459, in get_async
raise_exception(exc, tb)
File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 230, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 426, in collect
res = p.get(part)
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
return self.get([keys], **kwargs)[0]
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
return self._get(keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
for chunk in raw]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 175, in deserialize
for (h, b) in zip(headers[2:], bytes[2:])]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 136, in block_from_header_bytes
copy=True).reshape(shape)
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 126, in deserialize
result = result.copy()
MemoryError
I am running dask v1.1.0 on an Ubuntu 18.04 system with Python 2.7. My computer has 32 GB of memory. The code works as expected with small files that fit into memory anyway, but fails with larger ones. Is there something I am missing here?
Answer 1:
I encourage you to try smaller chunks of data. You should control this in the read_csv part of your computation rather than the to_csv part.
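For example, a minimal sketch (reusing the filepath, save_dir, filename, and some_function placeholders from the question) that shrinks the partitions at read time, e.g. blocksize=25e6 for roughly 25 MB partitions instead of 250 MB, so each worker holds much less data during the shuffle triggered by set_index and during the final write:

import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster)

# Smaller blocks at read time -> smaller partitions everywhere downstream.
# 25e6 (~25 MB) is an illustrative value; tune it to your machine.
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e6)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)

# chunksize in to_csv only controls how many rows pandas writes per batch;
# peak memory is governed by the partition size chosen via blocksize above.
df.to_csv(os.path.join(save_dir, filename + '_*.csv'), index=False)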
Source: https://stackoverflow.com/questions/54459056/dask-memory-error-when-running-df-to-csv