Question
Apparently, dask writes to the /tmp folder during disk-based shuffle operations. On the system I am using, this folder is mounted on a very small partition (30GB), which causes the following error after some calculations:
IOError: [Errno 28] No space left on device
Traceback
File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "[path_to_anaconda]/lib/python2.7/site-packages/dask/dataframe/shuffle.py", line 395, in shuffle_group_3
p.append(d, fsync=True)
File "[path_to_anaconda]/lib/python2.7/site-packages/partd/encode.py", line 25, in append
self.partd.append(data, **kwargs)
File "[path_to_anaconda]/lib/python2.7/site-packages/partd/file.py", line 41, in append
f.write(v)
How can I specify the folder that dask uses for the shuffle? What else could I do to avoid this problem? I do not have administrative privileges, therefore mounting /tmp to something bigger is not an option.
So far, I only saw the /tmp folder grow bigger. At which point does dask delete the files?
Answer 1:
Setting TMPDIR could potentially cause problems, as it might also affect other applications. An alternative is to use dask.config.set:
>>> import dask
>>> with dask.config.set({'temporary_directory': '/path/to/tmp'}):
... pass
You could also add the line
temporary_directory: /path/to/tmp
to .dask/config.yaml in your home directory (see the dask configuration docs).
Answer 2:
Setting the TMPDIR environment variable to the desired location via export TMPDIR=/my/path seems to work.
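To see why this can work, here is a minimal sketch of how Python's standard tempfile module resolves its directory from TMPDIR. It assumes that the shuffle code creates its scratch directory through tempfile, which the traceback does not confirm. The helper name tempdir_with_tmpdir and the scratch directory are made up for illustration. In real use, scratch would be a directory on a larger partition.

```python
import os
import tempfile


def tempdir_with_tmpdir(path):
    """Return the directory tempfile resolves while TMPDIR is set to `path`.

    Hypothetical helper for illustration only; it restores the previous
    environment and tempfile cache before returning.
    """
    old_env = os.environ.get("TMPDIR")
    old_cached = tempfile.tempdir
    try:
        os.environ["TMPDIR"] = path
        # tempfile caches its choice after the first lookup, so clear the
        # cache to force TMPDIR to be read again.
        tempfile.tempdir = None
        return tempfile.gettempdir()
    finally:
        if old_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = old_env
        tempfile.tempdir = old_cached


# Stand-in for a writable directory on a bigger partition.
scratch = tempfile.mkdtemp()
resolved = tempdir_with_tmpdir(scratch)
print(os.path.samefile(resolved, scratch))
```

Because tempfile caches the directory on first use, TMPDIR has to be set before the Python process starts (as with export), or before anything in the process asks for a temporary directory.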
Source: https://stackoverflow.com/questions/40042748/how-to-specify-the-directory-that-dask-uses-for-temporary-files