How to specify the directory that dask uses for temporary files?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-10 16:00:29

Question


Apparently, dask writes to the /tmp folder during disk-based shuffle operations. On the system I am using, this folder is mounted on a very small partition (30 GB), which causes the following error after some calculations:

IOError: [Errno 28] No space left on device

Traceback

  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 263, in execute_task
    result = _execute_task(task, data)
  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 245, in _execute_task
    return func(*args2)
  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/dataframe/shuffle.py", line 395, in shuffle_group_3
    p.append(d, fsync=True)
  File "[path_to_anaconda]/lib/python2.7/site-packages/partd/encode.py", line 25, in append
    self.partd.append(data, **kwargs)
  File "[path_to_anaconda]/lib/python2.7/site-packages/partd/file.py", line 41, in append
    f.write(v)

How can I specify the folder that dask uses for the shuffle? What else could I do to avoid this problem? I do not have administrative privileges, so mounting /tmp on a bigger partition is not an option.

So far, I have only seen the /tmp folder grow. At which point does dask delete the files?


Answer 1:


Setting TMPDIR could cause problems, as it might also affect other applications. An alternative is to use dask.config.set:

>>> import dask
>>> with dask.config.set({'temporary_directory': '/path/to/tmp'}):
...     pass

You could also add the line

temporary_directory: /path/to/tmp

to .dask/config.yaml in your home directory (see the dask configuration docs).
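
Note that the context-manager form only applies inside the with block, so the shuffle itself has to run there. A minimal sketch that shows the scoping, assuming dask is installed; the scratch directory is a hypothetical stand-in for a folder on a larger partition:

```python
import tempfile

import dask

# Hypothetical stand-in for a directory on a partition with enough space.
scratch = tempfile.mkdtemp()

# Value configured before the block (None if nothing has been set).
before = dask.config.get("temporary_directory", None)

with dask.config.set({"temporary_directory": scratch}):
    # Run the disk-based computation here; the setting is active only
    # inside this block.
    inside = dask.config.get("temporary_directory")

# On leaving the block, the previous value is restored.
after = dask.config.get("temporary_directory", None)

print(inside)
print(after)
```

Calling dask.config.set(...) without the with statement keeps the setting for the rest of the process, which is probably what you want in a script.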




Answer 2:


Setting the TMPDIR environment variable to the desired location via export TMPDIR=/my/path seems to work.
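
To check that TMPDIR is actually picked up, you can look at what Python's tempfile module reports. I am assuming here that the shuffle files end up in tempfile's default directory when nothing else is configured, which would explain why the variable works. A minimal sketch using only the standard library; the resolve_tmpdir helper and the scratch directory are made up for illustration:

```python
import os
import tempfile


def resolve_tmpdir(path):
    """Point TMPDIR at `path` and return the temp dir Python then reports."""
    os.environ["TMPDIR"] = path
    # tempfile caches its choice; clear the cache so TMPDIR is read again.
    tempfile.tempdir = None
    return tempfile.gettempdir()


# Hypothetical stand-in for a directory on a larger partition.
scratch = tempfile.mkdtemp()

resolved = resolve_tmpdir(scratch)
print(resolved)
```

In practice you would export TMPDIR in the shell before starting Python, so the cache reset is not needed; it is only there because this sketch changes the variable inside an already running interpreter.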



Source: https://stackoverflow.com/questions/40042748/how-to-specify-the-directory-that-dask-uses-for-temporary-files
