Dask opportunistic caching in custom graphs

Question

I have a custom DAG such as:

dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp')}

I want to compute the keys simple_comp_1, simple_comp_2, and simple_comp_3, which I do as follows:

import dask
from dask.distributed import Client
from dask_yarn import YarnCluster

task_1 = dask.get(dag, 'simple_comp_1')
task_2 = dask.get(dag, 'simple_comp_2')
task_3 = dask.get(dag, 'simple_comp_3')
tasks = [task_1, task_2, task_3]

cluster = YarnCluster()
cluster.scale(3)
client = Client(cluster)
dask.compute(tasks)
cluster.shutdown()

It seems that, without caching, computing these 3 keys will also compute heavy_comp 3 times. Since this is a heavy computation, I tried to implement opportunistic caching as follows:

from dask.cache import Cache
cache = Cache(2e9)
cache.register()

However, when I tried to print the results of what was being cached I got nothing:

>>> cache.cache.data
[]
>>> cache.cache.heap.heap
{}
>>> cache.cache.nbytes
{}

I even tried increasing the cache size to 6 GB, but to no effect. Am I doing something wrong? How can I get Dask to cache the result of the key heavy_comp?

Answer 1:

This expands on MRocklin's answer and formats the code from the comments below the question.

Computing the entire graph at once works as you would expect: heavy_comp is executed only once, which is what you want. Consider the code you provided in the comments, completed with stub function definitions:

def load(fn):
    print('load')
    return fn

def sc_1(i):
    print('sc_1')
    return i

def sc_2(i):
    print('sc_2')
    return i

def sc_3(i):
    print('sc_3')
    return i

def heavy_comp(i):
    print('heavy_comp')
    return i

def merge(*args):
    print('merge')
    return args

dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp'),
       # merge must reference the graph keys, not the function names
       'merger_comp': (merge, 'simple_comp_1', 'simple_comp_2', 'simple_comp_3')}

import dask
result = dask.get(dag, 'merger_comp')
print('result:', result)

It outputs:

load
heavy_comp
sc_1
sc_2
sc_3
merge
result: ('myfile.txt', 'myfile.txt', 'myfile.txt')

As you can see, "heavy_comp" is printed only once, showing that the function heavy_comp was executed only once. (The relative order of the sc_1, sc_2, and sc_3 lines may vary.)
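If a merge step is not wanted, the three keys can also be requested in a single call by passing a list of keys. The sketch below assumes the same graph shape as in the question; the function bodies are stand-ins invented for illustration, and the calls list is a hypothetical helper that records how often heavy_comp runs.

```python
import dask

calls = []  # hypothetical helper: records each execution of heavy_comp


def load(fn):
    return fn


def heavy_comp(data):
    calls.append("heavy_comp")
    return len(data)  # stand-in for the expensive step


def sc_1(x):
    return x + 1


def sc_2(x):
    return x + 2


def sc_3(x):
    return x + 3


dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp')}

# One call with a list of keys: the shared heavy_comp task runs once.
results = dask.get(dag, ['simple_comp_1', 'simple_comp_2', 'simple_comp_3'])

print(list(results))
print(len(calls))
```

Calling dask.get three times, once per key, would instead run heavy_comp on every call, because each call executes the graph from scratch.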

Answer 2:

The opportunistic cache in the core Dask library only works for the single-machine scheduler, not the distributed scheduler.

However, if you compute the entire graph at once, Dask will hold onto intermediate values intelligently. If there are values that you would like to hold onto regardless, you might also look at the persist function.
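Both suggestions can be sketched with dask.delayed, which is one possible way to express the same graph (the question uses a raw dict instead). The function bodies and the calls list are hypothetical stand-ins, and scheduler="synchronous" is used only so the sketch runs locally; with a distributed Client active, that argument would presumably be omitted so the work goes to the cluster.

```python
import dask
from dask import delayed

calls = []  # hypothetical helper: records each execution of heavy_comp


def load(fn):
    return fn


def heavy_comp(data):
    calls.append("heavy_comp")
    return len(data)  # stand-in for the expensive step


def sc_1(x):
    return x + 1


def sc_2(x):
    return x + 2


def sc_3(x):
    return x + 3


loaded = delayed(load)("myfile.txt")
heavy = delayed(heavy_comp)(loaded)
outputs = [delayed(f)(heavy) for f in (sc_1, sc_2, sc_3)]

# Option 1: compute everything in one call; heavy_comp runs once.
together = dask.compute(*outputs, scheduler="synchronous")
runs_together = len(calls)

# Option 2: persist heavy_comp, then reuse it in separate computes.
calls.clear()
(heavy_kept,) = dask.persist(heavy, scheduler="synchronous")
separate = [delayed(f)(heavy_kept).compute(scheduler="synchronous")
            for f in (sc_1, sc_2, sc_3)]
runs_persisted = len(calls)

print(list(together), runs_together)
print(separate, runs_persisted)
```

The first option fits when all outputs are known up front; persist is the better fit when later computations are decided after the expensive result already exists.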

Source: https://stackoverflow.com/questions/56959647/dask-opportunistic-caching-in-custom-graphs

Tags

dask