Dask scheduler empty / graph not showing

£可爱£侵袭症+ submitted on 2020-12-15 06:40:00

Question


I have a setup as follows:

# etl.py
from dask.distributed import Client
import dask
from tasks import task1, task2, task3

def runall(**kwargs):
    print("done")


def etl():
    client = Client()

    tasks = {}
    tasks['task1'] = dask.delayed(task1)(*args)  # *args stands in for each task's real arguments
    tasks['task2'] = dask.delayed(task2)(*args)
    tasks['task3'] = dask.delayed(task3)(*args)

    out = dask.delayed(runall)(**tasks)
    out.compute()

This logic was borrowed from luigi and works nicely with if statements to control which tasks to run.

However, some of the tasks load large amounts of data from SQL and cause GIL-freeze warnings (at least that is my suspicion; it is hard to diagnose exactly which line causes the issue). Sometimes the graph / monitoring dashboard on port 8787 shows nothing, just "scheduler empty", and I suspect this is caused by the app freezing Dask. What is the best way to load large amounts of data from SQL in Dask (MSSQL and Oracle)? At the moment this is done with SQLAlchemy with tuned settings. Would adding async and await help?

However, some of the tasks are a bit slow, and I'd like to use things like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag? The entire script runs on a single 40-core machine.

Using bag.starmap I get a graph like this:

[task graph screenshot from the dashboard]

where the upper straight lines are added/discovered once the computation reaches that task and compute is called inside it.


Answer 1:


There appears to be no rhyme or reason to it other than that the machine was/is very busy and struggling to show the state updates and Bokeh plots as desired.



Source: https://stackoverflow.com/questions/64911735/dask-scheduler-empty-graph-not-showing
