dask-delayed

Long-running workers blocking the GIL cause timeout errors

Submitted by 耗尽温柔 on 2021-02-18 18:55:47
Question: I'm using dask-distributed with a local setup (LocalCluster with 5 workers) on a dask.delayed workload. Most of the work is done by the vtk Python bindings. Since vtk is C++ based, I think that means the workers don't release the GIL during long-running statements. When I run the workload, my terminal prints a bunch of errors like this: Traceback (most recent call last): File "C:\Users\patri\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\comm\core.py", line 221, in
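
One common mitigation (a sketch under assumptions, not taken from the question): run the workers as separate processes with a single thread each, so a vtk call that holds the GIL can only stall its own worker process, and raise distributed's communication timeouts so a busy worker is not declared dead before it can heartbeat again. The cluster arguments and config keys below are standard dask/distributed options; the worker count of 5 simply mirrors the question.

    import dask
    from dask.distributed import Client, LocalCluster

    # Give a stuck worker more slack before the scheduler gives up on it.
    dask.config.set({
        "distributed.comm.timeouts.connect": "60s",
        "distributed.comm.timeouts.tcp": "120s",
    })

    # One thread per worker process: a long-running C++ call that never
    # releases the GIL blocks only its own process, not the whole pool.
    cluster = LocalCluster(n_workers=5, threads_per_worker=1, processes=True)
    client = Client(cluster)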

How should I write multiple CSV files efficiently using dask.dataframe?

Submitted by 丶灬走出姿态 on 2021-02-10 04:46:22
Question: Here is a summary of what I'm doing. At first, I do this with normal multiprocessing and the pandas package: Step 1. Get the list of file names I'm going to read: import os files = os.listdir(DATA_PATH + product) Step 2. Loop over the list: from multiprocessing import Pool import pandas as pd def readAndWriteCsvFiles(file): ### Step 2.1 read csv file into dataframe data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False) ### Step 2.2 do some
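
A minimal sketch of the same per-file read/process/write loop expressed with dask.delayed (DATA_PATH, product, and the output directory are placeholders mirroring the question; the processing in Step 2.2 is truncated in the excerpt, so it is left as a comment):

    import os
    import pandas as pd
    import dask
    from dask import delayed

    DATA_PATH = "/path/to/data/"   # placeholder, as in the question
    product = "product"            # placeholder
    OUT_PATH = "/path/to/output/"  # hypothetical output location

    @delayed
    def read_and_write_csv(file):
        # Step 2.1: read one csv file into a dataframe
        data = pd.read_csv(DATA_PATH + product + "/" + file,
                           parse_dates=True, infer_datetime_format=False)
        # Step 2.2: do some processing (truncated in the question)
        # Step 2.3: write the result out as its own csv file
        data.to_csv(OUT_PATH + product + "/" + file, index=False)
        return file

    files = os.listdir(DATA_PATH + product)
    dask.compute(*[read_and_write_csv(f) for f in files])  # one parallel task per file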

How to feed a conv2d net with a large npy file without overwhelming the RAM?

Submitted by 丶灬走出姿态 on 2021-01-29 07:36:25
Question: I have a large dataset in .npy format of size (500000, 18). In order to feed it into a conv2D net using a generator, I split it into X and y and reshape them to the shapes (-1, 96, 10, 10, 17) and (-1, 1), respectively. However, when I feed it into the model I get a memory error: 2020-08-26 14:37:03.691425: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 462080 totalling 451.2KiB 2020-08-26 14:37:03.691432: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks
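
A common way to avoid holding the whole array in RAM (a sketch under assumptions, not the question's own code): memory-map the .npy file with numpy and yield small batches from a generator, so only the slice currently being fed to the network is copied into memory. The batch size, file path, and the split of the 18 columns into features and label are assumptions; the question's exact reshape is left as a comment because the excerpt is truncated.

    import numpy as np

    def batch_generator(path, batch_size=1024):
        # Memory-map the .npy file: only the rows actually sliced are read
        # from disk, so the full (500000, 18) array never sits in RAM at once.
        data = np.load(path, mmap_mode="r")
        n = data.shape[0]
        while True:  # endless generator, as Keras' fit(...) expects
            for start in range(0, n, batch_size):
                chunk = np.asarray(data[start:start + batch_size])  # copy only this slice
                # Assumed split: first 17 columns are features, last column is the label.
                # Apply the question's (-1, 96, 10, 10, 17) / (-1, 1) reshape here once
                # the batch boundaries line up with that sample size.
                X, y = chunk[:, :17], chunk[:, 17:]
                yield X, y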

dask handle delayed failures

Submitted by 断了今生、忘了曾经 on 2020-12-16 02:25:11
Question: How can I port the following function to dask in order to parallelize it? from time import sleep from dask.distributed import Client from dask import delayed client = Client(n_workers=4) from tqdm import tqdm tqdm.pandas() # linear things = [1,2,3] _x = [] _y = [] def my_slow_function(foo): sleep(2) x = foo y = 2 * foo assert y < 5 return x, y for foo in tqdm(things): try: x_v, y_v = my_slow_function(foo) _x.append(x_v) if y_v is not None: _y.append(y_v) except AssertionError: print(f'failed:
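
One way to port that loop (a sketch, not necessarily the accepted answer): move the try/except inside the function that gets delayed, so a failing task returns a sentinel instead of raising, then filter the results after dask.compute. The failure message is assumed, since the excerpt is truncated.

    from time import sleep
    import dask
    from dask import delayed
    from dask.distributed import Client

    client = Client(n_workers=4)
    things = [1, 2, 3]

    def my_slow_function(foo):
        sleep(2)
        x = foo
        y = 2 * foo
        assert y < 5
        return x, y

    @delayed
    def safe_call(foo):
        # Catch the failure inside the task so one bad input does not
        # poison the whole graph; None marks a failed item.
        try:
            return my_slow_function(foo)
        except AssertionError:
            print(f"failed: {foo}")  # message text assumed; truncated in the question
            return None

    results = dask.compute(*[safe_call(foo) for foo in things])
    _x = [r[0] for r in results if r is not None]
    _y = [r[1] for r in results if r is not None]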
