dask

Force dask to_parquet to write single file

Submitted by 天涯浪子 on 2020-06-17 10:02:19
Question: When using dask.to_parquet(df, filename), a subfolder named filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use Dask's to_parquet (without calling compute() to create a pandas DataFrame) to write just a single file?

Answer 1: Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor probably by any other parallel processing library). You could in theory perform the
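
A minimal sketch of the two usual workarounds (not an official Dask option): collapse to one partition, which still writes a directory but with a single part file, or fall back to pandas, which requires the result to fit in memory. The frame and output paths are placeholders, and a parquet engine (pyarrow or fastparquet) is assumed to be installed.

    import pandas as pd
    import dask.dataframe as dd

    # Placeholder frame standing in for the real data.
    ddf = dd.from_pandas(pd.DataFrame({"a": range(10)}), npartitions=4)

    # Option 1: collapse everything into one partition. Dask still writes
    # a directory, but it contains a single part file.
    ddf.repartition(npartitions=1).to_parquet("out_dir")

    # Option 2: materialise as pandas and let pandas write exactly one file
    # (only viable when the result fits in memory on one machine).
    ddf.compute().to_parquet("single_file.parquet")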

How To Do Model Predict Using Distributed Dask With a Pre-Trained Keras Model?

Submitted by £可爱£侵袭症+ on 2020-06-15 18:55:27
Question: I am loading my pre-trained Keras model and then trying to parallelize predictions over a large amount of input data using Dask. Unfortunately, I'm running into some issues relating to how I'm creating my Dask array. Any guidance would be greatly appreciated! Setup: first I cloned from this repo https://github.com/sanchit2843/dlworkshop.git Reproducible code example:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.pipeline import
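
One common pattern for this kind of problem, sketched under the assumption that the model is saved on disk and each chunk of rows can be scored independently: load the model inside the task rather than shipping the live Keras object. MODEL_PATH, the array shape and the chunking are placeholders, not taken from the repo above.

    import numpy as np
    import dask.array as da

    MODEL_PATH = "model.h5"  # hypothetical path to the pre-trained Keras model

    def predict_block(block):
        # Load the model inside the task so each worker reads it from disk
        # instead of relying on pickling the live Keras object.
        from tensorflow.keras.models import load_model
        model = load_model(MODEL_PATH)
        return model.predict(block)

    # Hypothetical input: 10,000 rows of 20 features in chunks of 1,000 rows.
    X = da.from_array(np.random.rand(10_000, 20), chunks=(1_000, 20))

    # If the model output has a different width than the input, pass the
    # output block shape explicitly via map_blocks' chunks= argument.
    predictions = X.map_blocks(predict_block, dtype=float).compute()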

Dask apply with custom function

Submitted by 。_饼干妹妹 on 2020-06-07 07:21:46
Question: I am experimenting with Dask, but I encountered a problem when using apply after grouping. I have a Dask DataFrame with a large number of rows. Consider for example the following with N=10000:

    df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
    ddf = dd.from_pandas(df, npartitions=8)

I want to bin the values of col_1, and I follow the solution from here:

    bins = np.linspace(0,1,11)
    labels = list(range(len(bins)-1))
    ddf2 = ddf.map_partitions(test_f, 'col_1',bins
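
For context, a self-contained sketch of what such a test_f and the follow-up groupby-apply might look like; test_f here is a guess at the binning helper (the original is not shown), and the meta argument is what Dask typically needs for a custom apply.

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    N = 10_000
    df = pd.DataFrame({'col_1': np.random.random(N),
                       'col_2': np.random.random(N)})
    ddf = dd.from_pandas(df, npartitions=8)

    bins = np.linspace(0, 1, 11)
    labels = list(range(len(bins) - 1))

    def test_f(partition, col, bins, labels):
        # The bin edges are global constants, so each partition can be
        # binned independently.
        partition = partition.copy()
        partition['bin'] = pd.cut(partition[col], bins=bins, labels=labels)
        return partition

    ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)

    # Custom function applied after grouping; meta tells Dask the output schema.
    result = (ddf2.groupby('bin')
                  .col_2
                  .apply(lambda s: s.mean(), meta=('col_2', 'f8'))
                  .compute())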

Dask with HTCondor scheduler

Submitted by 落花浮王杯 on 2020-06-01 09:20:48
Question: Background: I have an image analysis pipeline with parallelised steps. The pipeline is in Python and the parallelisation is controlled by dask.distributed. The minimum processing setup has 1 scheduler + 3 workers with 15 processes each. In the first short step of the analysis I use one process per worker but all the RAM of the node; in all other analysis steps all nodes and processes are used. Issue: the admin will install HTCondor as a scheduler for the cluster. Thought: in order to have my
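
One way to run such a pipeline under HTCondor is dask-jobqueue's HTCondorCluster, which submits each Dask worker as an HTCondor job; the resource numbers below are placeholders, not values taken from the question.

    from dask_jobqueue import HTCondorCluster
    from dask.distributed import Client

    # One worker job with 15 processes (1 thread each); memory and disk
    # per job are placeholders to be tuned to the cluster's nodes.
    cluster = HTCondorCluster(cores=15, processes=15,
                              memory="120GB", disk="10GB")

    # Ask HTCondor for three such worker jobs, matching the
    # "1 scheduler + 3 workers with 15 processes each" setup.
    cluster.scale(jobs=3)

    client = Client(cluster)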

Seeing logs of dask workers

Submitted by 别来无恙 on 2020-05-30 10:13:17
Question: I'm having trouble changing the temporary directory in Dask. When I change temporary-directory in dask.yaml, for some reason Dask still writes to /tmp (which is full). I now want to try and debug this, but when I use client.get_worker_logs() I only get INFO output. I start my cluster with:

    from dask.distributed import LocalCluster, Client
    cluster = LocalCluster(n_workers=1, threads_per_worker=4, memory_limit='10gb')
    client = Client(cluster)

I already tried adding distributed.worker:
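
A sketch of setting both options programmatically, assuming the configuration must be in place before distributed is imported (its logging is set up at import time); the scratch path and log level are placeholders.

    import dask

    dask.config.set({
        "temporary_directory": "/scratch/dask",      # hypothetical scratch path
        "logging": {"distributed.worker": "debug"},  # more verbose worker logs
    })

    from dask.distributed import LocalCluster, Client

    cluster = LocalCluster(
        n_workers=1,
        threads_per_worker=4,
        memory_limit='10gb',
        local_directory="/scratch/dask",  # where this cluster's workers spill data
    )
    client = Client(cluster)

    # Worker logs at the configured level:
    print(client.get_worker_logs())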

How to enable proper work stealing in dask.distributed when using task restrictions / worker resources?

Submitted by 余生长醉 on 2020-05-30 04:08:42
Question: Context: I'm using dask.distributed to parallelise computations across machines. I therefore have dask-workers running on the different machines, which connect to a dask-scheduler to which I can then submit my custom graphs together with the required keys. Due to network mount restrictions, my input data (and output storage) is only available to a subset of the machines ('i/o-hosts'). I tried to deal with this in two ways: all tasks involved in i/o operations are restricted to i/o-hosts
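
As a sketch of the resource-based variant of that idea (the addresses, file names and functions below are hypothetical): tag only the i/o-hosts with a worker resource and attach that resource requirement only to the i/o tasks, leaving the compute tasks unrestricted so the scheduler can place and steal them anywhere.

    from dask.distributed import Client

    # Assumes the i/o-capable hosts start their workers with a resource tag:
    #   dask-worker tcp://scheduler:8786 --resources "IO=1"
    client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

    def load(path):
        ...   # placeholder: read input from the network mount

    def process(data):
        ...   # placeholder: CPU-heavy, location-independent work

    paths = ["input_0.dat", "input_1.dat"]  # placeholder inputs

    # Only the i/o tasks are pinned (via resources) to the i/o-hosts;
    # the compute tasks carry no restrictions.
    loaded = [client.submit(load, p, resources={"IO": 1}) for p in paths]
    results = [client.submit(process, fut) for fut in loaded]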

Dask worker seem die but cannot find the worker log to figure out why

Submitted by 霸气de小男生 on 2020-05-29 10:19:58
Question: I have a piece of Dask code that runs on my local machine and works 90% of the time, but sometimes gets stuck. By stuck I mean: no crash, no error printed, no CPU usage, and it never ends. I googled and think it may be because some worker died. It would be very useful if I could see the worker log and figure out why, but I cannot find my worker log. I went to edit config.yaml to add logging but still see nothing on stderr. Then I went to dashboard --> info --> logs and see a blank page. The code where it gets stuck is X_test = df_test.to
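
One way to peek inside a hung run, assuming the scheduler address is known (the address below is a placeholder): connect a second Client from another session and ask the scheduler which workers are still alive and what they last logged.

    from dask.distributed import Client

    client = Client("tcp://127.0.0.1:8786")  # hypothetical scheduler address

    # Which workers does the scheduler still believe are alive?
    print(list(client.scheduler_info()["workers"]))

    # Recent log lines from every surviving worker.
    for worker, lines in client.get_worker_logs().items():
        print(worker)
        for level, message in lines:
            print("  ", level, message)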

How to check if dask dataframe is empty

Submitted by 笑着哭i on 2020-05-29 02:46:33
Question: Is there a Dask equivalent of pandas' DataFrame.empty? I want to check if a Dask DataFrame is empty, but df.empty raises AttributeError: 'DataFrame' object has no attribute 'empty'.

Answer 1: Dask doesn't currently support this, but you can compute the length on the fly:

    len(df) == 0
    len(df.index) == 0  # Likely to be faster

Source: https://stackoverflow.com/questions/50206730/how-to-check-if-dask-dataframe-is-empty
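
A small runnable illustration of the answer, with a placeholder one-column frame:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": []}), npartitions=1)

    # Dask DataFrames have no .empty property; checking the length
    # triggers a (usually cheap) computation.
    print(len(ddf.index) == 0)  # True; only the index needs to be built
    print(len(ddf) == 0)        # True; equivalent but may do more work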