dask

Applying a function along an axis of a dask array

别来无恙 submitted on 2019-12-05 07:58:03
I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy). I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize=(6000, 1, 1, 1) so I have a separate chunk for each grid point. This is my function for getting the
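
A minimal sketch (not the asker's code) of one way to do this with dask.array.apply_along_axis, which applies a 1-D fitting function over the time axis; the random data, time vector and chunk sizes below are illustrative assumptions, and larger spatial chunks are used so the graph does not contain one task per grid point:

    import numpy as np
    import dask.array as da

    time = np.arange(6000, dtype=float)                  # hypothetical time coordinate
    dask_array = da.random.random((6000, 31, 189, 192),
                                  chunks=(6000, 1, 32, 32))  # full time axis in one chunk

    def fit_cubic(series):
        # series is a 1-D slice along the time axis at one grid point
        return np.polyfit(time, series, 3)               # 4 coefficients

    coeffs = da.apply_along_axis(fit_cubic, 0, dask_array,
                                 dtype=dask_array.dtype, shape=(4,))
    result = coeffs.compute()                            # shape (4, 31, 189, 192)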

Shutdown dask workers from client or scheduler

人盡茶涼 submitted on 2019-12-05 06:03:58
In the API, there is a way to restart all workers and to shut down the client completely, but I see no way to stop all workers while keeping the client unchanged. Is there a way to do this that I cannot find, or is it a feature that doesn't exist?

mdurant: This seems like a feature that does not exist, but it is nevertheless doable with the current code. You can use run_on_scheduler to interact with the methods of the scheduler itself:

    workers = list(c.scheduler_info()['workers'])
    c.run_on_scheduler(lambda dask_scheduler=None: dask_scheduler.retire_workers(workers, close_workers=True))

where c is
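
Assembled into a fuller sketch of the answer above (the scheduler address is a placeholder assumption):

    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")   # assumed address

    # Retire every worker the scheduler currently knows about,
    # leaving the client and the scheduler itself running.
    workers = list(client.scheduler_info()['workers'])
    client.run_on_scheduler(
        lambda dask_scheduler=None: dask_scheduler.retire_workers(
            workers, close_workers=True))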

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

不羁岁月 submitted on 2019-12-05 05:59:50
I would like to add a new column to an existing dask dataframe based on the values of two existing columns; it involves a conditional statement for checking nulls.

DataFrame definition:

    import pandas as pd
    import dask.dataframe as dd
    df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
    ddf = dd.from_pandas(df, npartitions=2)

Method 1 tried:

    def funcUpdate(row):
        if row['y'].isnull():
            return row['y']
        else:
            return round((1 + row['x'])/(1 + 1/row['y']), 4)

    ddf = ddf.assign(z=ddf.apply(funcUpdate, axis=1, meta=ddf))

It gives an error: TypeError: Column assignment doesn't
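
A hedged working sketch, assuming missing values are represented as NaN rather than empty strings and that meta should describe just the new column (one plausible fix, not necessarily the accepted answer):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [0.2, np.nan, 0.345, 0.40, 0.15]})
    ddf = dd.from_pandas(df, npartitions=2)

    def func_update(row):
        # Leave missing y values untouched, otherwise compute the rounded ratio.
        if pd.isnull(row['y']):
            return row['y']
        return round((1 + row['x']) / (1 + 1 / row['y']), 4)

    # meta describes the name and dtype of the resulting column, not the whole frame.
    ddf = ddf.assign(z=ddf.apply(func_update, axis=1, meta=('z', 'f8')))
    print(ddf.compute())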

how do we choose --nthreads and --nprocs per worker in dask distributed?

早过忘川 submitted on 2019-12-05 04:56:18
How do we choose --nthreads and --nprocs per worker in Dask distributed? I have 3 workers, with 4 cores and one thread per core on 2 of the workers, and 8 cores on 1 worker (according to the output of the 'lscpu' Linux command on each worker).

It depends on your workload. By default Dask creates a single process with as many threads as you have logical cores on your machine (as determined by multiprocessing.cpu_count()).

    dask-worker ... --nprocs 1 --nthreads 8  # assuming you have eight cores
    dask-worker ...                          # this is actually the default setting

Using few processes and many threads per process is good if
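
The same trade-off can be expressed on a single machine with the Python API; a small sketch (the numbers are illustrative, not a recommendation for the asker's cluster):

    from dask.distributed import Client, LocalCluster

    # Mostly numeric work that releases the GIL: few processes, many threads.
    cluster = LocalCluster(n_workers=1, threads_per_worker=8)

    # Mostly pure-Python work that holds the GIL: many processes, few threads.
    # cluster = LocalCluster(n_workers=8, threads_per_worker=1)

    client = Client(cluster)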

dask computation not executing in parallel

二次信任 submitted on 2019-12-05 03:45:54
I have a directory of JSON files that I am trying to convert to a dask DataFrame and save to Castra. There are 200 files containing O(10**7) JSON records between them. The code is very simple, largely following the tutorial examples:

    import dask.dataframe as dd
    import dask.bag as db
    import json
    txt = db.from_filenames('part-*.json')
    js = txt.map(json.loads)
    df = js.to_dataframe()
    cs = df.to_castra("data.castra")

I am running it on a 32-core machine, but the code only utilizes one core at 100%. My understanding from the docs is that this code should execute in parallel. Why is it not? Did I misunderstand
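
A hedged sketch of how to make the scheduler choice explicit (not necessarily the accepted answer to this question; the file pattern is the asker's, db.read_text is the newer spelling of from_filenames, and the Castra write is replaced by a plain compute for illustration):

    import json
    import dask
    import dask.bag as db

    txt = db.read_text('part-*.json')
    js = txt.map(json.loads)
    df = js.to_dataframe()

    # Explicitly run the work on the multiprocessing scheduler so that
    # partitions are processed in separate worker processes.
    with dask.config.set(scheduler='processes'):
        result = df.compute()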

How to use pandas.cut() (or equivalent) in dask efficiently?

十年热恋 submitted on 2019-12-05 02:40:15
Is there an equivalent to pandas.cut() in Dask? I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX and positionY and do binning in energy classes. So far I could do it with pandas, but I would like to run it in parallel, so I am trying to use dask. The groupby method works very well, but unfortunately I run into difficulties when trying to bin the data in energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (turning it essentially
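
A hedged sketch of one way to keep pd.cut lazy by applying it per partition with map_partitions (the toy data, bin edges and column names here are illustrative assumptions):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'positionX': np.random.randint(0, 10, 1000),
                       'positionY': np.random.randint(0, 10, 1000),
                       'energy':    np.random.uniform(0, 100, 1000)})
    ddf = dd.from_pandas(df, npartitions=4)

    bins = np.linspace(0, 100, 11)   # illustrative energy classes

    # pd.cut is applied lazily to each partition, so no compute() on the raw data.
    ddf['energy_bin'] = ddf['energy'].map_partitions(pd.cut, bins)

    counts = ddf.groupby(['positionX', 'positionY', 'energy_bin']).size().compute()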

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

眉间皱痕 submitted on 2019-12-05 01:28:03
I am struggling to convert a dask.bag of dictionaries into dask.delayed pandas.DataFrames and then into a final dask.dataframe. I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) to turn these dictionaries into a pandas.DataFrame (the resulting dataframe is around 100 MB for each file). I would like to append all dataframes into a single dask.dataframe for further analysis. Up to now I was using dask.delayed objects to load,
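
A hedged end-to-end sketch of the delayed route (make_dict and make_df below are toy stand-ins for the asker's functions, and the file list is hypothetical):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    @dask.delayed
    def make_dict(path):
        # stand-in: pretend to parse one file into a nested dictionary
        return {'file': path, 'values': [1, 2, 3]}

    @dask.delayed
    def make_df(d):
        # stand-in: flatten the dictionary into a pandas.DataFrame
        return pd.DataFrame({'file': d['file'], 'value': d['values']})

    paths = ['a.json', 'b.json']                     # illustrative file list
    delayed_frames = [make_df(make_dict(p)) for p in paths]

    # meta declares the column names and dtypes without computing anything.
    meta = pd.DataFrame({'file': pd.Series(dtype=str),
                         'value': pd.Series(dtype=int)})
    ddf = dd.from_delayed(delayed_frames, meta=meta)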

Default pip installation of Dask gives “ImportError: No module named toolz”

依然范特西╮ submitted on 2019-12-05 01:01:39
I installed Dask using pip like this: pip install dask, and when I try to do import dask.dataframe as dd I get the following error message:

    >>> import dask.dataframe as dd
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/path/to/venv/lib/python2.7/site-packages/dask/__init__.py", line 5, in <module>
        from .async import get_sync as get
      File "/path/to/venv/lib/python2.7/site-packages/dask/async.py", line 120, in <module>
        from toolz import identity
    ImportError: No module named toolz

I noticed that the documentation states pip install dask : Install
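
A brief note on the usual fix, as far as I know: the bare pip install dask installs only the core task scheduler, while the dataframe and array subpackages need the optional extras (which pull in toolz, pandas, partd, and friends):

    # Run one of these in a shell, not inside Python:
    #   pip install "dask[dataframe]"   # just what dask.dataframe needs
    #   pip install "dask[complete]"    # everything
    import dask.dataframe as dd        # should now import cleanly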

Replace a dask dataframe partition

吃可爱长大的小学妹 submitted on 2019-12-04 20:31:19
Can I replace a dask dataframe partition with another dask dataframe partition that I've created separately, of the same number of rows and the same structure? If yes, how? Is it possible with a different number of rows?

You can add partitions to the beginning or end of a Dask dataframe using the dd.concat function. You can insert a new partition anywhere in the dataframe by switching to delayed objects, inserting a delayed object into the list, and then switching back to a dask dataframe:

    list_of_delayed = dask_df.to_delayed()
    new_partition = dask.delayed(pd.read_csv)(filename)
    list_of_delayed[i] =
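
A runnable sketch that completes the snippet above (the small frame and the replacement partition built from pd.DataFrame are illustrative stand-ins; in the answer the replacement would come from dask.delayed(pd.read_csv)(filename)):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'a': range(8), 'b': range(8)})
    dask_df = dd.from_pandas(pdf, npartitions=4)

    # A delayed object producing the replacement partition.
    new_partition = dask.delayed(pd.DataFrame)({'a': [100, 101], 'b': [200, 201]})

    list_of_delayed = dask_df.to_delayed()
    list_of_delayed[2] = new_partition               # replace the third partition

    # Switch back to a dask dataframe; meta keeps the original schema
    # (note the divisions become unknown after this).
    dask_df = dd.from_delayed(list_of_delayed, meta=dask_df._meta)
    print(dask_df.compute())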

Dask DataFrame: Resample over groupby object with multiple rows

試著忘記壹切 submitted on 2019-12-04 18:09:15
I have the following dask dataframe created from Castra:

    import dask.dataframe as dd
    df = dd.from_castra('data.castra', columns=['user_id', 'ts', 'text'])

Yielding:

                         user_id                  ts text
    ts
    2015-08-08 01:10:00     9235 2015-08-08 01:10:00    a
    2015-08-08 02:20:00     2353 2015-08-08 02:20:00    b
    2015-08-08 02:20:00     9235 2015-08-08 02:20:00    c
    2015-08-08 04:10:00     9235 2015-08-08 04:10:00    d
    2015-08-08 08:10:00     2353 2015-08-08 08:10:00    e

What I'm trying to do is:
1. Group by user_id and ts
2. Resample it over a 3-hour period
3. In the resampling step, any merged rows should concatenate the texts

Example output: text user
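
A hedged sketch (toy data mirroring the rows shown above) of one way to express this with a groupby-apply that hands each user's rows to pandas, where resample is available; this is an illustration, not necessarily the most scalable approach:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame(
        {'user_id': [9235, 2353, 9235, 9235, 2353],
         'ts': pd.to_datetime(['2015-08-08 01:10:00', '2015-08-08 02:20:00',
                               '2015-08-08 02:20:00', '2015-08-08 04:10:00',
                               '2015-08-08 08:10:00']),
         'text': list('abcde')})
    ddf = dd.from_pandas(pdf, npartitions=2)

    def resample_group(g):
        # Within one user's rows, bin into 3-hour windows and join the texts.
        return g.set_index('ts')['text'].resample('3h').agg(''.join)

    result = (ddf.groupby('user_id')
                 .apply(resample_group, meta=('text', 'object'))
                 .compute())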