dask

Applying a function along an axis of a dask array

别来无恙 submitted on 2019-12-05 07:58:03
I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays using numpy). I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize=(6000, 1, 1, 1) so I have a separate chunk for each grid point. This is my function for getting the
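
A minimal sketch (not the asker's code) of one way to do this with dask.array.apply_along_axis, which applies a 1-D fitting function over the time axis; the random data, time vector and chunk sizes below are illustrative assumptions, and larger spatial chunks are used so the graph does not contain one task per grid point:

    import numpy as np
    import dask.array as da

    time = np.arange(6000, dtype=float)                  # hypothetical time coordinate
    dask_array = da.random.random((6000, 31, 189, 192),
                                  chunks=(6000, 1, 32, 32))  # full time axis in one chunk

    def fit_cubic(series):
        # series is a 1-D slice along the time axis at one grid point
        return np.polyfit(time, series, 3)               # 4 coefficients

    coeffs = da.apply_along_axis(fit_cubic, 0, dask_array,
                                 dtype=dask_array.dtype, shape=(4,))
    result = coeffs.compute()                            # shape (4, 31, 189, 192)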

Shutdown dask workers from client or scheduler

人盡茶涼 submitted on 2019-12-05 06:03:58
In the API, there is a way to restart all workers and to shut down the client completely, but I see no way to stop all workers while keeping the client unchanged. Is there a way to do this that I cannot find, or is it a feature that doesn't exist?

mdurant: This seems like a feature that does not exist, but it is nevertheless doable with the current code. You can use run_on_scheduler to interact with the methods of the scheduler itself:

    workers = list(c.scheduler_info()['workers'])
    c.run_on_scheduler(lambda dask_scheduler=None: dask_scheduler.retire_workers(workers, close_workers=True))

where c is
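
Assembled into a fuller sketch of the answer above (the scheduler address is a placeholder assumption):

    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")   # assumed address

    # Retire every worker the scheduler currently knows about,
    # leaving the client and the scheduler itself running.
    workers = list(client.scheduler_info()['workers'])
    client.run_on_scheduler(
        lambda dask_scheduler=None: dask_scheduler.retire_workers(
            workers, close_workers=True))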

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

不羁岁月 submitted on 2019-12-05 05:59:50
I would like to add a new column to an existing dask dataframe based on the values of two existing columns; it involves a conditional statement for checking nulls.

DataFrame definition:

    import pandas as pd
    import dask.dataframe as dd
    df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
    ddf = dd.from_pandas(df, npartitions=2)

Method 1 tried:

    def funcUpdate(row):
        if row['y'].isnull():
            return row['y']
        else:
            return round((1 + row['x'])/(1 + 1/row['y']), 4)

    ddf = ddf.assign(z=ddf.apply(funcUpdate, axis=1, meta=ddf))

It gives an error: TypeError: Column assignment doesn't
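
A hedged working sketch, assuming missing values are represented as NaN rather than empty strings and that meta should describe just the new column (one plausible fix, not necessarily the accepted answer):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [0.2, np.nan, 0.345, 0.40, 0.15]})
    ddf = dd.from_pandas(df, npartitions=2)

    def func_update(row):
        # Leave missing y values untouched, otherwise compute the rounded ratio.
        if pd.isnull(row['y']):
            return row['y']
        return round((1 + row['x']) / (1 + 1 / row['y']), 4)

    # meta describes the name and dtype of the resulting column, not the whole frame.
    ddf = ddf.assign(z=ddf.apply(func_update, axis=1, meta=('z', 'f8')))
    print(ddf.compute())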

how do we choose --nthreads and --nprocs per worker in dask distributed?

早过忘川 submitted on 2019-12-05 04:56:18
How do we choose --nthreads and --nprocs per worker in Dask distributed? I have 3 workers, with 4 cores and one thread per core on 2 of the workers, and 8 cores on 1 worker (according to the output of the 'lscpu' Linux command on each worker).

It depends on your workload. By default Dask creates a single process with as many threads as you have logical cores on your machine (as determined by multiprocessing.cpu_count()).

    dask-worker ... --nprocs 1 --nthreads 8  # assuming you have eight cores
    dask-worker ...                          # this is actually the default setting

Using few processes and many threads per process is good if
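
The same trade-off can be expressed on a single machine with the Python API; a small sketch (the numbers are illustrative, not a recommendation for the asker's cluster):

    from dask.distributed import Client, LocalCluster

    # Mostly numeric work that releases the GIL: few processes, many threads.
    cluster = LocalCluster(n_workers=1, threads_per_worker=8)

    # Mostly pure-Python work that holds the GIL: many processes, few threads.
    # cluster = LocalCluster(n_workers=8, threads_per_worker=1)

    client = Client(cluster)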

dask computation not executing in parallel

二次信任 submitted on 2019-12-05 03:45:54
I have a directory of JSON files that I am trying to convert to a dask DataFrame and save to Castra. There are 200 files containing O(10**7) JSON records between them. The code is very simple, largely following the tutorial examples:

    import dask.dataframe as dd
    import dask.bag as db
    import json
    txt = db.from_filenames('part-*.json')
    js = txt.map(json.loads)
    df = js.to_dataframe()
    cs = df.to_castra("data.castra")

I am running it on a 32-core machine, but the code only utilizes one core at 100%. My understanding from the docs is that this code should execute in parallel. Why is it not? Did I misunderstand
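
A hedged sketch of how to make the scheduler choice explicit (not necessarily the accepted answer to this question; the file pattern is the asker's, db.read_text is the newer spelling of from_filenames, and the Castra write is replaced by a plain compute for illustration):

    import json
    import dask
    import dask.bag as db

    txt = db.read_text('part-*.json')
    js = txt.map(json.loads)
    df = js.to_dataframe()

    # Explicitly run the work on the multiprocessing scheduler so that
    # partitions are processed in separate worker processes.
    with dask.config.set(scheduler='processes'):
        result = df.compute()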

How to use pandas.cut() (or equivalent) in dask efficiently?

十年热恋 submitted on 2019-12-05 02:40:15
Is there an equivalent to pandas.cut() in Dask? I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX and positionY and do binning in energy classes. So far I could do it with pandas, but I would like to run it in parallel, so I am trying to use dask. The groupby method works very well, but unfortunately I run into difficulties when trying to bin the data in energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (turning it essentially
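
A hedged sketch of one way to keep pd.cut lazy by applying it per partition with map_partitions (the toy data, bin edges and column names here are illustrative assumptions):

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'positionX': np.random.randint(0, 10, 1000),
                       'positionY': np.random.randint(0, 10, 1000),
                       'energy':    np.random.uniform(0, 100, 1000)})
    ddf = dd.from_pandas(df, npartitions=4)

    bins = np.linspace(0, 100, 11)   # illustrative energy classes

    # pd.cut is applied lazily to each partition, so no compute() on the raw data.
    ddf['energy_bin'] = ddf['energy'].map_partitions(pd.cut, bins)

    counts = ddf.groupby(['positionX', 'positionY', 'energy_bin']).size().compute()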

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

眉间皱痕 submitted on 2019-12-05 01:28:03
I am struggling to convert a dask.bag of dictionaries into dask.delayed pandas.DataFrames and then into a final dask.dataframe. I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df) to turn these dictionaries into a pandas.DataFrame (the resulting dataframe is around 100 MB for each file). I would like to append all dataframes into a single dask.dataframe for further analysis. Up to now I was using dask.delayed objects to load,
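
A hedged end-to-end sketch of the delayed route (make_dict and make_df below are toy stand-ins for the asker's functions, and the file list is hypothetical):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    @dask.delayed
    def make_dict(path):
        # stand-in: pretend to parse one file into a nested dictionary
        return {'file': path, 'values': [1, 2, 3]}

    @dask.delayed
    def make_df(d):
        # stand-in: flatten the dictionary into a pandas.DataFrame
        return pd.DataFrame({'file': d['file'], 'value': d['values']})

    paths = ['a.json', 'b.json']                     # illustrative file list
    delayed_frames = [make_df(make_dict(p)) for p in paths]

    # meta declares the column names and dtypes without computing anything.
    meta = pd.DataFrame({'file': pd.Series(dtype=str),
                         'value': pd.Series(dtype=int)})
    ddf = dd.from_delayed(delayed_frames, meta=meta)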

Default pip installation of Dask gives “ImportError: No module named toolz”

依然范特西╮ submitted on 2019-12-05 01:01:39
I installed Dask using pip like this: pip install dask, and when I try to do import dask.dataframe as dd I get the following error message:

    >>> import dask.dataframe as dd
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/path/to/venv/lib/python2.7/site-packages/dask/__init__.py", line 5, in <module>
        from .async import get_sync as get
      File "/path/to/venv/lib/python2.7/site-packages/dask/async.py", line 120, in <module>
        from toolz import identity
    ImportError: No module named toolz

I noticed that the documentation states pip install dask : Install
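
A brief note on the usual fix, as far as I know: the bare pip install dask installs only the core task scheduler, while the dataframe and array subpackages need the optional extras (which pull in toolz, pandas, partd, and friends):

    # Run one of these in a shell, not inside Python:
    #   pip install "dask[dataframe]"   # just what dask.dataframe needs
    #   pip install "dask[complete]"    # everything
    import dask.dataframe as dd        # should now import cleanly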

Replace a dask dataframe partition

吃可爱长大的小学妹 submitted on 2019-12-04 20:31:19
Can I replace a dask dataframe partition with another dask dataframe partition that I've created separately, of the same number of rows and the same structure? If yes, how? Is it possible with a different number of rows?

You can add partitions to the beginning or end of a Dask dataframe using the dd.concat function. You can insert a new partition anywhere in the dataframe by switching to delayed objects, inserting a delayed object into the list, and then switching back to a dask dataframe:

    list_of_delayed = dask_df.to_delayed()
    new_partition = dask.delayed(pd.read_csv)(filename)
    list_of_delayed[i] =
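
A runnable sketch that completes the snippet above (the small frame and the replacement partition built from pd.DataFrame are illustrative stand-ins; in the answer the replacement would come from dask.delayed(pd.read_csv)(filename)):

    import dask
    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'a': range(8), 'b': range(8)})
    dask_df = dd.from_pandas(pdf, npartitions=4)

    # A delayed object producing the replacement partition.
    new_partition = dask.delayed(pd.DataFrame)({'a': [100, 101], 'b': [200, 201]})

    list_of_delayed = dask_df.to_delayed()
    list_of_delayed[2] = new_partition               # replace the third partition

    # Switch back to a dask dataframe; meta keeps the original schema
    # (note the divisions become unknown after this).
    dask_df = dd.from_delayed(list_of_delayed, meta=dask_df._meta)
    print(dask_df.compute())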

Dask DataFrame: Resample over groupby object with multiple rows

試著忘記壹切 submitted on 2019-12-04 18:09:15
I have the following dask dataframe created from Castra:

    import dask.dataframe as dd
    df = dd.from_castra('data.castra', columns=['user_id', 'ts', 'text'])

Yielding:

                         user_id                  ts text
    ts
    2015-08-08 01:10:00     9235 2015-08-08 01:10:00    a
    2015-08-08 02:20:00     2353 2015-08-08 02:20:00    b
    2015-08-08 02:20:00     9235 2015-08-08 02:20:00    c
    2015-08-08 04:10:00     9235 2015-08-08 04:10:00    d
    2015-08-08 08:10:00     2353 2015-08-08 08:10:00    e

What I'm trying to do is:
1. Group by user_id and ts
2. Resample it over a 3-hour period
3. In the resampling step, any merged rows should concatenate the texts

Example output: text user
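
A hedged sketch (toy data mirroring the rows shown above) of one way to express this with a groupby-apply that hands each user's rows to pandas, where resample is available; this is an illustration, not necessarily the most scalable approach:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame(
        {'user_id': [9235, 2353, 9235, 9235, 2353],
         'ts': pd.to_datetime(['2015-08-08 01:10:00', '2015-08-08 02:20:00',
                               '2015-08-08 02:20:00', '2015-08-08 04:10:00',
                               '2015-08-08 08:10:00']),
         'text': list('abcde')})
    ddf = dd.from_pandas(pdf, npartitions=2)

    def resample_group(g):
        # Within one user's rows, bin into 3-hour windows and join the texts.
        return g.set_index('ts')['text'].resample('3h').agg(''.join)

    result = (ddf.groupby('user_id')
                 .apply(resample_group, meta=('text', 'object'))
                 .compute())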