dask

Managing worker memory on a dask localcluster

Submitted by 若如初见 on 2019-12-07 06:25:55
Question: I am trying to load a dataset with dask, but when it is time to compute it I keep getting problems like this: WARNING - Worker exceeded 95% memory budget. Restarting. I am just working on my local machine, initiating dask as follows:

    if __name__ == '__main__':
        libmarket.config.client = Client()  # use dask.distributed by default

Now in my error messages I keep seeing a reference to a 'memory_limit=' keyword parameter. However, I've searched the dask documentation thoroughly and I can't …
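
The memory_limit= keyword the warning refers to can be passed when the local cluster is created. A minimal sketch, assuming dask.distributed is installed; the worker count and the '4GB' figure are illustrative values, not from the question, and memory_limit is a per-worker limit:

    from dask.distributed import Client, LocalCluster

    if __name__ == '__main__':
        # memory_limit is accepted by LocalCluster (Client() also forwards it
        # when it creates a LocalCluster implicitly); it applies per worker.
        cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit='4GB')
        client = Client(cluster)   # use dask.distributed by default
        print(client)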

What is the equivalent to iloc for dask dataframe?

Submitted by 送分小仙女 on 2019-12-07 06:19:54
Question: I have a situation where I need to index a dask dataframe by location. I see that there is no .iloc method available. Is there an alternative, or am I required to use label-based indexing? For example, I would like to do

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    df = dd.from_pandas(pd.DataFrame({k: np.random.random(10) for k in ['a', 'b']}), npartitions=2)
    inds = [1, 4, 6, 8]
    df.iloc[inds]

Is this not possible with dask? (e.g., perhaps a positional location is not …
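
Row-wise .iloc is not implemented for dask DataFrames, because a partition does not know its global position. A possible workaround when the frame was built from a default RangeIndex (as in the example above) is label-based .loc, since the labels then coincide with the positions; a sketch, assuming from_pandas kept the divisions known:

    import dask.dataframe as dd
    import numpy as np
    import pandas as pd

    df = dd.from_pandas(
        pd.DataFrame({k: np.random.random(10) for k in ['a', 'b']}),
        npartitions=2,
    )
    inds = [1, 4, 6, 8]

    # With a default RangeIndex, positional and label indexing coincide, so
    # .loc with a list of labels stands in for the missing row-wise .iloc.
    subset = df.loc[inds].compute()
    print(subset)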

How to rename the index of a Dask Dataframe

Submitted by 爱⌒轻易说出口 on 2019-12-07 03:53:59
Question: How would I go about renaming the index on a dask dataframe? I tried df.index.name = 'foo', but rechecking df.index.name shows it still being whatever it was previously. Answer 1: This does not seem like an efficient way to do it, so I wouldn't be surprised if there is something more direct. d.index.name starts off as 'foo';

    def f(df, name):
        df.index.name = name
        return df

    d.map_partitions(f, 'pow')

The output now has an index name of 'pow'. If this is done with the threaded scheduler, I …
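
A variant of the same map_partitions idea that avoids mutating partitions in place (the concern the answer starts to raise about the threaded scheduler) is to have the mapped function return a renamed copy via pandas' rename_axis; a sketch with made-up data:

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'a': range(6)})
    pdf.index.name = 'foo'
    d = dd.from_pandas(pdf, npartitions=2)

    def rename_index(df, name):
        # rename_axis returns a copy, so no partition is mutated in place.
        return df.rename_axis(name)

    d = d.map_partitions(rename_index, 'pow')
    print(d.compute().index.name)   # 'pow'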

Applying a function along an axis of a dask array

Submitted by 我怕爱的太早我们不能终老 on 2019-12-07 03:29:07
Question: I'm analyzing ocean temperature data from a climate model simulation where the 4D data arrays (time, depth, latitude, longitude; denoted dask_array below) typically have a shape of (6000, 31, 189, 192) and a size of ~25 GB (hence my desire to use dask; I've been getting memory errors trying to process these arrays with numpy). I need to fit a cubic polynomial along the time axis at each level / latitude / longitude point and store the resulting 4 coefficients. I've therefore set chunksize= …
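
One way to express "fit a cubic along the time axis at every point" is dask.array's apply_along_axis with numpy's polyfit. A sketch under stated assumptions: the array below is a small random stand-in for the real (6000, 31, 189, 192) data, and the chunking keeps the full time axis in a single chunk, which the fit requires:

    import numpy as np
    import dask.array as da

    # Small stand-in for the real (time, depth, lat, lon) ocean-temperature array.
    dask_array = da.random.random((600, 4, 8, 8), chunks=(600, 1, 8, 8))
    time = np.arange(dask_array.shape[0])

    def fit_cubic(series):
        # Returns the 4 coefficients of a cubic polynomial fit along one time series.
        return np.polyfit(time, series, deg=3)

    # apply_along_axis maps fit_cubic over every depth/lat/lon point; shape and
    # dtype describe the per-point output so dask can build the graph lazily.
    coeffs = da.apply_along_axis(fit_cubic, 0, dask_array, dtype='float64', shape=(4,))
    print(coeffs.shape)   # (4, 4, 8, 8)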

How can I select data from a dask dataframe by a list of indices?

Submitted by 荒凉一梦 on 2019-12-07 03:22:44
Question: Let's say I have the following dask dataframe:

    dict_ = {'A': [1, 2, 3, 4, 5, 6, 7],
             'B': [2, 3, 4, 5, 6, 7, 8],
             'index': ['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
    pdf = pd.DataFrame(dict_)
    pdf = pdf.set_index('index')
    ddf = dask.dataframe.from_pandas(pdf, npartitions=2)

Furthermore, I have a list of indices that I am interested in, e.g.

    indices_i_want_to_select = ['x1', 'x3', 'y6']

How can I generate a new dask dataframe that contains only the rows specified by these indices? Is there a reason why …
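
A sketch of one way to do this, assuming the divisions are known (from_pandas sorts the index by default, so they are here): .loc accepts a list of labels and only touches the partitions that can contain them.

    import pandas as pd
    import dask.dataframe as dd

    dict_ = {'A': [1, 2, 3, 4, 5, 6, 7],
             'B': [2, 3, 4, 5, 6, 7, 8],
             'index': ['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
    pdf = pd.DataFrame(dict_).set_index('index')
    ddf = dd.from_pandas(pdf, npartitions=2)

    indices_i_want_to_select = ['x1', 'x3', 'y6']

    # Label-based selection with a list; with unknown divisions an isin()-based
    # filter (ddf[ddf.index.isin(indices_i_want_to_select)]) is a fallback.
    selected = ddf.loc[indices_i_want_to_select].compute()
    print(selected)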

Shutdown dask workers from client or scheduler

Submitted by 佐手、 on 2019-12-07 02:09:07
Question: In the API there is a way to restart all workers and to shut down the client completely, but I see no way to stop all workers while keeping the client unchanged. Is there a way to do this that I cannot find, or is it a feature that doesn't exist? Answer 1: This seems like a feature that does not exist, but is nevertheless doable using the current code. You can use run_on_scheduler to interact with the methods of the scheduler itself.

    workers = list(c.scheduler_info()['workers'])
    c.run_on_scheduler …
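
A sketch of the same idea using the client-side convenience method instead of reaching into the scheduler directly: Client.retire_workers asks the scheduler to shut the listed workers down while the client connection stays up. The local Client() here is only illustrative.

    from dask.distributed import Client

    client = Client()   # illustrative local client; any connected client works

    workers = list(client.scheduler_info()['workers'])

    # Retire (shut down) every current worker; the scheduler and client keep running.
    client.retire_workers(workers, close_workers=True)
    print(client.scheduler_info()['workers'])   # now empty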

Create sql table from dask dataframe using map_partitions and pd.df.to_sql

Submitted by 冷暖自知 on 2019-12-07 01:07:34
Question: Dask doesn't have a df.to_sql() like pandas, so I am trying to replicate the functionality and create an SQL table using the map_partitions method. Here is my code:

    import dask.dataframe as dd
    import pandas as pd
    import sqlalchemy as sqla
    import sqlalchemy_utils as sqla_utils

    db_url = 'my_db_url_connection'
    conn = sqla.create_engine(db_url)

    ddf = dd.read_csv('data/prod.csv')
    meta = dict(ddf.dtypes)
    ddf.map_partitions(lambda df: df.to_sql('table_name', db_url, if_exists='append', index=True), ddf, meta=meta) …
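
A sketch of a map_partitions pattern that tends to work for this, assuming a SQLAlchemy-compatible connection string (the sqlite URL and table name below are placeholders): the engine is created inside each task (engines don't serialize), each partition writes itself with pandas' to_sql, and the function returns a tiny frame so meta is easy to describe.

    import dask.dataframe as dd
    import pandas as pd
    from sqlalchemy import create_engine

    db_url = 'sqlite:///example.db'        # illustrative connection string
    ddf = dd.read_csv('data/prod.csv')     # path taken from the question

    def write_partition(df):
        # Create the engine inside the task: engines are not picklable, so they
        # cannot be shipped from the client to the workers.
        engine = create_engine(db_url)
        df.to_sql('table_name', engine, if_exists='append', index=True)
        return pd.DataFrame({'rows_written': [len(df)]})

    meta = pd.DataFrame({'rows_written': pd.Series([], dtype='int64')})
    # .compute() forces the writes; concurrent appends may still race on table
    # creation, so creating the table up front is safer.
    ddf.map_partitions(write_partition, meta=meta).compute()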

How to use pandas.cut() (or equivalent) in dask efficiently?

Submitted by 我的未来我决定 on 2019-12-06 23:10:06
Question: Is there an equivalent to pandas.cut() in Dask? I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it along positionX and positionY and do binning in energy classes. So far I could do it with pandas, but I would like to run it in parallel, so I am trying to use dask. The groupby method works very well, but unfortunately I run into difficulties when trying to bin the data in energy. I found a …
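
pandas.cut itself can be applied per partition, because the bin edges are fixed up front and therefore identical in every partition. A sketch with made-up electron data and illustrative energy bins:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    # Made-up electron events standing in for (positionX, positionY, energy, time).
    pdf = pd.DataFrame({
        'positionX': np.random.randint(0, 10, 1000),
        'positionY': np.random.randint(0, 10, 1000),
        'energy': np.random.random(1000) * 100.0,
    })
    ddf = dd.from_pandas(pdf, npartitions=4)

    energy_bins = [0, 25, 50, 75, 100]   # illustrative bin edges

    # pd.cut runs on each partition; fixed edges keep the categories consistent.
    ddf['energy_class'] = ddf['energy'].map_partitions(pd.cut, bins=energy_bins)

    counts = ddf.groupby(['positionX', 'positionY', 'energy_class']).size().compute()
    print(counts.head())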

Passing a Paramiko connection SFTPFile as input to a dask.dataframe.read_parquet

Submitted by 你说的曾经没有我的故事 on 2019-12-06 22:10:39
I tried to pass a paramiko.sftp_file.SFTPFile object instead of a file URL to pandas.read_parquet and it worked fine. But when I tried the same with Dask, it threw an error. Below is the code I tried to run and the error I get. How can I make this work?

    import dask.dataframe as dd
    import paramiko

    ssh = paramiko.SSHClient()
    sftp_client = ssh.open_sftp()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

    source_file = sftp_client.open(str(parquet_file), 'rb')
    full_df = dd.read_parquet(source_file, engine='pyarrow')
    print(len(full_df))

Traceback (most recent call last): File "C:\Users\rrrrr\Documents …
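
dd.read_parquet wants something it can (re)open lazily on whichever worker runs the task, rather than an already-open SFTPFile handle, so one hedged alternative is to let fsspec's SFTP backend (which itself uses paramiko) open the remote file. The host, path and credentials below are placeholders, and the fsspec sftp backend must be installed on the client and the workers:

    import dask.dataframe as dd

    # Placeholder host, path and credentials.
    full_df = dd.read_parquet(
        'sftp://example-host/remote/path/file.parquet',
        engine='pyarrow',
        storage_options={'username': 'user', 'password': 'secret'},
    )
    print(len(full_df))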

File Not Found Error in Dask program run on cluster

Submitted by 蹲街弑〆低调 on 2019-12-06 15:12:20
I have 4 machines: M1, M2, M3, and M4. The scheduler, client, and a worker run on M1, and I've put a CSV file on M1; the rest of the machines are workers. When the program reaches read_csv, it gives me a "file not found" error. Answer: When one of your workers tries to load the CSV, it will not be able to find it, because it is not present on that local disc. This should not be a surprise. You can get around this in a number of ways: copy the file to every worker (obviously wasteful in terms of disc space, but the easiest to achieve); place the file on a networked filesystem (NFS mount, gluster, …
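
Another option, if the file only exists on M1, is to load it there and hand the data to the cluster, so no worker ever needs the local path. A sketch with a placeholder scheduler address and file name:

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client('tcp://M1-address:8786')   # placeholder scheduler address

    # Read the CSV on the machine that actually has it (M1), then convert and
    # persist so the partitions live on the workers from here on.
    pdf = pd.read_csv('mydata.csv')            # placeholder local path on M1
    ddf = dd.from_pandas(pdf, npartitions=8)
    ddf = client.persist(ddf)
    print(ddf.npartitions)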