dask

Workaround for Item assignment not supported in dask

谁都会走 submitted on 2019-12-02 10:35:14
I am trying to convert my code base from NumPy arrays to Dask because my NumPy arrays are exceeding available memory (MemoryError). But I have learned that mutable (item-assignable) arrays are not yet implemented in Dask, so I am getting NotImplementedError: Item assignment with <class 'tuple'> not supported. Is there any workaround for my code below?

for i, mask in enumerate(masks):
    bounds = find_boundaries(mask, mode='inner')
    X2, Y2 = np.nonzero(bounds)
    X2 = da.from_array(X2, 'auto')
    Y2 = da.from_array(Y2, 'auto')
    xSum = (X2.reshape(-1, 1) - X1.reshape(1, -1)) ** 2
    ySum = (Y2.reshape(-1, 1) - Y1 ...
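
A minimal sketch of one possible workaround, assuming the goal is a pairwise squared-distance computation followed by a conditional update: build the result with whole-array operations such as broadcasting and da.where, which Dask supports, instead of assigning into array elements. The array sizes and the threshold below are made up for illustration.

import numpy as np
import dask.array as da

# Stand-in coordinate arrays; in the question these come from find_boundaries / np.nonzero.
X1 = da.from_array(np.random.randint(0, 100, 1000), chunks='auto')
Y1 = da.from_array(np.random.randint(0, 100, 1000), chunks='auto')
X2 = da.from_array(np.random.randint(0, 100, 500), chunks='auto')
Y2 = da.from_array(np.random.randint(0, 100, 500), chunks='auto')

# Pairwise squared distances via broadcasting -- no item assignment needed.
dist2 = (X2.reshape(-1, 1) - X1.reshape(1, -1)) ** 2 \
      + (Y2.reshape(-1, 1) - Y1.reshape(1, -1)) ** 2

# Where plain NumPy code would do dist2[cond] = value, use da.where instead.
capped = da.where(dist2 > 10_000, 10_000, dist2)
nearest = capped.min(axis=1).compute()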

Dask read_sql_table errors out when using an SQLAlchemy expression

为君一笑 submitted on 2019-12-02 07:38:50
I'm trying to use an SQLAlchemy expression with Dask's read_sql_table in order to bring down a dataset that is created by joining and filtering a few different tables. The documentation indicates that this should be possible. (The example below does not include any joins, as they are not needed to reproduce the problem.) I build my connection string, create an SQLAlchemy engine, and a table corresponding to a table in my database. (I'm using PostgreSQL.)

import dask.dataframe as dd
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import Column, MetaData, Table
from ...
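
A hedged sketch of the general approach, assuming recent Dask and SQLAlchemy versions; the table name 'orders', its columns, and the connection string are placeholders, not from the question. The key points are to label every selected column explicitly and to include the index column in the selection. The entry point depends on the Dask version: older releases accepted an expression in read_sql_table, while newer ones provide read_sql_query for SQLAlchemy selectables.

import dask.dataframe as dd
import sqlalchemy as sa

uri = 'postgresql://user:password@localhost:5432/mydb'   # placeholder connection string
engine = sa.create_engine(uri)
metadata = sa.MetaData()
orders = sa.Table('orders', metadata, autoload_with=engine)

# Label each column and make sure the index column is part of the select.
query = sa.select(
    orders.c.id.label('id'),
    orders.c.amount.label('amount'),
).where(orders.c.amount > 0)

# Newer Dask versions: read_sql_query accepts an SQLAlchemy Select directly.
ddf = dd.read_sql_query(query, uri, index_col='id', npartitions=8)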

How to do row processing and item assignment in Dask

蓝咒 submitted on 2019-12-02 03:11:11
Similar unanswered question: Row by row processing of a Dask DataFrame. I'm working with dataframes that are millions of rows long, so now I'm trying to have all dataframe operations performed in parallel. One such operation I need converted to Dask is:

for row in df.itertuples():
    ratio = row.ratio
    tmpratio = row.tmpratio
    tmplabel = row.tmplabel
    if tmpratio > ratio:
        df.loc[row.Index, 'ratio'] = tmpratio
        df.loc[row.Index, 'label'] = tmplabel

What is the appropriate way to set a value by index in Dask, or conditionally set values in rows? Given that .loc doesn't support item assignment in Dask, ...
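
A minimal sketch of one way to express this without item assignment (the small example frame is made up): replace the row loop with a vectorized, conditional column update using Series.where, which Dask DataFrames support.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'ratio': [0.2, 0.9], 'tmpratio': [0.5, 0.1],
                    'label': ['a', 'b'], 'tmplabel': ['x', 'y']})
ddf = dd.from_pandas(pdf, npartitions=2)

# Rows where the temporary ratio wins.
cond = ddf['tmpratio'] > ddf['ratio']

# Keep the old value where the condition is False, take the new one otherwise.
ddf['label'] = ddf['label'].where(~cond, ddf['tmplabel'])
ddf['ratio'] = ddf['ratio'].where(~cond, ddf['tmpratio'])

print(ddf.compute())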

What threads do Dask Workers have active?

丶灬走出姿态 submitted on 2019-12-01 21:59:40
When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running from a Dask worker, and what are they doing?

Dask workers have the following threads:

- A pool of threads in which to run tasks. This is typically somewhere between 1 and the number of logical cores on the computer.
- One administrative thread to manage the event loop, communication over (non-blocking) sockets, responding to fast queries, the allocation of tasks onto worker threads, etc.
- A couple of threads that are used for optional compression and (de ...
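
A small sketch (cluster settings are made up) that starts a local worker and lists its live threads, which makes the task pool plus the administrative and communication threads described above visible:

import threading
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(n_workers=1, threads_per_worker=4)
    client = Client(cluster)

    def list_threads():
        # Runs on the worker and returns the names of all live threads there.
        return [t.name for t in threading.enumerate()]

    print(client.run(list_threads))
    client.close()
    cluster.close()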

How to map a column with dask

旧城冷巷雨未停 submitted on 2019-12-01 20:03:35
I want to apply a mapping to a DataFrame column. With Pandas this is straightforward:

df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))

This writes the infos column based on the custom_map function, using the rows in numbers for the lambda statement. With Dask this isn't that simple. ddf is a Dask DataFrame. map_partitions is the equivalent, executing the mapping in parallel on a part of the DataFrame. This does not work, because you don't define columns like that in Dask:

ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))

Does anyone ...
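
A minimal sketch of two possible approaches (custom_map and hashmap are stand-ins mirroring the question; the example data is made up): a Dask Series has an element-wise .map, and map_partitions hands the function a whole pandas Series per partition rather than single values.

import pandas as pd
import dask.dataframe as dd

hashmap = {0: 'zero', 1: 'one', 2: 'two'}

def custom_map(nr, table):
    return table.get(nr, 'unknown')

pdf = pd.DataFrame({'numbers': [0, 1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Option 1: element-wise map on the Dask Series, mirroring the pandas code.
ddf['infos'] = ddf['numbers'].map(lambda nr: custom_map(nr, hashmap),
                                  meta=('numbers', object))

# Option 2: map_partitions receives a pandas Series per partition, so map inside it.
# ddf['infos'] = ddf['numbers'].map_partitions(
#     lambda s: s.map(lambda nr: custom_map(nr, hashmap)), meta=('numbers', object))

print(ddf.compute())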

Aggregate a Dask dataframe and produce a dataframe of aggregates

空扰寡人 submitted on 2019-12-01 18:40:38
I have a Dask dataframe that looks like this:

url   referrer  session_id  ts                   customer
url1  ref1      xxx         2017-09-15 00:00:00  a.com
url2  ref2      yyy         2017-09-15 00:00:00  a.com
url2  ref3      yyy         2017-09-15 00:00:00  a.com
url1  ref1      xxx         2017-09-15 01:00:00  a.com
url2  ref2      yyy         2017-09-15 01:00:00  a.com

I want to group the data on url and timestamp, aggregate column values, and produce a dataframe that would look like this instead:

customer  url   ts                   page_views  visitors  referrers
a.com     url1  2017-09-15 00:00:00  1           1         [ref1]
a.com     url2  2017-09-15 00:00:00  2           2         [ref2, ref3]

In Spark SQL, I can do this as follows:

select ...
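
A minimal sketch of one way to do this in Dask (column names follow the example above; the groupby().apply() route triggers a shuffle, and the meta= specification may need adjusting): count rows per group, count distinct sessions, and collect the distinct referrers into a list.

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'url': ['url1', 'url2', 'url2', 'url1', 'url2'],
    'referrer': ['ref1', 'ref2', 'ref3', 'ref1', 'ref2'],
    'session_id': ['xxx', 'yyy', 'yyy', 'xxx', 'yyy'],
    'ts': pd.to_datetime(['2017-09-15 00:00:00'] * 3 + ['2017-09-15 01:00:00'] * 2),
    'customer': ['a.com'] * 5,
})
ddf = dd.from_pandas(pdf, npartitions=2)

def summarize(g):
    # g is the pandas DataFrame for one (customer, url, ts) group.
    return pd.Series({
        'page_views': len(g),
        'visitors': g['session_id'].nunique(),
        'referrers': list(g['referrer'].unique()),
    })

result = (ddf.groupby(['customer', 'url', 'ts'])
             .apply(summarize, meta={'page_views': 'int64',
                                     'visitors': 'int64',
                                     'referrers': 'object'})
             .compute()
             .reset_index())
print(result)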

On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

随声附和 submitted on 2019-12-01 17:44:59
In the code snippet below, I would expect the logs to print the numbers 0-4. I understand that the numbers may not be in that order, as the task would be broken up into a number of parallel operations. Code snippet:

from dask import dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    print('B val received: ' + str(x.B))
    return x

ddf.apply(aggregate, axis=1).compute()

But when the above code is run, I see this instead:

B val received: 1
B val received: 1
B ...
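
The extra rows of value 1 come from Dask calling the function on a small dummy frame to infer the output metadata. A sketch of one way to avoid the probe (the dtypes below are assumptions matching the example): pass meta= explicitly so only the real rows are processed.

import numpy as np
import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    print('B val received: ' + str(x.B))
    return x

# Declare the output schema so Dask does not need a trial run on dummy data.
ddf.apply(aggregate, axis=1,
          meta={'A': 'int64', 'B': 'int64', 'C': 'int64'}).compute()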

How should I get the shape of a dask dataframe?

若如初见. submitted on 2019-12-01 15:46:47
Performing .shape gives me the following error: AttributeError: 'DataFrame' object has no attribute 'shape'. How should I get the shape instead?

You can get the number of columns directly:

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation:

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it. To get the shape we can try this way:

dask_dataframe.describe().compute()

The "count" row of the result will give the number of ...
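
A minimal sketch combining the two ideas (the toy frame is made up): the column count is available from metadata, while the row count requires a pass over the data.

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({'a': range(10), 'b': range(10)}), npartitions=3)

n_cols = len(ddf.columns)   # known from metadata, no computation needed
n_rows = len(ddf)           # triggers a full pass over the partitions
print((n_rows, n_cols))

# In newer Dask versions, ddf.shape[0] is a lazy scalar you can .compute() instead.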