dask

Why is dot product in dask slower than in numpy?

守給你的承諾、 submitted on 2019-12-01 05:41:58
Question: A dot product in dask seems to run much slower than in numpy:

    import numpy as np

    x_np = np.random.normal(10, 0.1, size=(1000, 100))
    y_np = x_np.transpose()
    %timeit x_np.dot(y_np)
    # 100 loops, best of 3: 7.17 ms per loop

    import dask.array as da

    x_dask = da.random.normal(10, 0.1, size=(1000, 100), chunks=(5, 5))
    y_dask = x_dask.transpose()
    %timeit x_dask.dot(y_dask)
    # 1 loops, best of 3: 6.56 s per loop

Does anybody know what might be the reason for that? Is there anything I'm missing here?

Answer 1:
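One plausible cause, offered as a hedged assumption rather than the original answer: chunks=(5, 5) splits the 1000x100 array into 4,000 tiny blocks, so building and scheduling the task graph costs far more than the arithmetic itself. Larger chunks keep the graph small:

    import dask.array as da

    # Same experiment with much larger chunks (the chunk size here is an
    # assumption, not taken from the original post).
    x_dask = da.random.normal(10, 0.1, size=(1000, 100), chunks=(1000, 100))
    y_dask = x_dask.transpose()
    %timeit x_dask.dot(y_dask).compute()  # a handful of tasks instead of thousands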

What is map_partitions doing?

爱⌒轻易说出口 submitted on 2019-12-01 05:32:27
Question: The dask API says that map_partitions can be used to "apply a Python function on each DataFrame partition." From this description, and from the usual behaviour of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions, with each element being the return value of one of the function calls. However, with respect to the following code, I am not sure what the return value depends on: #generate example
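As a hedged illustration of what map_partitions actually returns — a lazy dask collection whose computed result concatenates the per-partition outputs, not a Python list — consider:

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'x': range(8)})
    ddf = dd.from_pandas(df, npartitions=4)

    # The function receives one pandas DataFrame per partition.
    lengths = ddf.map_partitions(lambda part: pd.Series([len(part)]))

    # compute() returns a single pandas Series: the four per-partition
    # results concatenated together, not a list of four objects.
    print(lengths.compute())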

Loading local file from client onto dask distributed cluster

↘锁芯ラ submitted on 2019-12-01 03:46:47
Question: A bit of a beginner question, but I was not able to find a relevant answer on this. Essentially, my data (about 7 GB) is located on my local machine. I have a distributed cluster running on the local network. How can I get this file onto the cluster? The usual dd.read_csv() or read_parquet() fails, as the workers aren't able to locate the file in their own environments. Would I need to manually transfer the file to each node in the cluster? Note: Due to admin restrictions I am limited to SFTP...
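One possible route, shown as a hedged sketch (the file name and scheduler address below are placeholders, not taken from the question): load the file with pandas on the client, wrap it as a dask dataframe, and persist it so the partitions move into cluster memory.

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client('192.168.1.10:8786')   # placeholder scheduler address

    # Requires enough RAM on the client to hold the file once.
    df = pd.read_csv('local_data.csv', sep='\t')
    ddf = dd.from_pandas(df, npartitions=20)

    # persist() ships the partitions to the workers, so later operations
    # run on the cluster instead of on the client.
    ddf = client.persist(ddf)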

Can't drop columns or slice dataframe using dask?

ε祈祈猫儿з submitted on 2019-12-01 02:28:52
Question: I am trying to use dask instead of pandas since I have a 2.6 GB csv file. I load it and I want to drop a column, but it seems that neither the drop method df.drop('column') nor slicing df[:, :-1] is implemented yet. Is this the case, or am I just missing something?

Answer 1: We implemented the drop method in this PR. This is available as of dask 0.7.0.

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})

    In [3]: import dask.dataframe as dd

    In [4]: ddf = dd.from
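A hedged, self-contained sketch of the same idea (not the original answer's session): dropping a column from a dask dataframe mirrors the pandas call.

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})
    ddf = dd.from_pandas(df, npartitions=2)

    # Drop the 'y' column; axis=1 selects columns, as in pandas.
    print(ddf.drop('y', axis=1).compute())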

How do you transpose a dask dataframe (convert columns to rows) to approach tidy data principles

六月ゝ 毕业季﹏ submitted on 2019-12-01 00:32:34
TLDR: I created a dask dataframe from a dask bag. The dask dataframe treats every observation (event) as a column. So, instead of having rows of data for each event, I have a column for each event. The goal is to transpose the columns to rows in the same way that pandas can transpose a dataframe using df.T.

Details: I have sample twitter data from my timeline here. To get to my starting point, here is the code to read a json from disk into a dask.bag and then convert that into a dask.dataframe:

    import dask.bag as db
    import dask.dataframe as dd
    import json

    b = db.read_text('./sampleTwitter
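Dask dataframes have no .T, so a hedged workaround (assuming the collection fits in memory on one machine) is to collect to pandas and transpose there:

    import pandas as pd
    import dask.dataframe as dd

    # Toy stand-in for the wide, one-column-per-event dataframe described above.
    wide = pd.DataFrame({'event_1': [1, 2], 'event_2': [3, 4]})
    ddf = dd.from_pandas(wide, npartitions=1)

    # The transpose happens in pandas, after compute().
    tidy = ddf.compute().T
    print(tidy)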

Simple dask map_partitions example

◇◆丶佛笑我妖孽 submitted on 2019-11-30 20:46:17
I read the following SO thread and now am trying to understand it. Here is my example:

    import dask.dataframe as dd
    import pandas as pd
    from dask.multiprocessing import get
    import random

    df = pd.DataFrame({'col_1': random.sample(range(10000), 10000),
                       'col_2': random.sample(range(10000), 10000)})

    def test_f(col_1, col_2):
        return col_1 * col_2

    ddf = dd.from_pandas(df, npartitions=8)

    ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)

It generates the error below. What am I doing wrong? Also, I am not clear how to pass additional parameters to the function in
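A hedged sketch of a variant that does work: map_partitions passes the function a whole pandas DataFrame (one partition), so the function should take a DataFrame rather than individual columns.

    import random
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({'col_1': random.sample(range(10000), 10000),
                       'col_2': random.sample(range(10000), 10000)})
    ddf = dd.from_pandas(df, npartitions=8)

    def test_f(partition):
        # 'partition' is a pandas DataFrame holding one partition.
        return partition['col_1'] * partition['col_2']

    ddf['result'] = ddf.map_partitions(test_f)
    print(ddf.head())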

Dask Array from DataFrame

风流意气都作罢 submitted on 2019-11-30 19:22:43
Is there a way to easily convert a DataFrame of numeric values into an Array, similar to .values on a pandas DataFrame? I can't seem to find any way to do this with the provided API, but I'd assume it's a common operation.

Edit: yes, now this is trivial. You can use the .values property:

    x = df.values

Older, now incorrect answer: At the moment there is no trivial way to do this. This is because dask.array needs to know the length of all of its chunks and dask.dataframe doesn't know this length. This cannot be a completely lazy operation. That being said, you can accomplish it using dask.delayed
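A hedged, self-contained sketch of the .values route described above:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # .values returns a dask.array; its chunk lengths along the rows are
    # unknown until the dataframe is computed.
    x = ddf.values
    print(x.compute())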

Dask equivalent to Pandas replace?

大城市里の小女人 submitted on 2019-11-30 15:02:14
Something I use regularly in pandas is the .replace operation. I am struggling to see how to readily perform this same operation on a dask dataframe:

    df.replace('PASS', '0', inplace=True)
    df.replace('FAIL', '1', inplace=True)

You can use mask:

    df = df.mask(df == 'PASS', '0')
    df = df.mask(df == 'FAIL', '1')

Or, equivalently, chaining the mask calls:

    df = df.mask(df == 'PASS', '0').mask(df == 'FAIL', '1')

If anyone would like to know how to replace certain values in a specific column, here's how to do this:

    def replace(x: pd.DataFrame) -> pd.DataFrame:
        return x.replace(
            {'a_feature': ['PASS',
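A hedged, end-to-end sketch of the mask approach on an actual dask dataframe:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'status': ['PASS', 'FAIL', 'PASS']})
    ddf = dd.from_pandas(pdf, npartitions=1)

    # Replace PASS/FAIL with '0'/'1' by masking the matching cells.
    ddf = ddf.mask(ddf == 'PASS', '0').mask(ddf == 'FAIL', '1')
    print(ddf.compute())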

Slow len function on dask distributed dataframe

谁都会走 submitted on 2019-11-30 14:48:31
Question: I have been testing how to use dask (cluster with 20 cores) and I am surprised by the speed that I get when calling a len function vs slicing through loc.

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client('192.168.1.220:8786')

    log = pd.read_csv('800000test', sep='\t')
    logd = dd.from_pandas(log, npartitions=20)

    # This is the code that runs slowly
    # (2.9 seconds, whilst I would expect no more than a few hundred milliseconds)
    print(len(logd))

    # Instead this code is
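A hedged sketch of one likely factor and a mitigation (an assumption, not the original answer): len() on a dask dataframe is a real computation that runs one task per partition and sums the results, so persisting the collection keeps the partitions in memory and repeated len()/loc calls no longer redo the upstream work.

    import pandas as pd
    import dask.dataframe as dd

    log = pd.DataFrame({'a': range(800_000)})

    # persist() materialises the 20 partitions once and caches them.
    logd = dd.from_pandas(log, npartitions=20).persist()

    # Still one task per partition, but over cached data.
    print(len(logd))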