dask

basic groupby operations in Dask

Submitted by 萝らか妹 on 2019-12-10 21:30:12
Question: I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to repeat that value in that column for that particular group. In pandas, I would do the following: df['C'] = df.groupby(['A','B'])['C'].fillna(method='ffill') What would be the equivalent in Dask? Also, I am a little bit lost as to how to structure problems in Dask as opposed to in …
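A minimal sketch of one possible Dask equivalent, assuming the data is read from a CSV file (the path, blocksize, and output file below are made up). groupby(...).apply shuffles the data so each (A, B) group lands in a single partition and then runs an ordinary pandas forward fill inside each group:

import dask.dataframe as dd

# Read the 50 GB file lazily instead of loading it into memory (hypothetical path).
df = dd.read_csv('big_file.csv', blocksize='256MB')

# Per-group forward fill of column C. Dask will warn about inferring meta;
# passing meta= explicitly silences the warning.
df = df.groupby(['A', 'B']).apply(lambda g: g.assign(C=g['C'].ffill()))

# Write the result out rather than collecting 50 GB back into memory.
df.to_parquet('filled.parquet')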

Slow Performance with Python Dask bag?

Submitted by 走远了吗. on 2019-12-10 18:58:38
Question: I'm trying out some tests of dask.bag to prepare for a big text-processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that Dask is running about 5 to 6 times slower than a straight single-threaded text-processing function. Can someone explain where I'll see the speed benefits of running Dask over a large number of text files? How many files would I have to process before it starts getting faster? Is 150,000 small …
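The usual caveat is that each small file becomes a tiny task, so scheduler and serialization overhead can dominate until the per-task work is substantial; for pure-Python string processing the GIL also keeps the threaded scheduler close to one core. A sketch of the kind of test this suggests, with a made-up glob pattern and word-count function:

import dask.bag as db

# One element per line of the files under data/*.txt (hypothetical path).
bag = db.read_text('data/*.txt')

# Pure-Python string work holds the GIL, so processes rather than threads
# are usually the scheduler to try for this kind of bag workload.
word_total = bag.map(lambda line: len(line.split())).sum().compute(scheduler='processes')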

How to avoid an empty result with `Bag.take(n)` when using dask?

Submitted by 我们两清 on 2019-12-10 18:23:14
Question: Context: the Dask documentation states clearly that Bag.take() will only collect from the first partition. However, when using a filter it can occur that the first partition is empty while others are not. Question: is it possible to use Bag.take() so that it collects from a sufficient number of partitions to gather the n items (or the maximum available if fewer than n)?

Answer 1: You could do something like the following:

from toolz import take
f = lambda seq: list(take(n, seq))
b.reduction(f, f)
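A slightly expanded, runnable version of the same idea (the bag, n, and filter below are illustrative): the per-partition function keeps at most n items from each partition, and the aggregate takes the first n from the concatenated partial results, so an empty first partition no longer yields an empty answer:

import dask.bag as db
from toolz import take, concat

n = 5
b = db.from_sequence(range(1000), npartitions=10).filter(lambda x: x >= 995)

perpartition = lambda part: list(take(n, part))
aggregate = lambda parts: list(take(n, concat(parts)))

items = b.reduction(perpartition, aggregate).compute()  # [995, 996, 997, 998, 999]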

parallel dask for loop slower than regular loop?

Submitted by 青春壹個敷衍的年華 on 2019-12-10 16:05:01
Question: If I try to parallelize a for loop with Dask, it ends up executing slower than the regular version. Basically, I just follow the introductory example from the Dask tutorial, but for some reason it's failing on my end. What am I doing wrong?

In [1]: import numpy as np
   ...: from dask import delayed, compute
   ...: import dask.multiprocessing

In [2]: a10e4 = np.random.rand(10000, 11).astype(np.float16)
   ...: b10e4 = np.random.rand(10000, 11).astype(np.float16)

In [3]: def subtract(a, b):
   ...: …
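For reference, a completed version of the snippet above (the loop body is assumed, since the excerpt cuts off). The usual explanation for the slowdown is that every delayed task adds scheduling overhead on the order of a millisecond, so ten thousand tiny float16 subtractions cost more to schedule than to compute; parallelism only pays off once each task does substantially more work than that:

import numpy as np
from dask import delayed, compute

a10e4 = np.random.rand(10000, 11).astype(np.float16)
b10e4 = np.random.rand(10000, 11).astype(np.float16)

def subtract(a, b):
    return a - b

# One tiny task per row: per-task overhead dominates, so this ends up slower
# than the plain loop. Fewer, larger chunks (or dask.array) fare much better.
tasks = [delayed(subtract)(a10e4[i], b10e4[i]) for i in range(a10e4.shape[0])]
results = compute(*tasks, scheduler='threads')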

How to specify the directory that dask uses for temporary files?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-10 16:00:29
Question: Apparently, Dask writes to the /tmp folder during disk-based shuffle operations. On the system that I am using, this folder is mounted on a very small partition (30 GB), causing the following error after some calculations:

IOError: [Errno 28] No space left on device

Traceback:
  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 263, in execute_task
    result = _execute_task(task, data)
  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 245, in _execute …
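In recent Dask versions the spill location is controlled by the temporary-directory configuration key; a minimal sketch, assuming a larger partition is mounted at /scratch:

import dask

# Send disk-based shuffle/spill files to the bigger partition (path is an assumption).
# The same key can be set in ~/.config/dask/dask.yaml or through the
# DASK_TEMPORARY_DIRECTORY environment variable.
dask.config.set({'temporary-directory': '/scratch/dask-tmp'})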

How to speed up nested cross validation in python?

Submitted by 放肆的年华 on 2019-12-10 13:34:56
Question: From what I've found there is one other question like this (Speed-up nested cross-validation), but installing MPI does not work for me after trying several fixes also suggested on this site and by Microsoft, so I am hoping there is another package or answer to this question. I am looking to compare multiple algorithms and grid-search a wide range of parameters (maybe too many parameters?). What ways are there besides mpi4py that could speed up running my code? As I understand it, I cannot use n …
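One MPI-free option is to run the scikit-learn loops on a local Dask cluster through joblib's dask backend, so the nested loops share a single pool of workers instead of oversubscribing cores. A sketch with an assumed estimator, grid, and dataset:

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

client = Client()  # local cluster, one worker process per core by default

inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}, cv=3)

# Both the outer CV and the inner grid search send their joblib work to Dask.
with parallel_backend('dask'):
    scores = cross_val_score(inner, X, y, cv=5, n_jobs=-1)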

Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items

Submitted by 本秂侑毒 on 2019-12-10 11:11:01
Question: I am looking for a way to remove rows from a dataframe that contain low-frequency items. I adapted the following snippet from this post:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])

threshold = 10  # Anything that occurs less than this will be removed.
value_counts = df.stack().value_counts()  # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)

The …
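A sketch of how the same idea might translate to Dask (column names and threshold follow the snippet above): compute the global value counts, pull that small result back locally, then filter the rows lazily:

import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])
df = dd.from_pandas(pdf, npartitions=4)

threshold = 10

# Counts over both columns; the resulting table is tiny, so compute() is cheap.
counts = dd.concat([df['A'], df['B']]).value_counts().compute()
keep = counts[counts > threshold].index.tolist()

# Keep only rows whose values in both columns occur often enough.
filtered = df[df['A'].isin(keep) & df['B'].isin(keep)]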

dask computation not executing in parallel

Submitted by 混江龙づ霸主 on 2019-12-10 03:53:59
Question: I have a directory of JSON files that I am trying to convert to a Dask DataFrame and save to Castra. There are 200 files containing O(10**7) JSON records between them. The code is very simple, largely following the tutorial examples:

import dask.dataframe as dd
import dask.bag as db
import json

txt = db.from_filenames('part-*.json')
js = txt.map(json.loads)
df = js.to_dataframe()
cs = df.to_castra("data.castra")

I am running it on a 32-core machine, but the code only utilizes one core at 100%. My …
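For reference, roughly the same pipeline written against current APIs with the scheduler made explicit (paths and the Parquet output are assumptions; Castra itself is long deprecated). JSON parsing is GIL-bound Python work, so forcing the multiprocessing scheduler is one thing to try when only a single core is busy:

import json
import dask
import dask.bag as db

txt = db.read_text('part-*.json')   # current name for from_filenames
records = txt.map(json.loads)
df = records.to_dataframe()

# Run the whole write with the multiprocessing scheduler so the GIL-bound
# json.loads calls can use more than one core.
with dask.config.set(scheduler='processes'):
    df.to_parquet('data.parquet')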

What is the “right” way to close a Dask LocalCluster?

Submitted by 蓝咒 on 2019-12-10 03:17:17
Question: I am trying to use dask.distributed on my laptop using a LocalCluster, but I have still not found a way to let my application close without raising some warnings or triggering some strange interactions with matplotlib (I am using the tkAgg backend). For example, if I close both the client and the cluster in this order, then tk cannot remove the image from memory in an appropriate way, and I get the following error:

Traceback (most recent call last):
  File "/opt/Python-3.6.0/lib/python3.6 …
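One common pattern that avoids most shutdown warnings is to use both objects as context managers, so the client is closed before the cluster and each waits for its workers; a minimal sketch:

from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
    with Client(cluster) as client:
        futures = client.map(lambda x: x ** 2, range(10))
        results = client.gather(futures)

# Client and cluster are fully closed here, before any matplotlib/tk teardown runs.
print(results)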

How to set up logging on dask distributed workers?

Submitted by 对着背影说爱祢 on 2019-12-09 14:47:39
Question: After upgrading dask.distributed to version 1.15.0, my logging stopped working. I've used logging.config.dictConfig to initialize Python's logging facilities, and previously these settings propagated to all workers. But after the upgrade it doesn't work anymore. If I do dictConfig right before every log call on every worker it works, but that's not a proper solution. So the question is: how do I initialize logging on every worker before my computation graph starts executing, and do it only once per …
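One way to get the dictConfig onto every worker exactly once is to run it there explicitly through the client before submitting any work; a sketch with a made-up scheduler address and a plain console configuration. Workers that join later would need the same call, which is what worker preload scripts or plugins cover:

import logging.config
from dask.distributed import Client

LOGGING = {
    'version': 1,
    'formatters': {'plain': {'format': '%(asctime)s %(levelname)s %(name)s: %(message)s'}},
    'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'plain'}},
    'root': {'level': 'INFO', 'handlers': ['console']},
}

def setup_worker_logging():
    logging.config.dictConfig(LOGGING)

client = Client('tcp://scheduler-address:8786')  # address is an assumption

# Apply the logging configuration once on every currently connected worker,
# before the computation graph starts executing.
client.run(setup_worker_logging)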