dask

basic groupby operations in Dask

Submitted by 萝らか妹 on 2019-12-10 21:30:12
Question: I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to repeat that value in that column for that particular group. In pandas, I would do the following: df['C'] = df.groupby(['A','B'])['C'].fillna(method='ffill') What would be the equivalent in Dask? Also, I am a little bit lost as to how to structure problems in Dask as opposed to in …
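A minimal sketch of one possible Dask equivalent, assuming the data is read from a CSV file (the path, blocksize, and output file below are made up). groupby(...).apply shuffles the data so each (A, B) group lands in a single partition and then runs an ordinary pandas forward fill inside each group:

import dask.dataframe as dd

# Read the 50 GB file lazily instead of loading it into memory (hypothetical path).
df = dd.read_csv('big_file.csv', blocksize='256MB')

# Per-group forward fill of column C. Dask will warn about inferring meta;
# passing meta= explicitly silences the warning.
df = df.groupby(['A', 'B']).apply(lambda g: g.assign(C=g['C'].ffill()))

# Write the result out rather than collecting 50 GB back into memory.
df.to_parquet('filled.parquet')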

Slow Performance with Python Dask bag?

Submitted by 走远了吗. on 2019-12-10 18:58:38
Question: I'm trying out some tests of dask.bag to prepare for a big text-processing job over millions of text files. Right now, on my test sets of dozens to hundreds of thousands of text files, I'm seeing that Dask is running about 5 to 6 times slower than a straight single-threaded text-processing function. Can someone explain where I'll see the speed benefits of running Dask over a large number of text files? How many files would I have to process before it starts getting faster? Is 150,000 small …
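The usual caveat is that each small file becomes a tiny task, so scheduler and serialization overhead can dominate until the per-task work is substantial; for pure-Python string processing the GIL also keeps the threaded scheduler close to one core. A sketch of the kind of test this suggests, with a made-up glob pattern and word-count function:

import dask.bag as db

# One element per line of the files under data/*.txt (hypothetical path).
bag = db.read_text('data/*.txt')

# Pure-Python string work holds the GIL, so processes rather than threads
# are usually the scheduler to try for this kind of bag workload.
word_total = bag.map(lambda line: len(line.split())).sum().compute(scheduler='processes')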

How to avoid an empty result with `Bag.take(n)` when using dask?

Submitted by 我们两清 on 2019-12-10 18:23:14
Question: Context: the Dask documentation states clearly that Bag.take() will only collect from the first partition. However, when using a filter it can occur that the first partition is empty while others are not. Question: is it possible to use Bag.take() so that it collects from a sufficient number of partitions to gather the n items (or the maximum available if fewer than n)?

Answer 1: You could do something like the following:

from toolz import take
f = lambda seq: list(take(n, seq))
b.reduction(f, f)
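A slightly expanded, runnable version of the same idea (the bag, n, and filter below are illustrative): the per-partition function keeps at most n items from each partition, and the aggregate takes the first n from the concatenated partial results, so an empty first partition no longer yields an empty answer:

import dask.bag as db
from toolz import take, concat

n = 5
b = db.from_sequence(range(1000), npartitions=10).filter(lambda x: x >= 995)

perpartition = lambda part: list(take(n, part))
aggregate = lambda parts: list(take(n, concat(parts)))

items = b.reduction(perpartition, aggregate).compute()  # [995, 996, 997, 998, 999]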

parallel dask for loop slower than regular loop?

Submitted by 青春壹個敷衍的年華 on 2019-12-10 16:05:01
Question: If I try to parallelize a for loop with Dask, it ends up executing slower than the regular version. Basically, I just follow the introductory example from the Dask tutorial, but for some reason it's failing on my end. What am I doing wrong?

In [1]: import numpy as np
   ...: from dask import delayed, compute
   ...: import dask.multiprocessing

In [2]: a10e4 = np.random.rand(10000, 11).astype(np.float16)
   ...: b10e4 = np.random.rand(10000, 11).astype(np.float16)

In [3]: def subtract(a, b):
   ...: …
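For reference, a completed version of the snippet above (the loop body is assumed, since the excerpt cuts off). The usual explanation for the slowdown is that every delayed task adds scheduling overhead on the order of a millisecond, so ten thousand tiny float16 subtractions cost more to schedule than to compute; parallelism only pays off once each task does substantially more work than that:

import numpy as np
from dask import delayed, compute

a10e4 = np.random.rand(10000, 11).astype(np.float16)
b10e4 = np.random.rand(10000, 11).astype(np.float16)

def subtract(a, b):
    return a - b

# One tiny task per row: per-task overhead dominates, so this ends up slower
# than the plain loop. Fewer, larger chunks (or dask.array) fare much better.
tasks = [delayed(subtract)(a10e4[i], b10e4[i]) for i in range(a10e4.shape[0])]
results = compute(*tasks, scheduler='threads')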

How to specify the directory that dask uses for temporary files?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-10 16:00:29
Question: Apparently, Dask writes to the /tmp folder during disk-based shuffle operations. On the system that I am using, this folder is mounted on a very small partition (30 GB), causing the following error after some calculations:

IOError: [Errno 28] No space left on device

Traceback:
  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 263, in execute_task
    result = _execute_task(task, data)
  File "[path_to_anaconda]/lib/python2.7/site-packages/dask/async.py", line 245, in _execute …
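In recent Dask versions the spill location is controlled by the temporary-directory configuration key; a minimal sketch, assuming a larger partition is mounted at /scratch:

import dask

# Send disk-based shuffle/spill files to the bigger partition (path is an assumption).
# The same key can be set in ~/.config/dask/dask.yaml or through the
# DASK_TEMPORARY_DIRECTORY environment variable.
dask.config.set({'temporary-directory': '/scratch/dask-tmp'})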

How to speed up nested cross validation in python?

Submitted by 放肆的年华 on 2019-12-10 13:34:56
Question: From what I've found there is one other question like this (Speed-up nested cross-validation), but installing MPI does not work for me after trying several fixes also suggested on this site and by Microsoft, so I am hoping there is another package or answer to this question. I am looking to compare multiple algorithms and grid-search a wide range of parameters (maybe too many parameters?). What ways are there besides mpi4py that could speed up running my code? As I understand it, I cannot use n …
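One MPI-free option is to run the scikit-learn loops on a local Dask cluster through joblib's dask backend, so the nested loops share a single pool of workers instead of oversubscribing cores. A sketch with an assumed estimator, grid, and dataset:

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

client = Client()  # local cluster, one worker process per core by default

inner = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}, cv=3)

# Both the outer CV and the inner grid search send their joblib work to Dask.
with parallel_backend('dask'):
    scores = cross_val_score(inner, X, y, cv=5, n_jobs=-1)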

Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items

Submitted by 本秂侑毒 on 2019-12-10 11:11:01
Question: I am looking for a way to remove rows from a dataframe that contain low-frequency items. I adapted the following snippet from this post:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])

threshold = 10  # Anything that occurs less than this will be removed.
value_counts = df.stack().value_counts()  # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)

The …
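A sketch of how the same idea might translate to Dask (column names and threshold follow the snippet above): compute the global value counts, pull that small result back locally, then filter the rows lazily:

import numpy as np
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])
df = dd.from_pandas(pdf, npartitions=4)

threshold = 10

# Counts over both columns; the resulting table is tiny, so compute() is cheap.
counts = dd.concat([df['A'], df['B']]).value_counts().compute()
keep = counts[counts > threshold].index.tolist()

# Keep only rows whose values in both columns occur often enough.
filtered = df[df['A'].isin(keep) & df['B'].isin(keep)]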

dask computation not executing in parallel

Submitted by 混江龙づ霸主 on 2019-12-10 03:53:59
Question: I have a directory of JSON files that I am trying to convert to a Dask DataFrame and save to Castra. There are 200 files containing O(10**7) JSON records between them. The code is very simple, largely following the tutorial examples:

import dask.dataframe as dd
import dask.bag as db
import json

txt = db.from_filenames('part-*.json')
js = txt.map(json.loads)
df = js.to_dataframe()
cs = df.to_castra("data.castra")

I am running it on a 32-core machine, but the code only utilizes one core at 100%. My …
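For reference, roughly the same pipeline written against current APIs with the scheduler made explicit (paths and the Parquet output are assumptions; Castra itself is long deprecated). JSON parsing is GIL-bound Python work, so forcing the multiprocessing scheduler is one thing to try when only a single core is busy:

import json
import dask
import dask.bag as db

txt = db.read_text('part-*.json')   # current name for from_filenames
records = txt.map(json.loads)
df = records.to_dataframe()

# Run the whole write with the multiprocessing scheduler so the GIL-bound
# json.loads calls can use more than one core.
with dask.config.set(scheduler='processes'):
    df.to_parquet('data.parquet')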

What is the “right” way to close a Dask LocalCluster?

Submitted by 蓝咒 on 2019-12-10 03:17:17
Question: I am trying to use dask.distributed on my laptop using a LocalCluster, but I have still not found a way to let my application close without raising some warnings or triggering some strange interactions with matplotlib (I am using the tkAgg backend). For example, if I close both the client and the cluster in this order, then tk cannot remove the image from memory in an appropriate way, and I get the following error:

Traceback (most recent call last):
  File "/opt/Python-3.6.0/lib/python3.6 …
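One common pattern that avoids most shutdown warnings is to use both objects as context managers, so the client is closed before the cluster and each waits for its workers; a minimal sketch:

from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
    with Client(cluster) as client:
        futures = client.map(lambda x: x ** 2, range(10))
        results = client.gather(futures)

# Client and cluster are fully closed here, before any matplotlib/tk teardown runs.
print(results)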

How to set up logging on dask distributed workers?

Submitted by 对着背影说爱祢 on 2019-12-09 14:47:39
Question: After upgrading dask.distributed to version 1.15.0, my logging stopped working. I've used logging.config.dictConfig to initialize Python's logging facilities, and previously these settings propagated to all workers. But after the upgrade it doesn't work anymore. If I do dictConfig right before every log call on every worker it works, but that's not a proper solution. So the question is: how do I initialize logging on every worker before my computation graph starts executing, and do it only once per …
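One way to get the dictConfig onto every worker exactly once is to run it there explicitly through the client before submitting any work; a sketch with a made-up scheduler address and a plain console configuration. Workers that join later would need the same call, which is what worker preload scripts or plugins cover:

import logging.config
from dask.distributed import Client

LOGGING = {
    'version': 1,
    'formatters': {'plain': {'format': '%(asctime)s %(levelname)s %(name)s: %(message)s'}},
    'handlers': {'console': {'class': 'logging.StreamHandler', 'formatter': 'plain'}},
    'root': {'level': 'INFO', 'handlers': ['console']},
}

def setup_worker_logging():
    logging.config.dictConfig(LOGGING)

client = Client('tcp://scheduler-address:8786')  # address is an assumption

# Apply the logging configuration once on every currently connected worker,
# before the computation graph starts executing.
client.run(setup_worker_logging)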