dask

having problems while using dask map_partitions with string matching algorithm

我的梦境 submitted on 2019-12-13 03:47:15
Question: I'm having some problems applying a text search algorithm with parallelized dask infrastructure. I'm trying to find the best match for 40,000 strings in a series object against a 4,000-string list. I could have done it using pandas.apply, but it's too time-expensive, so I decided to try parallelization with map_partitions in dask. I'm using this text search library with python-Levenshtein: https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python As you can see, it works ok on this
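A minimal sketch of the pattern the question is after, assuming the fuzzywuzzy library (which uses python-Levenshtein when available) and hypothetical column and choice names; map_partitions hands each partition to a plain pandas function, so the fuzzy lookup runs once per partition in parallel:

    import pandas as pd
    import dask.dataframe as dd
    from fuzzywuzzy import process  # assumed library; python-Levenshtein speeds it up

    choices = ["apple inc", "microsoft corp", "alphabet inc"]  # stand-in for the 4,000-string list

    def best_match(partition):
        # plain pandas inside each partition; extractOne returns (match, score)
        return partition["name"].apply(lambda s: process.extractOne(s, choices)[0])

    df = pd.DataFrame({"name": ["appel", "microsfot", "alphabt"]})
    ddf = dd.from_pandas(df, npartitions=2)

    # meta describes the output: a Series of strings named "name"
    matches = ddf.map_partitions(best_match, meta=("name", "object"))
    print(matches.compute())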

Dask: is it safe to pickle a dataframe for later use?

泄露秘密 submitted on 2019-12-12 18:02:22
Question: I have a database-like object containing many dask dataframes. I would like to work with the data, save it, and reload it on the next day to continue the analysis. Therefore, I tried saving dask dataframes (not computation results, just the "plan of computation" itself) using pickle. Apparently, it works (at least, if I unpickle the objects on the exact same machine)... but are there some pitfalls? Answer 1: Generally speaking it is usually safe. However, there are a few caveats: If your dask
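A small sketch of the "pickle the plan of computation" idea, with a hypothetical file path and columns; the pickle carries the task graph, not the data, so the source files must still exist (and the dask version should match) when the object is unpickled:

    import pickle
    import dask.dataframe as dd

    # build a lazy computation; nothing is read from disk yet
    ddf = dd.read_csv("data/*.csv")            # hypothetical input files
    plan = ddf.groupby("key")["value"].mean()  # hypothetical columns

    # save only the plan of computation
    with open("plan.pkl", "wb") as f:
        pickle.dump(plan, f)

    # later (same machine/environment): reload the plan and execute it
    with open("plan.pkl", "rb") as f:
        restored = pickle.load(f)
    result = restored.compute()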

read, process and concatenate pandas dataframes in parallel with dask

南笙酒味 submitted on 2019-12-12 16:04:49
Question: I'm trying to read and process in parallel a list of csv files and concatenate the output into a single pandas dataframe for further processing. My workflow consists of 3 steps: create a series of pandas dataframes by reading a list of csv files (all with the same structure) def loadcsv(filename): df = pd.read_csv(filename) return df for each dataframe create a new column by processing 2 existing columns def makegeom(a,b): return 'Point(%s %s)' % (a,b) def applygeom(df): df['Geom']= df.apply
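One way to parallelize that three-step workflow is dask.delayed plus dd.from_delayed; a sketch reusing the question's helper names, with hypothetical x/y columns and file locations:

    import glob
    import pandas as pd
    import dask
    import dask.dataframe as dd

    def loadcsv(filename):
        return pd.read_csv(filename)

    def applygeom(df):
        # derive a new column from two existing (hypothetical) columns
        df = df.copy()
        df["Geom"] = df.apply(lambda row: "Point(%s %s)" % (row["x"], row["y"]), axis=1)
        return df

    files = glob.glob("data/*.csv")  # hypothetical location
    parts = [dask.delayed(applygeom)(dask.delayed(loadcsv)(f)) for f in files]

    # stitch the lazy partitions into one dask dataframe, then materialize
    ddf = dd.from_delayed(parts)
    result = ddf.compute()  # a single pandas dataframe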

Item Assignment Not Supported in Dask

强颜欢笑 submitted on 2019-12-12 15:22:11
Question: What are the ways we can perform item assignment in Dask arrays? Even a very simple item assignment like a[0] = 2 does not work. Answer 1: Correct. This is the first limitation noted in the documentation. In general, workflows that involve for loops and direct assignment of individual elements are hard to parallelize. Dask array does not make this attempt. Source: https://stackoverflow.com/questions/40935756/item-assignment-not-supported-in-dask
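A hedged sketch of a common workaround for older dask versions: express the change as a whole-array operation (here with dask.array.where) instead of assigning into the array. Newer dask releases have also added support for some NumPy-style assignment, so upgrading may be enough.

    import dask.array as da

    a = da.arange(10, chunks=5)
    idx = da.arange(10, chunks=5)

    # "a[0] = 2" expressed as a new array rather than an in-place assignment
    b = da.where(idx == 0, 2, a)

    print(b.compute())  # [2 1 2 3 4 5 6 7 8 9]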

Is there an advantage to pre-scattering data objects in Dask?

拈花ヽ惹草 submitted on 2019-12-12 14:02:47
Question: If I pre-scatter a data object across worker nodes, does it get copied in its entirety to each of the worker nodes? Is there an advantage in doing so if that data object is big? Using the futures interface as an example: client.scatter(data, broadcast=True) results = dict() for i in tqdm_notebook(range(replicates)): results[i] = client.submit(nn_train_func, data, **params) Using the delayed interface as an example: client.scatter(data, broadcast=True) results = dict() for i in tqdm_notebook
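A sketch of the futures pattern being asked about, with a hypothetical training function: scatter(broadcast=True) copies the data to every worker up front and returns a future, and it is that future (not the raw object) that should be passed to submit so the data is not re-serialized for each task.

    from distributed import Client

    client = Client()  # local cluster for illustration

    def nn_train_func(data, seed=0):
        # hypothetical stand-in for the real training routine
        return sum(data) + seed

    data = list(range(100_000))
    replicates = 5

    # broadcast=True sends a copy of the data to every worker
    data_future = client.scatter(data, broadcast=True)

    results = {}
    for i in range(replicates):
        # pass the future, not the original object
        results[i] = client.submit(nn_train_func, data_future, seed=i)

    print(client.gather(list(results.values())))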

How to map a dask Series with a large dict

自闭症网瘾萝莉.ら submitted on 2019-12-12 13:18:20
Question: I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size <X> MB detected in task graph and suggests using client.scatter and client.submit, but the latter doesn't solve the problem and is in fact much slower. Trying broadcast=True in client.scatter doesn't help either. import argparse import distributed import dask.dataframe as dd import numpy as np import pandas as pd def compute(s
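One way to attack this, sketched under the assumption that the dict fits in worker memory and that the installed distributed version resolves futures passed into dask collections: scatter the mapping once, then apply it with plain pandas inside map_partitions, so the large object is referenced by a future instead of being baked into the task graph.

    import pandas as pd
    import dask.dataframe as dd
    from distributed import Client

    client = Client()

    large_mapping = {i: i * 2 for i in range(1_000_000)}  # stand-in for the big dict

    s = dd.from_pandas(pd.Series(range(10_000)), npartitions=8)

    # ship the dict to the cluster once and keep a handle (future) to it
    mapping_future = client.scatter(large_mapping, broadcast=True)

    def map_with(part, mapping):
        # on the worker the future has been resolved to the real dict
        return part.map(mapping)

    result = s.map_partitions(map_with, mapping_future, meta=(None, "int64"))
    print(result.compute().head())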

Constructing Mode and Corresponding Count Functions Using Custom Aggregation Functions for GroupBy in Dask

时光总嘲笑我的痴心妄想 submitted on 2019-12-12 11:25:50
Question: So dask has now been updated to support custom aggregation functions for groupby. (Thanks to the dev team and @chmp for working on this!) I am currently trying to construct a mode function and a corresponding count function. Basically, what I envision is that mode returns, for each grouping, a list of the most common values for a specific column (i.e. [4, 1, 2]). Additionally, there is a corresponding count function that returns the number of instances of those values, i.e. 3. Now I am currently
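For the single-most-common-value case, a sketch of the kind of dd.Aggregation involved, adapted from the pattern used in the dask docs (counts are built per partition, summed across partitions, then the top value is picked per group); the list-of-values and count variants follow the same shape.

    import pandas as pd
    import dask.dataframe as dd

    def chunk(grouped):
        # per partition: counts of each value within each group
        return grouped.value_counts()

    def agg(grouped):
        # combine partitions: sum the counts per (group, value)
        return grouped.apply(lambda s: s.groupby(level=-1).sum())

    def finalize(s):
        # pick the most frequent value for each group
        level = list(range(s.index.nlevels - 1))
        return s.groupby(level=level).apply(
            lambda g: g.reset_index(level=level, drop=True).idxmax()
        )

    mode = dd.Aggregation("mode", chunk, agg, finalize)

    pdf = pd.DataFrame({"g": [1, 1, 1, 2, 2], "x": [4, 4, 2, 7, 7]})
    ddf = dd.from_pandas(pdf, npartitions=2)
    print(ddf.groupby("g").agg({"x": mode}).compute())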

Dask Distributed Diagnostic Webpage not working

不打扰是莪最后的温柔 submitted on 2019-12-12 10:53:50
Question: I've gotten dask up and running on my cluster, but I can't seem to access the diagnostic webpage. The landing page is visible, see below: But all the links just hang and never load the page. The scheduler started fine with this output: [hoffmand@h05u06 ~]$ dask-scheduler --scheduler-file dask-scheduler.json distributed.scheduler - INFO - ----------------------------------------------- distributed.scheduler - INFO - Scheduler at: tcp://10.36.105.16:8786 distributed.scheduler - INFO - bokeh at:
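A hedged sketch of one way to check where the diagnostics actually live, assuming a reasonably recent distributed release (older versions may not expose dashboard_link); if the compute nodes are not directly reachable from the workstation, an SSH tunnel to the scheduler host is a common workaround.

    from distributed import Client

    # connect with the same scheduler file used to start dask-scheduler
    client = Client(scheduler_file="dask-scheduler.json")

    # the dashboard is served from the scheduler host, not from localhost
    print(client.dashboard_link)  # e.g. http://10.36.105.16:8787/status

    # if that host/port is blocked from your workstation, tunnel it through a
    # reachable machine (hypothetical login node shown), then browse
    # http://localhost:8787 locally:
    #   ssh -L 8787:10.36.105.16:8787 hoffmand@login-node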

Why do pandas and dask perform better when importing from CSV compared to HDF5?

僤鯓⒐⒋嵵緔 submitted on 2019-12-12 08:16:08
Question: I am working with a system that currently operates with large (>5GB) .csv files. To increase performance, I am testing (A) different methods to create dataframes from disk (pandas vs. dask) as well as (B) different ways to store results to disk (.csv vs. hdf5 files). In order to benchmark performance, I did the following: def dask_read_from_hdf(): results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns = ['Security']) analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique() hdf.close()
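A sketch of a symmetric benchmark, reusing the question's store.h5 key and a hypothetical results.csv with the same column, so that both paths read one column and run the same unique() computation all the way through compute():

    import time
    import dask.dataframe as dd

    def time_it(fn, label):
        start = time.perf_counter()
        fn()
        print(f"{label}: {time.perf_counter() - start:.2f}s")

    def dask_read_from_csv():
        ddf = dd.read_csv("results.csv", usecols=["Security"])  # hypothetical csv export
        return ddf.Security.unique().compute()

    def dask_read_from_hdf():
        ddf = dd.read_hdf("store.h5", key="period1", columns=["Security"])
        return ddf.Security.unique().compute()

    time_it(dask_read_from_csv, "dask + csv")
    time_it(dask_read_from_hdf, "dask + hdf5")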

BokehWebInterface not working for Dask Distributed

扶醉桌前 submitted on 2019-12-12 05:20:54
Question: I have updated my Dask from version 0.14.3 to 0.15.0, and distributed from 1.16.3 to 1.17.0. BokehWebInterface has been removed in this version. The homepage at http://localhost:8787 can be loaded, but I can't access tasks, status, or workers (it tries to reload until all tasks are finished and then gives a "can't reach" error). Everything used to work on the earlier version. loop = IOLoop.current() t = Thread(target=loop.start) t.setDaemon(True) t.start() workers = [] services = {('http', HTTP_PORT):
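In these newer releases the Bokeh diagnostics run inside the scheduler process, so the separate BokehWebInterface object is no longer needed; a minimal sketch, assuming a local setup and a recent distributed version (dashboard_link may not exist on very old releases):

    from distributed import Client, LocalCluster

    # LocalCluster starts a scheduler (which hosts the diagnostics) plus workers
    cluster = LocalCluster()
    client = Client(cluster)

    print(cluster.scheduler_address)  # e.g. tcp://127.0.0.1:XXXXX
    print(client.dashboard_link)      # e.g. http://127.0.0.1:8787/status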