dask

Dask Distributed client takes too long to initialize in JupyterLab

妖精的绣舞 submitted on 2019-12-11 02:46:21
Question: Trying to initialize a client with a local cluster in JupyterLab, but it hangs. This behaviour happens with Python 3.5 and JupyterLab 0.35.

    import dask.dataframe as dd
    from dask import delayed
    from distributed import Client
    from distributed import LocalCluster
    import pandas as pd
    import numpy as np
    import json

    cluster = LocalCluster()
    client = Client(cluster)
    client

Versions of the tools:

    Python 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609] on linux
    Type "help", "copyright",
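If the hang comes from LocalCluster trying to spawn worker processes inside the notebook, a threads-only cluster is a common workaround. This is a minimal sketch of that idea, not a confirmed fix for the question above:

    from distributed import Client, LocalCluster

    # processes=False keeps everything in the notebook process (threads only),
    # which avoids the worker-process spawning step that sometimes stalls
    # when LocalCluster is started from a notebook.
    cluster = LocalCluster(processes=False)
    client = Client(cluster)
    print(client)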

Masking in Dask

拟墨画扇 submitted on 2019-12-11 02:35:18
Question: I was just wondering if someone could help show me how to apply functions such as "sum" or "mean" to masked arrays using dask. I wish to calculate the sum / mean of the array over only the values where there is no mask. Code:

    import dask.array as da
    import numpy as np
    import numpy.ma as ma

    dset = [1, 2, 3, 4]
    masked = ma.masked_equal(dset, 4)  # let's say 4 should be masked
    print(np.sum(masked))   # output: 6
    print(np.mean(masked))  # output: 2
    print(masked)           # output: [1, 2, 3, --]
    masked_array = da
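A minimal sketch of one way to do this with dask's masked-array helpers in dask.array.ma, assuming a reasonably recent dask; the chunk size is arbitrary:

    import numpy as np
    import dask.array as da
    from dask.array import ma as da_ma

    dset = da.from_array(np.array([1, 2, 3, 4]), chunks=2)
    masked = da_ma.masked_equal(dset, 4)   # lazily mask the value 4

    # Reductions respect the mask on each chunk, so only unmasked
    # values contribute to the result.
    print(da.sum(masked).compute())    # expected: 6
    print(da.mean(masked).compute())   # expected: 2.0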

Reading csv with separator in python dask

浪子不回头ぞ submitted on 2019-12-11 01:33:08
Question: I am trying to create a DataFrame by reading a csv file separated by '#####' (5 hashes). The code is:

    import dask.dataframe as dd
    df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
    res = df.compute()

Error is:

    dask.async.ValueError: Dask dataframe inspected the first 1,000 rows of your csv file to guess the data types of your columns. These first 1,000 rows led us to an incorrect guess. For example a column may have had integers in the first 1000 rows followed by a float or missing
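The error itself points at the usual remedies: declare the problem columns' dtypes up front, or let inferred integer columns be read as floats so later missing values don't break them. A sketch, where the column name in the dtype mapping is hypothetical:

    import dask.dataframe as dd

    df = dd.read_csv(
        r'D:\temp.csv',
        sep='#####',
        engine='python',
        assume_missing=True,               # inferred int columns become float
        dtype={'some_column': 'object'},   # hypothetical column name
    )
    res = df.compute()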

Fault tolerance in Spark vs Dask

佐手、 submitted on 2019-12-11 00:56:38
Question: I read the following in the Dask documentation, in the known limitations section:

    It [Dask] is not fault tolerant. The failure of any worker is likely to crash the system. It does not fail gracefully in case of errors.

But I don't see any mention of fault tolerance in the comparison with Spark. These are currently the "Reasons why you might choose Spark":

    - You prefer Scala or the SQL language
    - You have mostly JVM infrastructure and legacy systems
    - You want an established and trusted solution for

Creating a dask bag from a generator

孤者浪人 submitted on 2019-12-11 00:23:42
Question: I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory.

    delayed_array = [delayed(generator) for generator in list_of_generators]
    my_bag = db.from_delayed(delayed_array)

NB: list_of_generators is exactly that - the generators haven't been consumed (yet). My problem is that when creating delayed_array the generators are consumed and RAM is exhausted. Is there a way to get these long lists into the
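One pattern that avoids consuming the generators up front is to delay the function that builds each generator, so each one is created and consumed inside a worker task rather than in the driver process. This is a sketch under the assumption that each generator can be recreated from a small argument; make_items and seeds are hypothetical names:

    import dask
    import dask.bag as db

    def make_items(seed):
        # Hypothetical factory standing in for whatever builds each
        # large generator; nothing is evaluated until a task calls it.
        return (seed * i for i in range(1_000_000))

    @dask.delayed
    def materialise(seed):
        # The generator is created and consumed inside the task,
        # one partition at a time, instead of up front in the driver.
        return list(make_items(seed))

    seeds = range(8)                                   # hypothetical
    my_bag = db.from_delayed([materialise(s) for s in seeds])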

Numba `nogil` + dask threading backend results in no speed up (computation is slower!)

一曲冷凌霜 submitted on 2019-12-11 00:19:30
Question: I'm trying to use Numba and Dask to speed up a slow computation that is similar to calculating the kernel density estimate of a huge collection of points. My plan was to write the computationally expensive logic in a jit-compiled function and then split the work among the CPU cores using dask. I wanted to use the nogil feature of numba.jit so that I could use the dask threading backend and avoid unnecessary memory copies of the input data (which is very large). Unfortunately, Dask
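For reference, the general shape of that approach looks something like the sketch below: a nogil-jitted kernel applied per chunk on the threaded scheduler. The kernel is a hypothetical stand-in, not the questioner's KDE code:

    import numpy as np
    import numba
    import dask.array as da

    @numba.njit(nogil=True)
    def chunk_score(x):
        # Hypothetical expensive per-chunk kernel; nogil=True releases
        # the GIL so several chunks can run on threads concurrently.
        total = 0.0
        for i in range(x.shape[0]):
            total += np.exp(-x[i] * x[i])
        return total

    arr = da.random.random(8_000_000, chunks=1_000_000)

    # One scalar per chunk, then a final sum; scheduler='threads' keeps the
    # chunks in shared memory, so no inter-process copies are made.
    partials = arr.map_blocks(lambda b: np.array([chunk_score(b)]),
                              chunks=(1,), dtype=float)
    print(partials.sum().compute(scheduler='threads'))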

How do I convert a list of Pandas futures to a Dask Dataframe?

拈花ヽ惹草 submitted on 2019-12-10 23:35:34
Question: I have a list of Dask futures that point to Pandas dataframes:

    from dask.distributed import Client
    client = Client()

    import pandas as pd
    futures = client.map(pd.read_csv, filenames)

How do I convert these to a Dask dataframe? Note: I know that dask.dataframe.read_csv exists, I'm just using pd.read_csv as an example.

Answer 1: You probably want dask.dataframe.from_delayed:

    import dask.dataframe as dd
    df = dd.from_delayed(futures)

See the docstring for additional options.

Source: https://stackoverflow.com
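Putting the question and answer together, a self-contained sketch might look like this; the CSV filenames are placeholders:

    import pandas as pd
    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()                                 # local scheduler + workers

    filenames = ['part-0.csv', 'part-1.csv']          # hypothetical files
    futures = client.map(pd.read_csv, filenames)      # one future per file

    # from_delayed accepts futures as well as delayed objects; each
    # future becomes one partition of the resulting dask dataframe.
    df = dd.from_delayed(futures)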

Collecting attributes from dask dataframe providers

江枫思渺然 submitted on 2019-12-10 22:56:34
Question (TL;DR): How can I collect metadata (errors during parsing) from distributed reads into a dask dataframe collection?

I currently have a proprietary file format I'm using to feed into dask.DataFrame. I have a function that accepts a file path and returns a pandas.DataFrame, used internally by dask.DataFrame successfully to load multiple files into the same dask.DataFrame. Up until recently, I was using my own code to merge several pandas.DataFrames into one, and now I'm working on using dask
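One way to surface per-file parse errors alongside the data is to have the loader return both, delay it with nout=2, and gather the two streams separately. A sketch, where parse_proprietary and the file paths are hypothetical:

    import dask
    import dask.dataframe as dd

    def load_file(path):
        # Hypothetical loader for the proprietary format: returns the
        # parsed pandas.DataFrame plus the parse errors seen in that file.
        frame, errors = parse_proprietary(path)   # parse_proprietary is assumed
        return frame, errors

    paths = ['part-0.bin', 'part-1.bin']           # hypothetical paths
    pairs = [dask.delayed(load_file, nout=2)(p) for p in paths]

    frames = [pair[0] for pair in pairs]           # delayed DataFrames
    errors = [pair[1] for pair in pairs]           # delayed error lists

    df = dd.from_delayed(frames)                   # the data as one collection
    collected_errors = dask.compute(*errors)       # the metadata, gathered separately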

Generating batches of images in dask

心已入冬 submitted on 2019-12-10 22:48:09
Question: I just started with dask because it offers great parallel processing power. I have around 40,000 images on my disk which I am going to use for building a classifier with some DL library, say Keras or TF. I collected this meta-info (image path and corresponding label) in a pandas dataframe, which looks like this:

         img_path  labels
    0  data/1.JPG       1
    1  data/2.JPG       1
    2  data/3.JPG       5
    ...

Now here is my simple task: use dask to read images and corresponding labels in a lazy fashion. Do some processing on
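A sketch of the lazy-reading part using dask.delayed and dask.array; the image reader, shape, and dtype (224x224 RGB uint8 via scikit-image) are assumptions, not details from the question:

    import numpy as np
    import pandas as pd
    import dask
    import dask.array as da
    from skimage.io import imread           # assumed reader; PIL would also work

    # Toy version of the dataframe from the question (img_path / labels).
    df = pd.DataFrame({'img_path': ['data/1.JPG', 'data/2.JPG', 'data/3.JPG'],
                       'labels': [1, 1, 5]})

    lazy_reads = [dask.delayed(imread)(path) for path in df['img_path']]

    # Each delayed read becomes a lazy dask array; shape and dtype are assumed.
    arrays = [da.from_delayed(r, shape=(224, 224, 3), dtype=np.uint8)
              for r in lazy_reads]
    images = da.stack(arrays)               # lazy (n_images, 224, 224, 3) array
    labels = df['labels'].to_numpy()

    first_batch = images[:2].compute()      # materialise a small batch on demand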

Generating parquet files - differences between R and Python

给你一囗甜甜゛ submitted on 2019-12-10 21:48:44
Question: We have generated a parquet file in Dask (Python) and with Drill (R, using the Sergeant package). We have noticed a few issues:

    1. The Dask (i.e. fastparquet) output has a _metadata and a _common_metadata file, while the parquet file from R / Drill does not have these files and has parquet.crc files instead (which can be deleted). What is the difference between these parquet implementations?

Answer 1: (Only answering 1; please post separate questions to make them easier to answer.) _metadata
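For context, this is roughly how the Dask side of the comparison is produced; the toy dataframe and output path are placeholders. With the fastparquet engine, the writer adds _metadata / _common_metadata summary files next to the data files, and readers that do not use them can safely ignore them:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'x': range(10), 'y': list('abcdefghij')})
    df = dd.from_pandas(pdf, npartitions=2)

    # fastparquet writes the part files plus _metadata and _common_metadata,
    # which summarise the schema and row-group locations for faster reads.
    df.to_parquet('example_parquet', engine='fastparquet')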