dask

Dask running out of memory even with chunks

左心房为你撑大大i submitted on 2020-01-30 03:26:41
Question: I'm working with big CSV files and I need to compute a Cartesian product (a merge operation). I've tried to tackle the problem with Pandas (you can check the Pandas code and a sample of the data format for the same problem here) without success, due to memory errors. Now I'm trying Dask, which is supposed to manage huge datasets even when their size is bigger than the available RAM. First of all I read both CSVs:

from dask import dataframe as dd

BLOCKSIZE = 64000000  # = 64 MB chunks

df1_file_path = '.
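The question is cut off above. For context, Dask has no built-in cross join, so a Cartesian product is usually emulated by adding a constant key column to both frames and merging on it. A minimal sketch of that pattern; the file paths and the output pattern are placeholders, not from the question, and writing straight to disk is a deliberate choice so the full product never has to fit in RAM:

# Sketch: Cartesian product (cross join) with Dask. Paths are hypothetical.
import dask.dataframe as dd

BLOCKSIZE = 64000000  # 64 MB partitions, as in the question

df1 = dd.read_csv('df1.csv', blocksize=BLOCKSIZE)
df2 = dd.read_csv('df2.csv', blocksize=BLOCKSIZE)

# Give both frames the same constant key and merge on it.
df1 = df1.assign(_key=1)
df2 = df2.assign(_key=1)

product = df1.merge(df2, on='_key').drop('_key', axis=1)

# Stream the result to disk instead of calling .compute(),
# so the full product is never materialised in memory at once.
product.to_csv('product-*.csv')

Note that the constant key forces everything through one merge group, so this sketch trades speed for memory; it avoids the out-of-memory error rather than making the product cheap.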

Plotting 2D data using Xarray takes a surprisingly long time?

半腔热情 submitted on 2020-01-25 07:12:41
Question: I am reading NetCDF files using xarray. Each variable has 4 dimensions (Times, lev, y, x). After reading the variable, I calculate the mean of the variable QVAPOR along the (Times, lev) dimensions. After the calculation I get the variable QVAPOR_mean, which is a 2D variable with shape (y: 699, x: 639). Xarray took only 10 microseconds to read the data with shape (Times: 2918, lev: 36, y: 699, x: 639), but took more than 60 minutes to plot the filled contour of the data of shape (y: 699, x: 639). I
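Those timings are the usual signature of lazy loading: xarray only reads metadata when the file is opened, and the expensive disk I/O and reduction run the first time the plot needs actual values. A sketch of forcing the computation once, up front, so plotting only draws the 699x639 result; the file name and chunk sizes are assumptions, while QVAPOR and the dimension names come from the question:

import xarray as xr
import matplotlib.pyplot as plt

# Hypothetical file name; chunks here are illustrative, not from the question.
ds = xr.open_dataset('wrfout.nc', chunks={'Times': 100})

# Lazy: nothing is read yet, which is why it appears to take microseconds.
qvapor_mean = ds['QVAPOR'].mean(dim=('Times', 'lev'))

# Force the reduction once; the slow part now happens here, visibly.
qvapor_mean = qvapor_mean.load()

# The plot then only has to render a small 2D array.
qvapor_mean.plot.contourf()
plt.show()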

Parallel excel sheet read from dask

℡╲_俬逩灬. submitted on 2020-01-22 16:10:06
Question: Hello. All the examples that I have come across for using Dask so far involve multiple CSV files in a folder being read with a Dask read_csv call. If I am given an xlsx file with multiple tabs, can I use anything in Dask to read them in parallel? P.S. I am using pandas 0.19.2 with Python 2.7.

Answer 1: For those using Python 3.6:

# reading the file using dask
import dask
import dask.dataframe as dd
from dask.delayed import delayed

parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols =
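The answer is truncated above. A runnable sketch of the same dask.delayed approach, extended to one delayed read per sheet so the tabs are read in parallel; the file name is a placeholder, and this assumes a recent pandas (the sheet_name keyword, not the older sheetname):

import dask
import dask.dataframe as dd
import pandas as pd
from dask.delayed import delayed

excel_file = 'workbook.xlsx'  # hypothetical path

# Discover the sheet names, then build one lazy pandas read per sheet.
sheets = pd.ExcelFile(excel_file).sheet_names
parts = [delayed(pd.read_excel)(excel_file, sheet_name=s) for s in sheets]

# Stitch the delayed reads into a single Dask DataFrame.
df = dd.from_delayed(parts)
print(df.head())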

Why does `linedelimiter` not work for bag.read_text?

*爱你&永不变心* submitted on 2020-01-16 16:28:27
Question: I am trying to load YAML from files created by

entries = bag.from_sequence([{1:2}, {3:4}])
yamls = entries.map(yaml.dump)
yamls.to_textfiles('*.yaml.gz')

with

yamls = bag.read_text('*.yaml.gz', linedelimiter='\n\n')

but it reads the files line by line. How can I read the YAML documents from the files?

UPDATE: With blocksize=None, read_text reads files line by line; but if blocksize is set, compressed files cannot be read. How can I overcome this? Is uncompressing the files the only option?

Answer 1: Indeed, linedelimiter
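The answer is cut off above. One workaround that sidesteps linedelimiter entirely is to read each compressed file whole and let PyYAML split the multi-document stream itself; this is a sketch under the assumption that each file holds one or more YAML documents, with the glob pattern as a placeholder:

import glob
import gzip

import dask.bag as db
import yaml

def load_yaml_docs(path):
    # Read the whole gzip file and parse every YAML document in it,
    # avoiding read_text's line splitting altogether.
    with gzip.open(path, 'rt') as f:
        return list(yaml.safe_load_all(f))

paths = glob.glob('*.yaml.gz')  # hypothetical pattern
docs = db.from_sequence(paths).map(load_yaml_docs).flatten()
print(docs.compute())

This gives one partition per file rather than per block, which is usually fine for many small compressed files.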

Push a pure-python module to Dask workers

微笑、不失礼 submitted on 2020-01-16 15:46:47
Question: Is there an easy way in Dask to push a pure-python module to the workers? I have many workers in a cluster and I want to distribute a local module that I have on my client. I understand that for large packages like NumPy or Python I should distribute things in a more robust fashion, but I have a small module that changes frequently and shouldn't be too much work to move around.

Answer 1: Alternatively, if you wish to deploy a package to the workers after they have started, you can do something
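The answer is truncated. For a single-file pure-python module, the standard mechanism is Client.upload_file, which ships a .py (or egg/zip) file to every currently connected worker. A minimal sketch; the scheduler address, module name, and process function are placeholders:

from dask.distributed import Client

client = Client('scheduler-address:8786')  # hypothetical address

# Ship the single-file module to all connected workers. Good enough for
# a small, frequently changing module; larger packages are better
# installed properly in the worker environments.
client.upload_file('mymodule.py')

def use_it(x):
    import mymodule  # imported on the worker after the upload
    return mymodule.process(x)  # hypothetical function in the module

future = client.submit(use_it, 42)
print(future.result())

One caveat worth knowing: workers that join the cluster after the upload will not have the file, so re-upload after scaling up.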

Dask opportunistic caching in custom graphs

爱⌒轻易说出口 submitted on 2020-01-16 08:54:20
Question: I have a custom DAG such as:

dag = {'load': (load, 'myfile.txt'),
       'heavy_comp': (heavy_comp, 'load'),
       'simple_comp_1': (sc_1, 'heavy_comp'),
       'simple_comp_2': (sc_2, 'heavy_comp'),
       'simple_comp_3': (sc_3, 'heavy_comp')}

And I'm looking to compute the keys simple_comp_1, simple_comp_2, and simple_comp_3, which I perform as follows:

import dask
from dask.distributed import Client
from dask_yarn import YarnCluster

task_1 = dask.get(dag, 'simple_comp_1')
task_2 = dask.get(dag, 'simple_comp_2')
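The snippet is cut off, but note that each separate dask.get call walks the graph independently, so load and heavy_comp are recomputed for every key. The usual fix, sketched below with the functions from the question assumed to be defined, is to request all three keys in a single call so the shared intermediate is computed once:

import dask

# dag as defined in the question; load, heavy_comp, sc_1..sc_3 are its functions.
# Passing a list of keys computes them in one graph traversal,
# so 'load' and 'heavy_comp' each run only once.
results = dask.get(dag, ['simple_comp_1', 'simple_comp_2', 'simple_comp_3'])

r1, r2, r3 = results

The same idea applies with a distributed client: client.get(dag, [...]) shares intermediates across the requested keys in one submission.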

What is the way to add an index column in Dask when reading from a CSV?

為{幸葍}努か submitted on 2020-01-15 10:33:53
Question: I'm trying to process a fairly large dataset that doesn't fit into memory when loaded at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset once it has been read with the read_csv method. I keep getting an error (see code). I'm trying to create an index column so I can set that new column as the index for the data, but the error appears to be telling me to set the index first, before creating the column.

CODE

df = dd.read_csv(r'path\to
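The code is truncated above. A common pattern for a globally unique ID in Dask is a constant column followed by a cumulative sum, which does not require setting an index first; a hedged sketch with a placeholder path:

import dask.dataframe as dd

df = dd.read_csv('data-*.csv')  # hypothetical path

# Assign a constant, then take its cumulative sum: ids become 1, 2, 3, ...
# across all partitions. This avoids reset_index, which would number
# each partition independently from zero.
df['id'] = 1
df['id'] = df['id'].cumsum()

Once computed, the column can then be promoted to the index with set_index('id') if needed, at the cost of a shuffle.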

dask distributed memory error

此生再无相见时 submitted on 2020-01-15 06:50:10
Question: I got the following error on the scheduler while running Dask on a distributed job:

distributed.core - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/distributed/core.py", line 269, in write
    frames = protocol.dumps(msg)
  File "/usr/local/lib/python3.4/dist-packages/distributed/protocol.py", line 81, in dumps
    frames = dumps_msgpack(small)
  File "/usr/local/lib/python3.4/dist-packages/distributed/protocol.py", line 153, in dumps_msgpack
    payload = msgpack
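The traceback is truncated, but the failure sits inside msgpack serialization of a scheduler message, which in practice often points to a very large object embedded directly in the task graph. A hedged sketch of the usual mitigation, scattering the data to the workers first so the graph only carries a small key; the address, data, and function are all placeholders:

from dask.distributed import Client

def process(data):
    # placeholder computation
    return len(data)

client = Client('scheduler-address:8786')  # hypothetical address

big_data = list(range(10_000_000))  # stands in for the large object

# Embedding big_data directly in submit() serializes it into the scheduler
# message, one common way to hit protocol.dumps errors. Scattering it first
# moves it to a worker; the graph then only references it by key.
[data_future] = client.scatter([big_data])
result = client.submit(process, data_future).result()
print(result)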

How can I get result of Dask compute on a different machine than the one that submitted it?

时光总嘲笑我的痴心妄想 submitted on 2020-01-15 03:22:05
Question: I am using Dask behind a Django server, and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here: https://github.com/MoonVision/django-dask-demo/blob/master/demo/daskmanager/daskmanager.py

I want to be able to separate the saving of a task from the server that submitted it, for robustness and scalability. I would also like more detailed information as to the processing status of the task; right now the future status
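The question is cut off above. One mechanism for decoupling a result from the process that submitted it is to publish the future as a named dataset on the scheduler; any other client connected to the same scheduler can then fetch it by name. A sketch under the assumption that both machines reach the same scheduler; the address, task, and dataset name are placeholders:

from dask.distributed import Client

def run_task(x):
    # placeholder for the real Django-submitted work
    return x * 2

# Machine A: submit the work and publish it under a well-known name.
client_a = Client('scheduler-address:8786')  # hypothetical address
future = client_a.submit(run_task, 21)
client_a.publish_dataset(my_result=future)  # scheduler keeps the result alive

# Machine B: a completely separate process or host fetches it by name.
client_b = Client('scheduler-address:8786')
future_b = client_b.get_dataset('my_result')
print(future_b.result())  # 42

Publishing also answers the robustness concern: the scheduler holds a reference to the result even if the submitting client disconnects.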

Convert string to dict, then access key:values? How to access data in a <class 'dict'> in Python?

让人想犯罪 __ submitted on 2020-01-14 07:29:08
Question: I am having issues accessing data inside a dictionary.

Sys: MacBook 2012
Python: Python 3.5.1 :: Continuum Analytics, Inc.

I am working with a dask.dataframe created from a CSV.

Edit: how I got to this point. Assume I start out with a Pandas Series:

df.Coordinates
130 {u'type': u'Point', u'coordinates': [-43.30175...
278 {u'type': u'Point', u'coordinates': [-51.17913...
425 {u'type': u'Point', u'coordinates': [-43.17986...
440 {u'type': u'Point', u'coordinates': [-51.16376...
877 {u
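The series above holds dicts stored as strings (CSV round-trips lose the dict type). The common fix is ast.literal_eval, which safely parses each string back into a real dict; a sketch with shortened illustrative values, since the originals are truncated in the question:

import ast

import pandas as pd

# Illustrative reconstruction of the column shown above; values are made up.
s = pd.Series([
    "{u'type': u'Point', u'coordinates': [-43.301, -22.990]}",
    "{u'type': u'Point', u'coordinates': [-51.179, -30.031]}",
])

# ast.literal_eval parses Python literals (including u'' strings) without
# the security risk of eval().
coords = s.apply(ast.literal_eval)

# Each element is now a real dict, so keys can be accessed normally.
print(coords.iloc[0]['coordinates'])  # [-43.301, -22.99]

With a dask.dataframe, the same parsing applies via df['Coordinates'].apply(ast.literal_eval, meta=('Coordinates', 'object')).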