dask

How to efficiently submit tasks with large arguments in Dask distributed?

Submitted by 空扰寡人 on 2019-11-29 13:14:26
I want to submit functions with Dask that have large (gigabyte scale) arguments. What is the best way to do this? I want to run this function many times with different (small) parameters.

Example (bad)

This uses the concurrent.futures interface. We could use the dask.delayed interface just as easily.

    import numpy as np
    from dask.distributed import Client

    x = np.random.random(size=100000000)  # 800MB array
    params = list(range(100))             # 100 small parameters

    def f(x, param):
        pass

    c = Client()
    futures = [c.submit(f, x, param) for param in params]

But this is slower than I would expect or results in memory errors. OK,
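A commonly recommended pattern here (my own addition, not part of the original post, so treat it as a sketch) is to pre-scatter the large array to the workers once and pass the resulting future into each submit call, so the data is serialized and shipped a single time instead of once per task:

    import numpy as np
    from dask.distributed import Client

    def f(x, param):
        pass

    client = Client()
    x = np.random.random(size=100000000)  # ~800MB array

    # Ship the big array to the cluster once; later tasks refer to it by key
    big_future = client.scatter(x, broadcast=True)
    futures = [client.submit(f, big_future, param) for param in range(100)]
    results = client.gather(futures)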

dask dataframe read parquet schema difference

Submitted by 余生颓废 on 2019-11-29 12:16:05
I do the following:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()
    raw_data_df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True,
                              parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

The dataset is taken from a presentation Matthew Rocklin gave and was used as a Dask dataframe demo. Then I try to write it to Parquet using pyarrow:

    raw_data_df.to_parquet(path='dataset/parquet/2015.parquet/')  # only pyarrow is installed

Trying to read it back:

    raw_data_df = dd.read_parquet(path='dataset/parquet/2015.parquet/')

I get the following
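One thing that often resolves this kind of round-trip schema mismatch is pinning the same Parquet engine explicitly on both the write and the read. A minimal sketch, assuming the same NYC taxi dataframe as above:

    import dask.dataframe as dd

    df = dd.read_csv('dataset/nyctaxi/nyctaxi/*.csv', assume_missing=True,
                     parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

    # Use the same engine on both sides so they agree on the schema
    df.to_parquet('dataset/parquet/2015.parquet/', engine='pyarrow')
    df2 = dd.read_parquet('dataset/parquet/2015.parquet/', engine='pyarrow')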

How to see progress of Dask Compute task?

Submitted by 蹲街弑〆低调 on 2019-11-29 11:22:37
Question: I would like to see a progress bar in a Jupyter notebook while I'm running a compute task using Dask. I'm counting all values of the "id" column from a large CSV file (4+ GB), so any ideas?

    import dask.dataframe as dd
    df = dd.read_csv('data/train.csv')
    df.id.count().compute()

Answer 1: If you're using the single-machine scheduler then do this:

    from dask.diagnostics import ProgressBar
    ProgressBar().register()

http://dask.pydata.org/en/latest/diagnostics-local.html

If you're using the distributed scheduler
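For reference, a hedged sketch of how a progress bar is usually surfaced when a distributed Client is attached (this is my own addition, not the truncated answer above): turn the computation into a future with client.compute and hand it to dask.distributed.progress.

    import dask.dataframe as dd
    from dask.distributed import Client, progress

    client = Client()
    df = dd.read_csv('data/train.csv')

    # client.compute() returns a future instead of blocking, so we can watch it
    future = client.compute(df.id.count())
    progress(future)            # renders a progress bar in the notebook
    result = future.result()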

iterate over GroupBy object in dask

Submitted by 僤鯓⒐⒋嵵緔 on 2019-11-29 10:40:25
Is it possible to iterate over a dask GroupBy object to get access to the underlying dataframes? I tried:

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
    ddf = dd.from_pandas(pdf, npartitions=3)

    groups = ddf.groupby('B')
    for name, df in groups:
        print(name)

However, this results in an error: KeyError: 'Column not found: 0'. More generally speaking, what kinds of interaction does the dask GroupBy object allow, apart from the apply method? You could iterate through the groups doing this with dask; maybe there is a better way, but this
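A hedged sketch of one way to do that, assuming the small frame above: compute the distinct group keys first (a small result), then pull each group out lazily with get_group.

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['1', '1', 'a', 'a', 'a']})
    ddf = dd.from_pandas(pdf, npartitions=3)
    groups = ddf.groupby('B')

    for name in ddf['B'].unique().compute():
        group = groups.get_group(name)   # still a lazy dask dataframe
        print(name, group.compute())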

python-xarray: open_mfdataset concat along two dimensions

Submitted by 喜夏-厌秋 on 2019-11-29 07:11:07
I have files which are made of 10 ensembles and 35 time files. One of these files looks like:

    >>> xr.open_dataset('ens1/CCSM4_ens1_07ic_19820701-19820731_NPac_Jul.nc')
    <xarray.Dataset>
    Dimensions:    (ensemble: 1, latitude: 66, longitude: 191, time: 31)
    Coordinates:
      * ensemble   (ensemble) int32 1
      * latitude   (latitude) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ...
      * longitude  (longitude) float32 100.0 101.0 102.0 103.0 104.0 105.0 106.0 ...
      * time       (time) datetime64[ns] 1982-07-01 1982-07-02 1982-07-03 ...
    Data variables:
        u10m       (time, latitude, longitude) float64 -1.471 -0.05933 -1.923 ...
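A hedged sketch of concatenating along both dimensions, assuming the files live in directories ens1 through ens10 (the directory layout is an assumption based on the path above): open each ensemble's time files with open_mfdataset, then stack the ensembles with xr.concat.

    import glob
    import xarray as xr

    # One lazy dataset per ensemble; open_mfdataset concatenates its files along time
    ensembles = [
        xr.open_mfdataset(sorted(glob.glob('ens%d/*.nc' % i)), combine='by_coords')
        for i in range(1, 11)
    ]

    # Then concatenate the ensembles along the existing 'ensemble' dimension
    ds = xr.concat(ensembles, dim='ensemble')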

Convert raster time series of multiple GeoTIFF images to NetCDF

Submitted by 守給你的承諾、 on 2019-11-29 05:16:37
I have a raster time series stored in multiple GeoTIFF files ( *.tif ) that I'd like to convert to a single NetCDF file. The data is uint16. I could probably use gdal_translate to convert each image to NetCDF using:

    gdal_translate -of netcdf -co FORMAT=NC4 20150520_0164.tif foo.nc

and then do some scripting with NCO to extract dates from the filenames and concatenate, but I was wondering whether I might do this more effectively in Python using xarray and its new rasterio backend. I can read a file easily:

    import glob
    import xarray as xr

    f = glob.glob('*.tif')
    da = xr.open_rasterio(f[0])
    da
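A hedged sketch of one way to finish the job in xarray, assuming every filename starts with a YYYYMMDD date as in the example above (that naming pattern is an assumption): read each GeoTIFF lazily, stack them along a new time dimension built from the filenames, and write the result to NetCDF.

    import glob
    import pandas as pd
    import xarray as xr

    files = sorted(glob.glob('*.tif'))

    # Parse a timestamp from each filename, e.g. '20150520_0164.tif' -> 2015-05-20
    times = pd.to_datetime([f.split('_')[0] for f in files], format='%Y%m%d')

    # Read each GeoTIFF lazily (dask-backed) and stack along a new 'time' dimension
    da = xr.concat(
        [xr.open_rasterio(f, chunks={'x': 1024, 'y': 1024}) for f in files],
        dim=pd.Index(times, name='time'),
    )

    da.to_dataset(name='data').to_netcdf('stack.nc')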

Speeding up reading of very large netcdf file in python

Submitted by 不问归期 on 2019-11-29 01:16:56
Question: I have a very large netCDF file that I am reading using netCDF4 in Python. I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to fit in memory at once. The first dimension represents time, and the next two represent latitude and longitude respectively.

    import netCDF4
    nc_file = netCDF4.Dataset(path_file, 'r', format='NETCDF4')
    for yr in years:
        nc_file.variables[variable_name][int(yr), :, :]

However, reading one year at a time is
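A hedged sketch of a common alternative, assuming the time dimension is named 'time' (path_file and variable_name are placeholders carried over from the question): open the file through xarray with dask-backed chunks, so reads happen in large contiguous blocks and reductions stream through memory.

    import xarray as xr

    # Chunk along time so dask reads sizeable contiguous blocks in parallel
    ds = xr.open_dataset(path_file, chunks={'time': 120})

    # Example reduction over the whole series without loading it all at once
    mean_field = ds[variable_name].mean('time').compute()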

Dask: How would I parallelize my code with dask delayed?

Submitted by 落爺英雄遲暮 on 2019-11-28 21:18:02
This is my first venture into parallel processing and I have been looking into Dask, but I am having trouble actually coding it. I have had a look at their examples and documentation, and I think dask.delayed will work best. I attempted to wrap my functions with delayed(function_name), or add an @delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is written in Python and for its (supposed) simplicity. I know Dask doesn't operate on the for loop itself, but they say it can work inside a loop. My code passes files through a function that contains
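A hedged sketch of the usual delayed-in-a-loop pattern (process_file and the file list are placeholders, not the poster's actual code): each wrapped call only builds a task inside the loop, and dask.compute runs the whole graph in parallel at the end.

    import dask
    from dask import delayed

    @delayed
    def process_file(path):
        # stand-in for the real per-file work
        with open(path) as fh:
            return len(fh.read())

    files = ['a.txt', 'b.txt', 'c.txt']

    # Each call is lazy; nothing executes yet
    lazy_results = [process_file(f) for f in files]

    # Trigger the whole graph at once
    results = dask.compute(*lazy_results)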

How do I change rows and columns in a dask dataframe?

Submitted by 蓝咒 on 2019-11-28 07:39:58
Question: There are a few issues I am having with Dask dataframes. Let's say I have a dataframe with two columns ['a', 'b']. If I want a new column c = a + b, in pandas I would do:

    df['c'] = df['a'] + df['b']

In Dask I am doing the same operation as follows:

    df = df.assign(c=(df.a + df.b).compute())

Is it possible to write this operation in a better way, similar to what we do in pandas? The second question is something which is troubling me more. In pandas, if I want to change the value of 'a' for rows 2 & 6 to np
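For the first part, a hedged sketch of the lazier form: keep the new column inside the task graph instead of calling .compute() inside assign (the small example frame here is my own, for illustration).

    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
    df = dd.from_pandas(pdf, npartitions=2)

    # assign stays lazy; nothing is computed until you ask for it
    df = df.assign(c=df.a + df.b)
    print(df.compute())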

How do I stop a running task in Dask?

Submitted by 坚强是说给别人听的谎言 on 2019-11-28 07:31:28
Question: When using Dask's distributed scheduler I have a task running on a remote worker that I want to stop. How do I stop it? I know about the cancel method, but this doesn't seem to work if the task has already started executing.

Answer 1: If it's not yet running

If the task has not yet started running, you can cancel it by cancelling the associated future:

    future = client.submit(func, *args)  # start task
    future.cancel()                      # cancel task

If you are using dask collections then you can use the client
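A hedged, self-contained version of that pattern (func here is a stand-in for the real task; it is not from the original answer):

    import time
    from dask.distributed import Client

    def func(n):
        time.sleep(n)
        return n

    client = Client()

    future = client.submit(func, 30)   # start task
    future.cancel()                    # request cancellation (clean if not yet running)
    print(future.cancelled())          # True once the scheduler has dropped it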