Understanding the process of loading multiple file contents into Dask Array and how it scales


Question


Using the example from http://dask.pydata.org/en/latest/array-creation.html:

from glob import glob

import h5py
import dask.array as da

filenames = sorted(glob('2015-*-*.hdf5'))
dsets = [h5py.File(fn, 'r')['/data'] for fn in filenames]  # open each file read-only
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)  # concatenate arrays along the first axis

I'm having trouble understanding the concatenate line: is x a dask array of dask arrays, or a "normal" NumPy array that points to as many dask arrays as there were datasets across all the HDF5 files?

Is there any performance gain (in threading or memory use) during the file-read stage as a result of da.from_array, or should I only expect improvements once the arrays are concatenated into the dask array x?


Answer 1:


The objects in the arrays list are all dask arrays, one for each file.

The x object is also a dask array; it combines the results of all of the dask arrays in the arrays list. It isn't a dask array of dask arrays; it's a single dask array with a larger first dimension.
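For concreteness, here is a minimal sketch of that structure, using small in-memory NumPy arrays as hypothetical stand-ins for the HDF5 datasets (the number of datasets and shapes are made up for illustration):

import numpy as np
import dask.array as da

# Three stand-in datasets, each 1000 x 1000 (hypothetical, for illustration).
dsets = [np.ones((1000, 1000)) for _ in range(3)]
arrays = [da.from_array(dset, chunks=(1000, 1000)) for dset in dsets]
x = da.concatenate(arrays, axis=0)

print(type(x))   # <class 'dask.array.core.Array'> -- a single dask array, not nested
print(x.shape)   # (3000, 1000) -- only the first dimension grows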

There will probably not be an increase in performance when reading the data. You're likely to be bound by your disk's I/O bandwidth. Most people in this situation use dask.array because they have more data than can conveniently fit into RAM. If that isn't valuable to you, I would stick with NumPy.
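One detail worth noting: da.from_array and da.concatenate only build a task graph, so none of the lines above read any data from disk. With the HDF5-backed version, the files are read chunk by chunk only when a result is actually computed. A short sketch, assuming x from the snippet in the question:

# x so far is only a task graph; no dataset chunks have been read yet.
total = x.sum()           # still lazy -- another dask array describing the result
result = total.compute()  # the chunks are read and reduced at this point
print(result)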



Source: https://stackoverflow.com/questions/39171316/understanding-the-process-of-loading-multiple-file-contents-into-dask-array-and
