Question
From a library, I get a function that reads a file and returns a numpy array.
I want to build a Dask array with multiple blocks from multiple files.
Each block is the result of calling the function on a file.
When I ask Dask to compute, will Dask ask the functions to read multiple files from the hard disk at the same time?
If so, how can I avoid that? My computer doesn't have a parallel file system.
Example:
import numpy as np
import dask.array as da
import dask
# Make test data
n = 2
m = 3
x = np.arange(n * m, dtype=int).reshape(n, m)
np.save('0.npy', x)
np.save('1.npy', x)
# np.load is a function that reads a file
# and returns a numpy array.
# Build delayed
y = [dask.delayed(np.load)('%d.npy' % i) for i in range(2)]
# Build individual Dask arrays.
# I can get the shape of each numpy array without
# reading the whole file.
z = [da.from_delayed(a, (n, m), int) for a in y]
# Combine the dask arrays
w = da.vstack(z)
print(w.compute())
Answer 1:
You could use a distributed lock primitive, so that your loader function performs acquire-read-release around each read.
import distributed

read_lock = distributed.Lock('numpy-read')

@dask.delayed
def load_numpy(lock, fn):
    lock.acquire()      # wait until no other task is reading
    out = np.load(fn)
    lock.release()      # let the next read proceed
    return out

y = [load_numpy(read_lock, '%d.npy' % i) for i in range(2)]
Also, da.from_array accepts a lock, so you could create the individual arrays with it instead, supplying the lock directly; a sketch follows.
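A minimal sketch of that variant, assuming the two .npy files from the question exist and a dask.distributed client is running; np.load with mmap_mode='r' reads only the header eagerly, and the shared lock then serializes the per-chunk disk reads:

import numpy as np
import dask.array as da
from distributed import Client, Lock

client = Client()              # assumes a distributed scheduler
read_lock = Lock('numpy-read')

# Memory-mapping is cheap and defers the real disk reads to compute time.
arrays = [np.load('%d.npy' % i, mmap_mode='r') for i in range(2)]

# from_array acquires the given lock around every chunk read.
z = [da.from_array(a, chunks=a.shape, lock=read_lock) for a in arrays]
w = da.vstack(z)
print(w.compute())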
Alternatively, you could assign a single unit of an abstract resource to the worker (which can still run multiple threads), and then compute (or persist) with a requirement of one unit per file-read task, as in the example in the linked doc; see the sketch below.
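A sketch of the resources approach, assuming a local dask.distributed cluster; the resource name 'IO' is an arbitrary label chosen here, and resource annotations only take effect with the distributed scheduler:

import dask
import numpy as np
import dask.array as da
from distributed import Client, LocalCluster

# One worker advertising a single unit of an abstract 'IO' resource.
cluster = LocalCluster(n_workers=1, threads_per_worker=4,
                       resources={'IO': 1})
client = Client(cluster)

# Low-level graph fusion can drop annotations; disabling it is the
# documented workaround if the resource constraint seems ignored.
dask.config.set({'optimization.fuse.active': False})

# Each annotated task requires one 'IO' unit, so file reads run one
# at a time even though the worker has several threads.
with dask.annotate(resources={'IO': 1}):
    y = [dask.delayed(np.load)('%d.npy' % i) for i in range(2)]

z = [da.from_delayed(a, (2, 3), int) for a in y]
print(da.vstack(z).compute())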
Response to comment: to_hdf wasn't mentioned in the question, so I am not sure why it comes up now; however, you can use da.store(compute=False) with an h5py.File, and then specify the resources to use when calling compute. Note that this does not materialise the data into memory; a sketch follows.
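A minimal sketch of that pattern, assuming w is the stacked Dask array from above; the file name out.h5 and the dataset name data are hypothetical:

import h5py
import dask.array as da

with h5py.File('out.h5', 'w') as f:
    dset = f.create_dataset('data', shape=w.shape, dtype=w.dtype)
    # store(compute=False) returns a lazy Delayed; at compute time each
    # block of w is written straight into the dataset, so the full
    # array is never materialised in memory.
    task = da.store(w, dset, compute=False)
    task.compute()  # resource requirements could be applied to this call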
Source: https://stackoverflow.com/questions/51696684/avoid-simultaneously-reading-multiple-files-for-a-dask-array