Julia: Parallel CUSPARSE calculations on multiple GPUs
Question: I have n separate GPUs, each storing its own data, and I would like all of them to perform a set of calculations simultaneously. The CUDArt documentation describes how to use streams to call custom C kernels asynchronously in order to achieve parallelization (see also this other example). With custom kernels, this is done by passing the stream argument to CUDArt's implementation of the launch() function. As far as I can tell, however, the CUSPARSE (or CUBLAS)