How to pass data bigger than the VRAM size into the GPU?

Submitted by 一世执手 on 2020-06-26 15:53:31

Question


I am trying to pass more data into my GPU than it has VRAM, which results in the following error:

CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I created this code to recreate the problem:

from numba import cuda
import numpy as np


@cuda.jit()
def addingNumbers(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i, j] = big_array[i, j] * big_array2[i, j]



big_array = np.random.random_sample((1000000, 500))
big_array2  = np.random.random_sample((1000000, 500))
save_array = np.zeros(shape=(1000000, 500))


arraysize = 1000000
threadsperblock = 64
blockspergrid = (arraysize + (threadsperblock - 1)) // threadsperblock  # ceiling division


d_big_array = cuda.to_device(big_array)
d_big_array2 = cuda.to_device(big_array2)
d_save_array = cuda.to_device(save_array)

addingNumbers[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)

save_array = d_save_array.copy_to_host()

Is there a way to dynamically pass data into the GPU so that I can handle more data than the VRAM can hold? If not, what would be the recommended way to manually pass all of this data to the GPU? Is using dask_cuda an option, or something of that nature?


Answer 1:


A well-written example of how to take a larger problem (i.e. dataset), break it into pieces, and handle the processing piecewise in numba CUDA is here. In particular, the variant of interest is pricer_cuda_overlap.py. Unfortunately, that example makes use of what I believe is deprecated random number generation functionality in accelerate.cuda.rand, so it's not directly runnable in today's numba (I think).

However, for the purposes of the question here, the random number generation process is irrelevant, so we can simply remove it without affecting the important observations. What follows, then, is a single file assembled from pieces of the various files in that example:

$ cat t45.py
#! /usr/bin/env python
"""
This version demonstrates copy-compute overlapping through multiple streams.
"""
from __future__ import print_function

import math
import sys

import numpy as np

from numba import cuda, jit

from math import sqrt, exp
from timeit import default_timer as timer
from collections import deque

StockPrice = 20.83
StrikePrice = 21.50
Volatility = 0.021  #  per year
InterestRate = 0.20

Maturity = 5. / 12.

NumPath = 500000
NumStep = 200

def driver(pricer, pinned=False):
    paths = np.zeros((NumPath, NumStep + 1), order='F')
    paths[:, 0] = StockPrice
    DT = Maturity / NumStep

    if pinned:
        from numba import cuda
        with cuda.pinned(paths):
            ts = timer()
            pricer(paths, DT, InterestRate, Volatility)
            te = timer()
    else:
        ts = timer()
        pricer(paths, DT, InterestRate, Volatility)
        te = timer()

    ST = paths[:, -1]
    PaidOff = np.maximum(paths[:, -1] - StrikePrice, 0)
    print('Result')
    fmt = '%20s: %s'
    print(fmt % ('stock price', np.mean(ST)))
    print(fmt % ('standard error', np.std(ST) / sqrt(NumPath)))
    print(fmt % ('paid off', np.mean(PaidOff)))
    optionprice = np.mean(PaidOff) * exp(-InterestRate * Maturity)
    print(fmt % ('option price', optionprice))

    print('Performance')
    NumCompute = NumPath * NumStep
    print(fmt % ('Mstep/second', '%.2f' % (NumCompute / (te - ts) / 1e6)))
    print(fmt % ('time elapsed', '%.3fs' % (te - ts)))

class MM(object):
    """Memory Manager

    Maintain a freelist of device memory for reuse.
    """
    def __init__(self, shape, dtype, prealloc):
        self.device = cuda.get_current_device()
        self.freelist = deque()
        self.events = {}
        for i in range(prealloc):
            gpumem = cuda.device_array(shape=shape, dtype=dtype)
            self.freelist.append(gpumem)
            self.events[gpumem] = cuda.event(timing=False)

    def get(self, stream=0):
        assert self.freelist
        gpumem = self.freelist.popleft()
        evnt = self.events[gpumem]
        if not evnt.query(): # not ready?
            # querying is faster than waiting
            evnt.wait(stream=stream) # future work must wait
        return gpumem

    def free(self, gpumem, stream=0):
        evnt = self.events[gpumem]
        evnt.record(stream=stream)
        self.freelist.append(gpumem)


if sys.version_info[0] == 2:
    range = xrange

@jit('void(double[:], double[:], double, double, double, double[:])',
     target='cuda')
def cu_step(last, paths, dt, c0, c1, normdist):
    i = cuda.grid(1)
    if i >= paths.shape[0]:
        return
    noise = normdist[i]
    paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)

def monte_carlo_pricer(paths, dt, interest, volatility):
    n = paths.shape[0]
    num_streams = 2

    part_width = int(math.ceil(float(n) / num_streams))
    partitions = [(0, part_width)]
    for i in range(1, num_streams):
        begin, end = partitions[i - 1]
        begin, end = end, min(end + (end - begin), n)
        partitions.append((begin, end))
    partlens = [end - begin for begin, end in partitions]

    mm = MM(shape=part_width, dtype=np.double, prealloc=10 * num_streams)

    device = cuda.get_current_device()
    blksz = device.MAX_THREADS_PER_BLOCK
    gridszlist = [int(math.ceil(float(partlen) / blksz))
                  for partlen in partlens]

    strmlist = [cuda.stream() for _ in range(num_streams)]

    # Allocate device side array - in original example this would be initialized with random numbers
    d_normlist = [cuda.device_array(partlen, dtype=np.double, stream=strm)
                  for partlen, strm in zip(partlens, strmlist)]

    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * math.sqrt(dt)

    # Configure the kernel
    # Similar to CUDA-C: cu_monte_carlo_pricer<<<gridsz, blksz, 0, stream>>>
    steplist = [cu_step[gridsz, blksz, strm]
               for gridsz, strm in zip(gridszlist, strmlist)]

    d_lastlist = [cuda.to_device(paths[s:e, 0], to=mm.get(stream=strm))
                  for (s, e), strm in zip(partitions, strmlist)]

    for j in range(1, paths.shape[1]):

        d_pathslist = [cuda.to_device(paths[s:e, j], stream=strm,
                                      to=mm.get(stream=strm))
                       for (s, e), strm in zip(partitions, strmlist)]

        for step, args in zip(steplist, zip(d_lastlist, d_pathslist, d_normlist)):
            d_last, d_paths, d_norm = args
            step(d_last, d_paths, dt, c0, c1, d_norm)

        for d_paths, strm, (s, e) in zip(d_pathslist, strmlist, partitions):
            d_paths.copy_to_host(paths[s:e, j], stream=strm)
            mm.free(d_paths, stream=strm)
        d_lastlist = d_pathslist

    for strm in strmlist:
        strm.synchronize()

if __name__ == '__main__':
    driver(monte_carlo_pricer, pinned=True)
$ python t45.py
Result
         stock price: 22.6720614385
      standard error: 0.0
            paid off: 1.17206143849
        option price: 1.07834858009
Performance
        Mstep/second: 336.40
        time elapsed: 0.297s
$

There's a lot going on in this example, and the general topic of how to write pipelined/overlapped code in CUDA would be an entire answer by itself, so I will just cover the highlights. The general topic is well covered in this blog post, albeit with CUDA C++ in view rather than numba CUDA (Python). However, there is a 1:1 correspondence between most items of interest in numba CUDA and their equivalent counterparts in CUDA C++. Therefore I will assume that basic concepts, such as CUDA streams and how they are used to arrange asynchronous concurrent activity, are understood.
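If those pieces are unfamiliar, here is a minimal sketch (not taken from the example above; the buffer name is just illustrative) of the numba CUDA primitives it relies on: pinned host memory, a stream, and asynchronous copies queued on that stream:

import numpy as np
from numba import cuda

# pinned (page-locked) host memory is what allows copies to be truly asynchronous
host_buf = cuda.pinned_array(1024, dtype=np.float64)
host_buf[:] = 1.0

strm = cuda.stream()                            # a stream is an ordered queue of GPU work
d_buf = cuda.to_device(host_buf, stream=strm)   # host-to-device copy queued on the stream
# a kernel launched as kernel[griddim, blockdim, strm](...) would be queued on the same stream
d_buf.copy_to_host(host_buf, stream=strm)       # device-to-host copy queued on the stream
strm.synchronize()                              # block the host until the stream's work is done

Work queued on different streams may overlap; work within a single stream executes in issue order.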

So what is this example doing? I'll focus mostly on the CUDA aspects.

  • with a view toward overlap of copy and compute operations, the input data (paths) is converted to CUDA pinned memory on the host
  • with a view towards handling the work in chunks, a memory manager (MM) is defined, which will allow chunk allocations of device memory to be reused as the processing proceeds.
  • Python lists are created to represent the sequence of chunk processing: a list defining the start and end of each chunk (partition), a list of the CUDA streams to be used, and a list of the data-array partitions that the CUDA kernel will work on.
  • then, with these lists, work is issued in "depth-first order". For each stream, the data (chunk) needed by that stream is transferred to the device (queued for transfer), the kernel that will process that data is launched (queued), and the transfer that will send the results from that chunk back to host memory is queued. This process is repeated in the for j loop in monte_carlo_pricer for the number of steps (paths.shape[1]). A stripped-down sketch that applies this same pattern to the arrays in the question is given at the end of this answer.

When I run the above code using a profiler, we can see a timeline that looks like this:

[profiler timeline screenshot]

In this particular case, I am running this on a Quadro K2000, an old, small GPU that has only one copy engine. Therefore we see in the profile that at most one copy operation is overlapped with CUDA kernel activity, and no copy operations are overlapped with other copy operations. However, if I ran this on a device with two copy engines, I would expect a tighter/denser timeline to be possible, with two copy operations and a compute operation overlapping at the same time, for maximum throughput. To achieve this, the number of streams in use (num_streams) would also have to be increased to at least 3.

The code here is not guaranteed to be defect free. It is provided for demonstration purposes. Use it at your own risk.
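Finally, to connect this back to the question: the same pattern (pinned host arrays, a small pool of streams, and a per-chunk sequence of host-to-device copy, kernel launch, and device-to-host copy) can be applied to the element-wise multiply in the question. The following is a rough, untested sketch along those lines; the chunk size, stream count, and helper names are arbitrary choices of mine, and the chunk size must be picked so that the per-stream device buffers fit in your GPU's memory:

import numpy as np
from numba import cuda

@cuda.jit
def multiplyChunk(a_chunk, b_chunk, out_chunk):
    # element-wise multiply of one chunk of rows
    i = cuda.grid(1)
    if i < a_chunk.shape[0]:
        for j in range(a_chunk.shape[1]):
            out_chunk[i, j] = a_chunk[i, j] * b_chunk[i, j]

def multiply_in_chunks(big_array, big_array2, save_array,
                       rows_per_chunk=50000, num_streams=2):
    nrows, ncols = big_array.shape
    threadsperblock = 64
    streams = [cuda.stream() for _ in range(num_streams)]

    # one set of reusable, chunk-sized device buffers per stream
    # (num_streams * 3 of these must fit in device memory)
    d_a = [cuda.device_array((rows_per_chunk, ncols), dtype=big_array.dtype) for _ in streams]
    d_b = [cuda.device_array((rows_per_chunk, ncols), dtype=big_array.dtype) for _ in streams]
    d_o = [cuda.device_array((rows_per_chunk, ncols), dtype=save_array.dtype) for _ in streams]

    for chunk_idx, start in enumerate(range(0, nrows, rows_per_chunk)):
        end = min(start + rows_per_chunk, nrows)
        n = end - start
        s = chunk_idx % num_streams          # round-robin over the streams
        strm = streams[s]

        # queue the copies and the kernel for this chunk on its stream;
        # reusing the per-stream buffers is safe because work on one stream runs in order
        d_a[s][:n].copy_to_device(big_array[start:end], stream=strm)
        d_b[s][:n].copy_to_device(big_array2[start:end], stream=strm)
        blockspergrid = (n + threadsperblock - 1) // threadsperblock
        multiplyChunk[blockspergrid, threadsperblock, strm](d_a[s][:n], d_b[s][:n], d_o[s][:n])
        d_o[s][:n].copy_to_host(save_array[start:end], stream=strm)

    for strm in streams:
        strm.synchronize()

# hypothetical usage: pinning the host arrays is what allows the queued copies to be asynchronous;
# with ordinary (pageable) numpy arrays the code is still correct, just without copy/compute overlap
# with cuda.pinned(big_array, big_array2, save_array):
#     multiply_in_chunks(big_array, big_array2, save_array)

As with the pricer example, this is only a demonstration of the chunking/overlap idea, not a tuned or validated implementation.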



Source: https://stackoverflow.com/questions/56176077/how-to-pass-data-bigger-than-the-vram-size-into-the-gpu
