Artificially downgrade CUDA compute capabilities to simulate other hardware

Submitted by 一个人想着一个人 on 2021-01-29 03:48:07

Question


I am developing software that should run on a variety of CUDA GPUs with differing amounts of memory and compute capability. More than once, customers have reported a reproducible problem on their GPU that I couldn't reproduce on my machine: maybe because I have 8 GB of GPU memory and they have 4 GB, or because my card is compute capability 3.0 rather than their 2.0, things like that.

Thus the question: can I temporarily "downgrade" my GPU so that it pretends to be a lesser model, with a smaller amount of memory and/or a less advanced compute capability?

Per the comments, here is a clarification of what I'm asking.

Suppose a customer reports a problem running on a GPU with compute capability C, M gigabytes of GPU memory, and T threads per block. The GPU in my machine is better: higher compute capability, more memory, and more threads per block.

  1. Can I run my program on my GPU restricted to M gigabytes of GPU memory? The answer seems to be "yes: just allocate (however much memory you have) minus M at startup and never touch it; that leaves only M available until your program exits" (see the sketch after this list).

  2. Can I limit block sizes on my GPU to no more than T threads for the duration of the run?

  3. Can I lower the compute capability my program sees on my GPU for the duration of the run?
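
For point 1, a minimal sketch of the ballast-allocation trick (my illustration, not from the original post): grab everything above the target budget at startup and hold it until exit. M_BYTES stands in for the customer's memory size, and the exact amount you can grab may fall slightly short of the reported free memory due to allocation granularity.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t M_BYTES = 4ULL << 30;          // simulate a 4 GB card
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);  // what is free right now

    void* ballast = nullptr;
    if (free_bytes > M_BYTES) {
        // Allocate the excess and never touch it; only M_BYTES remain
        // usable until the program exits.
        cudaMalloc(&ballast, free_bytes - M_BYTES);
    }
    printf("simulating %zu bytes of usable GPU memory\n", M_BYTES);
    // ... run the code under test here ...
    cudaFree(ballast);
    return 0;
}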


Answer 1:


I originally wanted to make this a comment, but it grew far too big for that scope.

As @RobertCrovella mentioned, there is no native way to do what you're asking for. That said, you can take the following measures to minimize the bugs you only see on other architectures.

0) Try to get the output of cudaGetDeviceProperties for the CUDA GPUs you want to target. You could crowd-source this from your users or the community.
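
For reference, here is a short program (a sketch, not from the original answer) that dumps the properties most relevant to reproducing a customer's setup:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("name: %s\n", prop.name);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("global memory: %zu bytes\n", prop.totalGlobalMem);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}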

1) To restrict memory, you can either implement a memory manager that manually keeps track of the memory in use, or use cudaMemGetInfo to get a fairly close estimate. Note: the free-memory figure it reports is reduced by other applications' allocations as well.
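
A memory manager in this spirit can be as simple as a budgeted wrapper around cudaMalloc/cudaFree. The sketch below is hypothetical (budgetedMalloc and budgetedFree are illustrative names, not a CUDA API): it makes allocations fail once the simulated budget is exhausted, mimicking a smaller card.

#include <cstddef>
#include <cuda_runtime.h>

static size_t g_budget = 4ULL << 30;  // simulated 4 GB limit
static size_t g_used = 0;             // bytes handed out so far

cudaError_t budgetedMalloc(void** ptr, size_t bytes) {
    if (g_used + bytes > g_budget)
        return cudaErrorMemoryAllocation;  // behave like a smaller card
    cudaError_t err = cudaMalloc(ptr, bytes);
    if (err == cudaSuccess)
        g_used += bytes;
    return err;
}

cudaError_t budgetedFree(void* ptr, size_t bytes) {
    g_used -= bytes;  // caller passes the size it originally allocated
    return cudaFree(ptr);
}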

2) Have a wrapper macro for kernel launches that explicitly checks whether the number of blocks / threads fits the current profile. That is, instead of launching

kernel<float><<<blocks, threads>>>(a, b, c);

You'd do something like this:

LAUNCH_KERNEL((kernel<float>), blocks, threads, a, b, c);

Where you can have the macro be defined like this:

#define LAUNCH_KERNEL(kernel, blocks, threads, ...)               \
        do {                                                      \
            check_blocks(blocks);   /* grid within the profile? */\
            check_threads(threads); /* block size within limit? */\
            kernel<<<(blocks), (threads)>>>(__VA_ARGS__);         \
        } while (0) /* do/while(0) keeps the macro safe in if/else */
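
check_blocks and check_threads are left undefined in the answer. A minimal sketch, assuming the target profile's limits live in globals (the names and values here are illustrative, e.g. filled in from a customer's cudaGetDeviceProperties dump):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static dim3 g_max_grid(65535, 65535, 65535);     // simulated grid limits
static unsigned g_max_threads_per_block = 1024;  // simulated block limit

static void check_blocks(dim3 blocks) {
    if (blocks.x > g_max_grid.x || blocks.y > g_max_grid.y ||
        blocks.z > g_max_grid.z) {
        fprintf(stderr, "grid (%u,%u,%u) exceeds the simulated device\n",
                blocks.x, blocks.y, blocks.z);
        exit(EXIT_FAILURE);
    }
}

static void check_threads(dim3 threads) {
    if (1ULL * threads.x * threads.y * threads.z > g_max_threads_per_block) {
        fprintf(stderr, "block (%u,%u,%u) exceeds %u threads per block\n",
                threads.x, threads.y, threads.z, g_max_threads_per_block);
        exit(EXIT_FAILURE);
    }
}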

3) Reducing the compute capability at runtime is not possible. You can, however, compile your code for several compute capabilities and make sure your kernels contain backwards-compatible code paths. If a certain part of your kernel fails on an older compute capability, you can do something like this:

#if !defined(TEST_FALLBACK) && __CUDA_ARCH__ >= 300 // or any newer compute capability
// Implement using new fancy feature
#else
// Implement a fallback version
#endif

You can define TEST_FALLBACK whenever you want to exercise your fallback code and make sure it works on older compute capabilities.
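
As a concrete illustration of this pattern (my example, not from the original answer): native double-precision atomicAdd only exists from compute capability 6.0, and the CUDA C Programming Guide's atomicCAS loop is the standard fallback for older devices.

__device__ double atomicAddDouble(double* address, double val) {
#if !defined(TEST_FALLBACK) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600
    // Hardware double-precision atomicAdd, available since cc 6.0.
    return atomicAdd(address, val);
#else
    // CAS-based fallback from the CUDA C Programming Guide.
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
#endif
}

Defining TEST_FALLBACK forces the CAS path even on a modern card, so the fallback gets exercised during development.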



Source: https://stackoverflow.com/questions/39044236/artificially-downgrade-cuda-compute-capabilities-to-simulate-other-hardware
