x64 allows fewer threads per block than Win32?

Submitted by 99封情书 on 2019-12-11 13:26:36

Question


When executing some CUDA kernels, I noticed that for many of my own kernels the x64 build fails, whereas the Win32 build does not.

I am confused because the CUDA source code is the same and both builds compile fine. Only when the x64 build executes does it report that the launch requests too many resources. But shouldn't x64, conceptually, allow more resources than Win32?

I normally like to use 1024 threads per block when possible. To make the x64 code work, I have to reduce the block size to 256.

Does anyone have any idea?


Answer 1:


Yes, it's possible. Presumably the issue you are talking about is a registers-per-thread issue.

In 32-bit mode, all pointers are 32 bits wide and require only one 32-bit register for storage on the GPU. With the exact same source code compiled in 64-bit mode, those pointers are 64 bits wide and therefore effectively require two 32-bit registers each (and, as @njuffa points out, certain other types can change their size as well, requiring double the registers). The number of available 32-bit registers is a hardware limit that does not change whether you compile for 32-bit or 64-bit mode, but pointer storage will use twice as many registers in 64-bit mode.
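To see whether register pressure is really what changes between the two builds, one option is to ask the runtime how many registers a kernel uses and what block size it can support. The sketch below is only an illustration: `scaleKernel` is a hypothetical stand-in for one of your own kernels, and the numbers it prints depend on your GPU, toolkit version and compile flags.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one of the kernels from the question.
__global__ void scaleKernel(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

int main()
{
    // Query the compiled kernel's resource usage on the current device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // numRegs: 32-bit registers used by each thread of this kernel.
    // maxThreadsPerBlock: largest block size this kernel can be launched with,
    // given its resource usage on this device.
    std::printf("registers per thread : %d\n", attr.numRegs);
    std::printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    std::printf("host pointer size    : %zu bytes\n", sizeof(void*));
    return 0;
}
```

Building this once as Win32 and once as x64 and comparing the two outputs should show whether the 64-bit build uses more registers per thread. Compiling with `nvcc -Xptxas -v` reports the same per-kernel register counts at build time.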

Pointer arithmetic (or arithmetic involving any of the types that increase in size) may likewise be impacted, as some of it may need to be done using 64-bit arithmetic vs. 32-bit arithmetic.

If these registers-per-thread increases in 64-bit mode push your overall usage over the limit, then you will have to use one of a variety of methods to manage it. You've mentioned one already: reduce the number of threads per block. You can also investigate the nvcc -maxrregcount ... switch and/or the __launch_bounds__ directive.
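As a rough sketch of the launch bounds approach, the example below declares the maximum block size on the kernel so the compiler limits register use to fit it, and then checks the launch result explicitly. `addOne` and `BLOCK_SIZE` are hypothetical names chosen for illustration, and 1024 is taken from the block size mentioned in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 1024  // assumed target block size, taken from the question

// __launch_bounds__(BLOCK_SIZE) promises the compiler that this kernel is never
// launched with more than BLOCK_SIZE threads per block, so it limits registers
// per thread accordingly (possibly spilling to local memory to do so).
__global__ void __launch_bounds__(BLOCK_SIZE)
addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));

    const int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    addOne<<<blocks, BLOCK_SIZE>>>(d, n);

    // A launch that asks for too many resources is reported here
    // (for example as cudaErrorLaunchOutOfResources), not at compile time.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    std::printf("launch status: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

The trade-off is that capping registers can force spills to local memory, which may cost performance. Launch bounds apply per kernel, whereas -maxrregcount applies to every kernel in the compilation unit, so the per-kernel form is usually the more targeted choice.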



Source: https://stackoverflow.com/questions/35323687/x64-allows-less-threads-per-block-than-win32
