cuda

CUDA global memory load and store

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-27 19:33:23
Question: So I am trying to hide global memory latency. Take the following code:

for(int i = 0; i < N; i++){
    x = global_memory[i];
    ... do some computation on x ...
    global_memory[i] = x;
}

I wanted to know whether a load from or store to global memory is blocking, i.e. whether the next line does not run until the load or store has finished. For example, take the following code:

x_next = global_memory[0];
for(int i = 0; i < N; i++){
    x = x_next;
    x_next = global_memory[i+1];
    ... do some computation on x ...
    global_memory[i] =
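
For reference, a global load in CUDA does not by itself stall the issuing thread; the stall happens at the first instruction that actually consumes the loaded register, which is what the manual prefetch in the second loop tries to exploit. A minimal sketch of that pattern, assuming x is a float and a serial per-thread loop; the kernel name and the placeholder computation are made up:

__global__ void pipelined(float *global_memory, int N)
{
    float x_next = global_memory[0];           // issue the first load early
    for (int i = 0; i < N; i++) {
        float x = x_next;                      // value requested on the previous iteration
        if (i + 1 < N)
            x_next = global_memory[i + 1];     // start the next load before x is used
        x = x * 2.0f + 1.0f;                   // placeholder for "some computation on x"
        global_memory[i] = x;                  // stores do not block the thread either
    }
}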

Interpreting compute workload analysis in Nsight Compute

Submitted by 隐身守侯 on 2021-01-27 15:52:53
Question: Compute Workload Analysis displays the utilization of the different compute pipelines. I know that in a modern GPU the integer and floating-point pipelines are separate hardware units and can execute in parallel. However, it is not clear which pipeline represents which hardware unit for the other pipelines, and I couldn't find any documentation online about the abbreviations and how to interpret them. My questions are: 1) What are the full names of ADU, CBU, TEX, XU? How do they map

The behavior of __CUDA_ARCH__ macro

Submitted by ぃ、小莉子 on 2021-01-27 14:07:10
Question: In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly the code path of the current device. However, if __CUDA_ARCH__ is used within device code, it will generate a different code path for each device architecture specified in the compilation options (/arch). Can anyone confirm this is correct?

Answer 1: __CUDA_ARCH__, when used in device code, will carry a number defined to it that reflects the code architecture currently being compiled
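
A small sketch of that behaviour (kernel name and messages are made up): during each device-code compilation pass, nvcc defines __CUDA_ARCH__ to the architecture being compiled (e.g. 700 for sm_70), so the matching branch is kept for that architecture; during the host compilation pass the macro is not defined at all.

#include <cstdio>

__global__ void which_arch()
{
#if __CUDA_ARCH__ >= 700
    printf("device code compiled for sm_70 or newer\n");
#elif defined(__CUDA_ARCH__)
    printf("device code compiled for an older architecture\n");
#endif
}

int main()
{
#ifdef __CUDA_ARCH__
    // Never reached: __CUDA_ARCH__ is undefined while host code is compiled.
#endif
    which_arch<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}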

cuda shared memory - inconsistent results

Submitted by 三世轮回 on 2021-01-27 06:32:00
Question: I'm trying to do a parallel reduction to sum an array in CUDA. Currently I pass an array in which to store the sum of the elements in each block. This is my code:

#include <cstdlib>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <helper_cuda.h>
#include <host_config.h>

#define THREADS_PER_BLOCK 256
#define CUDA_ERROR_CHECK(ans) { gpuAssert((ans), __FILE__, __LINE__); }

using namespace std;

inline void gpuAssert(cudaError_t code, char *file, int line, bool abort
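
A common cause of inconsistent results in this kind of kernel is a missing __syncthreads() between reduction steps. For comparison, a minimal block-level shared-memory reduction that writes one partial sum per block might look like the sketch below; this is not the asker's kernel, and the names and float element type are assumptions:

__global__ void block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float s[THREADS_PER_BLOCK];     // THREADS_PER_BLOCK == 256, as in the question

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    s[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                           // every thread must have written s[tid] first

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();                       // omitting this is a classic source of inconsistent sums
    }

    if (tid == 0)
        block_sums[blockIdx.x] = s[0];         // one partial sum per block
}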

Why launch a multiple of 32 number of threads in CUDA?

Submitted by 旧街凉风 on 2021-01-27 05:31:42
Question: I took a course in CUDA parallel programming and I have seen many examples of CUDA thread configuration where it is common to round the number of threads needed up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round it up to 1024 anyway, so why do it explicitly?

Answer 1: The advice is generally given in the context of situations where you might conceivably choose various threadblock sizes to solve the same
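
The usual pattern behind that advice, sketched below with made-up names: pick a block size that is a multiple of 32, round the grid size up so every element is covered, and guard the kernel with a bounds check so the extra threads from the round-up do no work.

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                  // threads past n simply do nothing
        y[i] = a * x[i] + y[i];
}

// Covering n = 1000 elements:
//   int threads = 256;                          // a multiple of 32
//   int blocks  = (n + threads - 1) / threads;  // 4 blocks -> 1024 threads total
//   saxpy<<<blocks, threads>>>(a, d_x, d_y, n);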

Implementation of bit rotate operators using SIMD in CUDA

Submitted by 怎甘沉沦 on 2021-01-27 04:35:29
Question: I know that StackOverflow is not meant for asking other people for code, but let me explain. I am trying to implement some AES functions in CUDA C++ device code. While trying to implement a left bytewise rotate operator, I was disconcerted to see that there is no native SIMD intrinsic for that. So I began a naive implementation, but... it's huge, and while I haven't tried it yet, it just won't be fast because of the expensive unpacking/packing... So, is there a way to do a per-byte bit
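
One unpacking-free approach (not taken from the question or an accepted answer, just the usual SWAR masking trick) is to shift the whole 32-bit word and mask off the bits that would bleed across byte boundaries; the function name below is made up:

// Rotate each byte of a packed 32-bit word left by r bits (0..7) without unpacking.
__device__ __forceinline__ unsigned int rotl_per_byte(unsigned int x, unsigned int r)
{
    r &= 7u;
    if (r == 0u) return x;
    unsigned int keep_lo = 0x01010101u * (0xFFu >> r);      // the low (8-r) bits of every byte
    unsigned int keep_hi = 0x01010101u * ((1u << r) - 1u);  // where every byte's high r bits land
    return ((x & keep_lo) << r) | ((x >> (8u - r)) & keep_hi);
}

For rotations by a whole number of bytes, the __byte_perm() intrinsic can permute the four bytes of a word directly.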

Add nvidia runtime to docker runtimes

Submitted by 浪子不回头ぞ on 2021-01-26 04:39:55
Question: I'm running a virtual machine on GCP with a Tesla GPU and I'm trying to deploy a PyTorch-based app there and accelerate it with the GPU. I want to make Docker use this GPU and have access to it from containers. I managed to install all the drivers on the host machine, and the app runs fine there, but when I try to run it in Docker (based on the nvidia/cuda container) PyTorch fails:

File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
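
This failure usually means the host driver is not exposed inside the container. With the NVIDIA Container Toolkit installed on the host, the nvidia runtime is registered in /etc/docker/daemon.json; the snippet below is a sketch that assumes the toolkit's default binary name and a standard installation:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

After restarting the daemon (sudo systemctl restart docker), the container can be started with docker run --runtime=nvidia ... or, on Docker 19.03 and later, with docker run --gpus all ... so that PyTorch can see the host driver.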
