cuda

CUDA global memory load and store

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-27 19:33:23
Question: So I am trying to hide global memory latency. Take the following code:

for(int i = 0; i < N; i++){
    x = global_memory[i];
    ... do some computation on x ...
    global_memory[i] = x;
}

I wanted to know whether a load from or store to global memory is blocking, i.e. whether the next line does not run until the load or store has finished. For example, take the following code:

x_next = global_memory[0];
for(int i = 0; i < N; i++){
    x = x_next;
    x_next = global_memory[i+1];
    ... do some computation on x ...
    global_memory[i] =
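
For reference, a global load in CUDA does not by itself stall the issuing thread; the stall happens at the first instruction that actually consumes the loaded register, which is what the manual prefetch in the second loop tries to exploit. A minimal sketch of that pattern, assuming x is a float and a serial per-thread loop; the kernel name and the placeholder computation are made up:

__global__ void pipelined(float *global_memory, int N)
{
    float x_next = global_memory[0];           // issue the first load early
    for (int i = 0; i < N; i++) {
        float x = x_next;                      // value requested on the previous iteration
        if (i + 1 < N)
            x_next = global_memory[i + 1];     // start the next load before x is used
        x = x * 2.0f + 1.0f;                   // placeholder for "some computation on x"
        global_memory[i] = x;                  // stores do not block the thread either
    }
}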

Interpreting compute workload analysis in Nsight Compute

Submitted by 隐身守侯 on 2021-01-27 15:52:53
Question: Compute Workload Analysis displays the utilization of the different compute pipelines. I know that in a modern GPU the integer and floating-point pipelines are separate hardware units and can execute in parallel. However, it is not clear which pipeline represents which hardware unit for the other pipelines, and I couldn't find any documentation online about the abbreviations and how to interpret them. My questions are: 1) What are the full names of ADU, CBU, TEX, XU? How do they map

The behavior of __CUDA_ARCH__ macro

Submitted by ぃ、小莉子 on 2021-01-27 14:07:10
Question: In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly the code path of the current device. However, if __CUDA_ARCH__ is used within device code, it will generate a different code path for each device architecture specified in the compilation options (/arch). Can anyone confirm this is correct?

Answer 1: __CUDA_ARCH__, when used in device code, will carry a number defined to it that reflects the code architecture currently being compiled
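
A small sketch of that behaviour (kernel name and messages are made up): during each device-code compilation pass, nvcc defines __CUDA_ARCH__ to the architecture being compiled (e.g. 700 for sm_70), so the matching branch is kept for that architecture; during the host compilation pass the macro is not defined at all.

#include <cstdio>

__global__ void which_arch()
{
#if __CUDA_ARCH__ >= 700
    printf("device code compiled for sm_70 or newer\n");
#elif defined(__CUDA_ARCH__)
    printf("device code compiled for an older architecture\n");
#endif
}

int main()
{
#ifdef __CUDA_ARCH__
    // Never reached: __CUDA_ARCH__ is undefined while host code is compiled.
#endif
    which_arch<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}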

cuda shared memory - inconsistent results

Submitted by 三世轮回 on 2021-01-27 06:32:00
Question: I'm trying to do a parallel reduction to sum an array in CUDA. Currently I pass an array in which to store the sum of the elements in each block. This is my code:

#include <cstdlib>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <helper_cuda.h>
#include <host_config.h>

#define THREADS_PER_BLOCK 256
#define CUDA_ERROR_CHECK(ans) { gpuAssert((ans), __FILE__, __LINE__); }

using namespace std;

inline void gpuAssert(cudaError_t code, char *file, int line, bool abort
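
A common cause of inconsistent results in this kind of kernel is a missing __syncthreads() between reduction steps. For comparison, a minimal block-level shared-memory reduction that writes one partial sum per block might look like the sketch below; this is not the asker's kernel, and the names and float element type are assumptions:

__global__ void block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float s[THREADS_PER_BLOCK];     // THREADS_PER_BLOCK == 256, as in the question

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    s[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                           // every thread must have written s[tid] first

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();                       // omitting this is a classic source of inconsistent sums
    }

    if (tid == 0)
        block_sums[blockIdx.x] = s[0];         // one partial sum per block
}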

Why launch a multiple of 32 number of threads in CUDA?

Submitted by 旧街凉风 on 2021-01-27 05:31:42
Question: I took a course in CUDA parallel programming and I have seen many examples of CUDA thread configuration where it is common to round the number of threads needed up to the nearest multiple of 32. I understand that threads are grouped into warps, and that if you launch 1000 threads the GPU will round it up to 1024 anyway, so why do it explicitly?

Answer 1: The advice is generally given in the context of situations where you might conceivably choose various threadblock sizes to solve the same
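
The usual pattern behind that advice, sketched below with made-up names: pick a block size that is a multiple of 32, round the grid size up so every element is covered, and guard the kernel with a bounds check so the extra threads from the round-up do no work.

__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                  // threads past n simply do nothing
        y[i] = a * x[i] + y[i];
}

// Covering n = 1000 elements:
//   int threads = 256;                          // a multiple of 32
//   int blocks  = (n + threads - 1) / threads;  // 4 blocks -> 1024 threads total
//   saxpy<<<blocks, threads>>>(a, d_x, d_y, n);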

Implementation of bit rotate operators using SIMD in CUDA

Submitted by 怎甘沉沦 on 2021-01-27 04:35:29
Question: I know that StackOverflow is not meant for asking other people for code, but let me explain. I am trying to implement some AES functions in CUDA C++ device code. While trying to implement a left bytewise rotate operator, I was disconcerted to see that there is no native SIMD intrinsic for that. So I began a naive implementation, but... it's huge, and while I haven't tried it yet, it just won't be fast because of the expensive unpacking/packing... So, is there a way to do a per-byte bit
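
One unpacking-free approach (not taken from the question or an accepted answer, just the usual SWAR masking trick) is to shift the whole 32-bit word and mask off the bits that would bleed across byte boundaries; the function name below is made up:

// Rotate each byte of a packed 32-bit word left by r bits (0..7) without unpacking.
__device__ __forceinline__ unsigned int rotl_per_byte(unsigned int x, unsigned int r)
{
    r &= 7u;
    if (r == 0u) return x;
    unsigned int keep_lo = 0x01010101u * (0xFFu >> r);      // the low (8-r) bits of every byte
    unsigned int keep_hi = 0x01010101u * ((1u << r) - 1u);  // where every byte's high r bits land
    return ((x & keep_lo) << r) | ((x >> (8u - r)) & keep_hi);
}

For rotations by a whole number of bytes, the __byte_perm() intrinsic can permute the four bytes of a word directly.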

Add nvidia runtime to docker runtimes

Submitted by 浪子不回头ぞ on 2021-01-26 04:39:55
Question: I'm running a virtual machine on GCP with a Tesla GPU and I'm trying to deploy a PyTorch-based app there and accelerate it with the GPU. I want to make Docker use this GPU and have access to it from containers. I managed to install all the drivers on the host machine, and the app runs fine there, but when I try to run it in Docker (based on the nvidia/cuda container) PyTorch fails:

File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
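
This failure usually means the host driver is not exposed inside the container. With the NVIDIA Container Toolkit installed on the host, the nvidia runtime is registered in /etc/docker/daemon.json; the snippet below is a sketch that assumes the toolkit's default binary name and a standard installation:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

After restarting the daemon (sudo systemctl restart docker), the container can be started with docker run --runtime=nvidia ... or, on Docker 19.03 and later, with docker run --gpus all ... so that PyTorch can see the host driver.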
