nvidia

How to turn off errors/warnings in Eclipse due to OpenCL/CUDA syntax?

China☆狼群 submitted on 2019-12-04 10:48:04
I am using Eclipse as an editor for OpenCL and turned on syntax highlighting for *.cl files so they behave like C++ code. It works great, but all my code is underlined as syntax errors. Is there a way to keep the syntax highlighting but turn off the errors/warnings just for my *.cl files? First, the Eclipse syntax highlighter is programmed for the grammar of C and C++, not OpenCL, so it is unaware of OpenCL's syntactic extensions, such as new keywords and new data types. I suggest conditionally defining the new keywords to nothing, e.g. #define __kernel and #define __global, and …
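A minimal sketch of that suggestion, assuming a header that only the IDE's indexer sees; the guard macro ECLIPSE_INDEXER and the file name are illustrative conventions, not from the answer:

/* cl_indexer_stubs.h - define OpenCL qualifiers away so Eclipse's C++
 * indexer stops flagging them; the real OpenCL compiler never sees this. */
#ifdef ECLIPSE_INDEXER
#define __kernel
#define __global
#define __local
#define __constant
#define __private
typedef unsigned int uint;   /* OpenCL built-in type the C++ indexer lacks */
#endif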

Are GPU Kepler CC3.0 processors not only pipelined architecture, but also superscalar? [closed]

ぃ、小莉子 submitted on 2019-12-04 09:56:23
The CUDA 6.5 documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb, 5.2.3. Multiprocessor Level) says: ... 8L for devices of compute capability 3.x since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x. Does this mean that Kepler CC 3.0 GPUs are not only a pipelined architecture, but also superscalar? Pipelining - these two sequences execute in parallel (different operations at one time):
LOAD [addr1] -> ADD -> STORE [addr1] -> NOP
NOP -> …
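A minimal CUDA sketch (my illustration, not from the question) of the instruction-level parallelism a dual-issue warp scheduler can exploit: the two multiply-adds below are independent, so Kepler's schedulers may, in principle, issue them in the same cycle.

__global__ void dual_issue_demo(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i] * 2.0f + 1.0f;  // independent of the next line,
        float y = b[i] * 3.0f + 2.0f;  // so the two FMAs can be dual-issued
        out[i] = x + y;                // depends on both, issued later
    }
}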

Maximum blocks per grid: CUDA

£可爱£侵袭症+ submitted on 2019-12-04 08:55:33
Question: What is the maximum number of blocks in a grid that can be created per kernel launch? I am slightly confused here, since the compute capability table here says that there can be 65535 blocks per grid dimension in CUDA compute capability 2.0. Does that mean the total number of blocks = 65535*65535? Or does it mean that you can arrange at most 65535 blocks, either as a 1D grid of 65535 blocks or a 2D grid of sqrt(65535) * sqrt(65535)? Thank you. Answer 1: 65535 per dimension of the grid. On compute 1.x cards, …
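Rather than hard-coding 65535, the per-dimension limits can be queried at run time; a short sketch using the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("max grid: %d x %d x %d blocks\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}

On compute capability 2.0 this prints 65535 for each dimension; compute capability 3.0 and later raise the x dimension to 2^31 - 1.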

Compile OpenCL on Mingw Nvidia SDK

跟風遠走 submitted on 2019-12-04 08:28:44
Question: Is it possible to compile OpenCL using MinGW and the Nvidia SDK? I'm aware that it's not officially supported, but that just doesn't make sense. Aren't the libraries provided as statically linked libraries? I mean, once compiled with whatever compiler that may be, and linked successfully, what should be the problem? I managed to compile and successfully link my code to the OpenCL libraries provided with Nvidia's SDK, however the executable throws a segmentation fault at clGetPlatformIDs, which is the …
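One commonly suggested workaround (an assumption on my part, not confirmed by this post): instead of linking MinGW-built code against NVIDIA's MSVC-built OpenCL.lib, generate a MinGW-compatible import library from the system's OpenCL.dll and link against that. Paths and names below are illustrative:

gendef C:\Windows\System32\OpenCL.dll
dlltool -d OpenCL.def -l libOpenCL.a
g++ main.cpp -I"%CUDA_PATH%\include" -L. -lOpenCL -o test.exe

gendef and dlltool ship with MinGW-w64 toolchains.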

NVIDIA NVML Driver/library version mismatch

偶尔善良 submitted on 2019-12-04 07:23:13
Question: When I run nvidia-smi I get the following message: Failed to initialize NVML: Driver/library version mismatch. An hour ago I received the same message; I uninstalled my CUDA library and was then able to run nvidia-smi, getting the following result: [output omitted]. After this I downloaded cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb from the official NVIDIA page and then simply:
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
export PATH= …
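For the mismatch itself, a common recovery (my suggestion, not from the post) is to reload the NVIDIA kernel modules so the loaded module matches the freshly installed user-space driver, or simply reboot:

sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia   # dependents first
sudo modprobe nvidia
nvidia-smi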

OpenCL crashes on call to clGetPlatformIDs

亡梦爱人 submitted on 2019-12-04 07:11:43
I am new to OpenCL, working on a Core i5 machine with Intel(R) HD Graphics 4000, running Windows 7. I installed the newest Intel driver with support for OpenCL, and GpuCapsViewer confirms I have OpenCL support set up. I developed a simple HelloWorld program using the Intel OpenCL SDK. The program compiles successfully, but when run it crashes with a segmentation fault upon the call to clGetPlatformIDs(). This is my code:

#include <iostream>
#include <CL/opencl.h>

int main()
{
    std::cout << "Test OCL without driver" << std::endl;
    cl_int err;
    cl_uint num_platforms;
    err = clGetPlatformIDs(0, NULL, &num_platforms);
    …

OpenCL read variable size result buffer from the GPU

巧了我就是萌 submitted on 2019-12-04 06:37:27
Question: I have a searching OpenCL 1.1 algorithm which works well with a small amount of data:
1.) build the inputData array and pass it to the GPU
2.) create a very big resultData container (e.g. 200000 * sizeof(cl_uint)) and pass this one too
3.) create the resultSize container (initialized to zero) which can be accessed via atomic operations (at least I suppose so)
When one of my workers has a result, it copies it into the resultData buffer and increments resultSize with an atomic inc operation …
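A minimal OpenCL C sketch of the pattern described above (kernel and buffer names are illustrative): each work-item that finds a match reserves a unique slot by atomically incrementing the shared counter.

__kernel void search(__global const uint* inputData,
                     __global uint* resultData,
                     __global volatile uint* resultSize,
                     uint needle)
{
    uint i = get_global_id(0);
    if (inputData[i] == needle) {
        uint slot = atomic_inc(resultSize);  /* returns the old value */
        resultData[slot] = i;                /* write into our private slot */
    }
}

On the host side, only the used portion needs to travel back: first read the single cl_uint in resultSize, then clEnqueueReadBuffer just resultSize * sizeof(cl_uint) bytes of resultData.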

Can't we use atomic operations for floating point variables in CUDA?

妖精的绣舞 submitted on 2019-12-04 05:50:26
I have used atomicMax() to find the maximum value in a CUDA kernel:

__global__ void global_max(float* values, float* gl_max)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    float val = values[i];
    atomicMax(gl_max, val);
}

It throws the following error:

error: no instance of overloaded function "atomicMax" matches the argument list

The argument types are: (float *, float). The short answer is no. As you can see from the atomic function documentation, only integer arguments are supported for atomicMax, and 64-bit integer arguments are only supported on compute capability 3.5 devices.
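A common workaround (a sketch, not part of the CUDA API): emulate a float atomicMax with atomicCAS on the value's bit pattern. This version compares as floats, so it handles negative values correctly, though not NaNs:

__device__ float atomicMaxFloat(float* addr, float val)
{
    int* iaddr = (int*)addr;
    int old = *iaddr, assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= val)
            break;                            // already at least val, nothing to do
        old = atomicCAS(iaddr, assumed, __float_as_int(val));
    } while (assumed != old);                 // retry if another thread intervened
    return __int_as_float(old);
}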

Find max/min in CUDA without passing it to the CPU

醉酒当歌 submitted on 2019-12-04 05:37:39
Question: I need to find the index of the maximum element in an array of floats. I am using the function cublasIsamax, but this returns the index to the CPU, and this is slowing down the running time of the application. Is there a way to compute this index efficiently and store it on the GPU? Thanks! Answer 1: Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), routines that return a scalar or an index can store the result directly into a variable in device memory, rather than …
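A short sketch of that device-pointer mode (the wrapper function is mine; cublasSetPointerMode and cublasIsamax are real CUBLAS V2 calls): the index is written to device memory and never travels to the host.

#include <cublas_v2.h>

void argmax_on_device(cublasHandle_t handle, const float* d_x, int n, int* d_result)
{
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamax(handle, n, d_x, 1, d_result);   // d_result is a device pointer
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST); // restore the default
}

Note that cublasIsamax reports a 1-based index, following BLAS convention.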

How many threads per core are assumed when calculating GFLOPS of Nvidia GPU cards?

被刻印的时光 ゝ submitted on 2019-12-04 05:26:35
Question: I am interested in obtaining the number of nanoseconds it would take to execute 1 double-precision FLOP on a GeForce GTX 550 Ti. In order to do that I am following this approach: I found out that the single-precision peak performance of the card is 691.2 GFLOPS, which means the double-precision peak performance would be 1/8 of it, i.e. 86.4 GFLOPS. Then, in order to obtain FLOPS per core, I divide the 86.4 GFLOPS by the number of cores, 192, which gives me 0.45 GFLOPS per core. 0.45 GFLOPS means …
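The question's arithmetic, spelled out (the numbers are the question's own; note that dividing a peak figure by core count assumes every core is busy every cycle, which is exactly what peak GFLOPS already presumes):

#include <cstdio>

int main()
{
    double sp_peak = 691.2;               // GFLOPS, single precision (GTX 550 Ti)
    double dp_peak = sp_peak / 8.0;       // 86.4 GFLOPS double precision
    double per_core = dp_peak / 192.0;    // 0.45 GFLOPS per CUDA core
    double ns_per_flop = 1.0 / per_core;  // ~2.22 ns per DP FLOP per core
    printf("%.2f GFLOPS/core -> %.2f ns per DP FLOP\n", per_core, ns_per_flop);
    return 0;
}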