How do instruction-level parallelism and thread-level parallelism work on GPUs?
Let's say I'm trying to do a simple reduction over an array of size n, kept within one work-group... say, adding all the elements. The general strategy seems to be to spawn a number of work items on the GPU, which reduce the elements in a tree. Naively this would seem to take log n steps, but it's not as if the first wave of threads all go in one shot, is it? They get scheduled in warps.

```c
for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
    if (local_index < offset) {
        // Body filled in to match the sum described above.
        scratch[local_index] += scratch[local_index + offset];
    }
    // Needed so each step sees the previous step's writes to local memory.
    barrier(CLK_LOCAL_MEM_FENCE);
}
```
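For context, here is a minimal sketch of the whole kernel I have in mind. The names `reduce_sum`, `in`, `out`, and `scratch` are my own placeholders, and it assumes the work-group size is a power of two and that n equals the global size:

```c
// Sketch of a tree reduction over one work-group's slice of the input.
__kernel void reduce_sum(__global const float* in,
                         __global float* out,
                         __local float* scratch) {
    int global_index = get_global_id(0);
    int local_index  = get_local_id(0);

    // Each work item loads one element into local memory.
    scratch[local_index] = in[global_index];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work items each step,
    // so a group of size s finishes in log2(s) steps.
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (local_index < offset) {
            scratch[local_index] += scratch[local_index + offset];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work item 0 writes this work-group's partial sum.
    if (local_index == 0) {
        out[get_group_id(0)] = scratch[0];
    }
}
```

Each work-group would produce one partial sum in `out`, so a second pass (or a host-side add) would still be needed to combine those into the final total.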