What Causes Instruction Replay Overhead in CUDA
Question: I ran the Visual Profiler on a CUDA application of mine. The application calls a single kernel multiple times if the data is too large. This kernel has no branching. The profiler reports a high instruction replay overhead of 83.6% and a high global memory instruction replay overhead of 83.5%. Here is how the kernel generally looks:

```cuda
// Decryption kernel
__global__ void dev_decrypt(uint8_t *in_blk, uint8_t *out_blk){
    __shared__ volatile word sdata[256];
    register uint32_t data;

    // Thread ID
```