gpgpu

How to write a fragment shader in GLSL to sort an array of 9 floating point numbers

Submitted by 主宰稳场 on 2019-12-04 11:42:09
I am writing a fragment shader in order to compute the median of 9 images. I have never worked with GLSL before, but it seemed like the right tool for the job, since OpenCL isn't available on iOS and computing the median on the CPU is inefficient. Here's what I have so far:

    uniform sampler2D frames[9];
    uniform vec2 wh;

    void main(void) {
        vec4 sortedFrameValues[9];
        float sortedGrayScaleValues[9];
        for (int i = 0; i < 9; i++) {
            sortedFrameValues[i] = texture2D(frames[i], -gl_FragCoord.xy / wh);
            sortedGrayScaleValues[i] = dot(sortedFrameValues[i].xyz, vec3(0.299, 0.587, 0.114));
        }
        // TODO: Sort sortedGrayScaleValues
    }

L2 cache in NVIDIA Fermi

Submitted by £可爱£侵袭症+ on 2019-12-04 11:26:56
When looking at the names of the performance counters in the NVIDIA Fermi architecture (the file Compute_profiler.txt in the doc folder of CUDA), I noticed that for L2 cache misses there are two performance counters, l2_subp0_read_sector_misses and l2_subp1_read_sector_misses. The documentation says these are for two slices of L2. Why are there two slices of L2? Is there any relation to the streaming multiprocessor architecture? What would be the effect of this division on performance? Thanks.

I don't think there is any direct relation with the streaming multiprocessor. I just think that slice is

__forceinline__ effect at CUDA C __device__ functions

Submitted by 拥有回忆 on 2019-12-04 11:19:32
Question: There is a lot of advice on when to use inline functions and when to avoid them in regular C coding. What is the effect of __forceinline__ on CUDA C __device__ functions? Where should it be used and where avoided?

Answer 1: Normally the nvcc device code compiler will make its own decisions about when to inline a particular __device__ function and, generally speaking, you probably don't need to worry about overriding that with the __forceinline__ decorator/directive. cc 1.x devices don't have

Are GPU/CUDA cores SIMD ones?

Submitted by 自闭症网瘾萝莉.ら on 2019-12-04 11:11:57
Question: Let's take the NVIDIA Fermi Compute Architecture. It says: The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. [...] Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). [...] In Fermi, the newly designed integer ALU supports full 32-bit precision

Overlapping transfers and device computation in OpenCL

Submitted by 喜你入骨 on 2019-12-04 11:04:19
Question: I am a beginner with OpenCL and I am having difficulty understanding something. I want to improve the transfers of an image between host and device. I made a diagram to illustrate. Top: what I have now | Bottom: what I want. HtD (Host to Device) and DtH (Device to Host) are memory transfers; K1 and K2 are kernels. I thought about using mapped memory, but isn't the first transfer (Host to Device) done with the clSetKernelArg() command? Or do I have to cut my input image into sub-image

How to quickly compact a sparse array with CUDA C?

Submitted by 人走茶凉 on 2019-12-04 11:04:07
Summary: Array [A - B - - - C] in device memory, but I want [A B C]. What's the quickest way with CUDA C?

Context: I have an array A of integers in device (GPU) memory. At each iteration, I randomly choose a few elements that are larger than 0 and subtract 1 from them. I maintain a sorted lookup array L of the elements that have become 0:

Array A:
@ iteration i:     [0 1 0 3 3 2 0 1 2 3]
@ iteration i + 1: [0 0 0 3 2 2 0 1 2 3]

Lookup array L for 0-elements:
@ iteration i:     [0 - 2 - - - 6 - - -] -> want compacted form: [0 2 6]
@ iteration i + 1: [0 1 2 - - - 6 - - -] -> want compacted form: [0 1 2 6]
(

Is it worth offloading FFT computation to an embedded GPU?

Submitted by 人盡茶涼 on 2019-12-04 10:32:20
Question: We are considering porting an application from a dedicated digital signal processing chip to run on generic x86 hardware. The application does a lot of Fourier transforms, and from brief research it appears that FFTs are fairly well suited to computation on a GPU rather than a CPU. For example, this page has some benchmarks with a Core 2 Quad and a GeForce 8800 GTX that show a 10-fold decrease in calculation time when using the GPU: http://www.cv.nrao.edu/~pdemores/gpu/ However, in our product,

Are GPU Kepler CC3.0 processors not only pipelined architecture, but also superscalar? [closed]

Submitted by ぃ、小莉子 on 2019-12-04 09:56:23
The CUDA 6.5 documentation says (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ixzz3PIXMTktb):

5.2.3. Multiprocessor Level ... 8L for devices of compute capability 3.x since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x.

Does this mean that Kepler CC 3.0 GPUs are not only pipelined, but also superscalar?

Pipelining - these two sequences execute in parallel (different operations at one time):
LOAD [addr1] -> ADD -> STORE [addr1] -> NOP
NOP ->

Why use SIMD if we have GPGPU? [closed]

Submitted by 本秂侑毒 on 2019-12-04 09:41:13
Question: [Closed as opinion-based; not accepting answers. Closed 5 years ago.]

Now that we have GPGPUs with languages like CUDA and OpenCL, do the multimedia SIMD extensions (SSE/AVX/NEON) still serve a purpose? I read an article recently about how SSE instructions could be used to accelerate sorting networks. I thought this was pretty neat, but when I

Does Apache Mesos recognize GPU cores?

Submitted by 佐手、 on 2019-12-04 07:58:08
In slide 25 of this talk by Twitter's Head of Open Source office, the presenter says that Mesos allows one to track and manage even GPU (I assume he meant GPGPU) resources, but I can't find any information on this anywhere else. Can someone please help? Besides Mesos, are there other cluster managers that support GPGPU?

Mesos does not yet provide direct support for (GP)GPUs, but it does support custom resource types. If you specify --resources="gpu(*):8" when starting the mesos-slave, then this will become part of the resource offer to frameworks, which can launch tasks that claim to use these