Are GPU/CUDA cores SIMD ones?

Submitted by 心已入冬 on 2019-12-03 12:23:05

Each warp is assigned to only one core (is that true?).

No, it's not true. A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).

Cores are in fact scalar processors, not vector processors. 32 cores (or execution units) are marshalled by the warp scheduler to execute a single instruction, across 32 threads, which is where the "SIMT" moniker comes from.
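
For concreteness, here is a minimal CUDA sketch of that layout (the kernel name show_warp_layout and the launch configuration are illustrative, not from the question): each group of 32 consecutive threads in a block forms one warp, and all 32 lanes issue the same instruction together.

```cuda
#include <cstdio>

__global__ void show_warp_layout()
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int lane = threadIdx.x % 32;  // position within the warp (0..31)
    int warp = threadIdx.x / 32;  // warp index within the block

    // All 32 threads of a warp execute this same instruction stream;
    // the warp scheduler issues each instruction across 32 execution
    // units ("cores") at once.
    if (lane == 0)
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp, tid);
}

int main()
{
    show_warp_layout<<<2, 64>>>(); // 2 blocks of 64 threads = 2 warps each
    cudaDeviceSynchronize();
    return 0;
}
```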

CUDA "cores" can be thought of as SIMD lanes.

First let's recall that the term "CUDA core" is NVIDIA marketing-speak. These are not cores in the same sense that a CPU has cores. Similarly, "CUDA threads" are not the same as the threads we know on CPUs.

The equivalent of a CPU core on a GPU is a "streaming multiprocessor" (SM): it has its own instruction scheduler/dispatcher, its own L1 cache, its own shared memory, and so on. It is CUDA thread blocks, rather than warps, that are assigned to a GPU core, i.e. to a streaming multiprocessor. Within an SM, warps get selected to have instructions scheduled, for the entire warp at once.

From a CUDA perspective, those are 32 separate threads which are instruction-locked; but that's really no different from saying that a warp is like a single thread which only executes 32-lane-wide SIMD instructions. Of course this isn't a perfect analogy, but I feel it's pretty sound. One thing you don't quite have on CPU SIMD lanes is a mask of which lanes are actively executing: inactive lanes don't have the effects active lanes do, i.e. they don't set register values, perform memory writes, etc.
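
If the lane-masking behaviour is easier to see in code, here is a minimal sketch (the kernel name diverge and the host-side buffer handling are my own illustrative choices, not anything from the question). The warp runs both sides of the branch in turn; whichever lanes are inactive on a given side have their stores suppressed:

```cuda
#include <cstdio>

__global__ void diverge(int *out)
{
    int lane = threadIdx.x % 32;

    // The warp executes both sides of this branch one after the other.
    // While one side runs, the lanes belonging to the other side are
    // masked off, so their register writes and stores simply don't happen.
    if (lane < 16)
        out[threadIdx.x] = 1;   // lanes 0..15 active, 16..31 masked
    else
        out[threadIdx.x] = 2;   // lanes 16..31 active, 0..15 masked
}

int main()
{
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    diverge<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i) printf("%d ", h_out[i]);
    printf("\n");
    cudaFree(d_out);
    return 0;
}
```

The output is sixteen 1s followed by sixteen 2s: every lane's store lands, but only while its side of the branch is the active one.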

I hope this makes intuitive sense to you (or perhaps you've figured this out yourself over the past 2 years).
