What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what and what is parallelized
For the GTX 970 there are 13 Streaming Multiprocessors (SM) with 128 Cuda Cores each. Cuda Cores are also called Stream Processors (SP).
You can define grids which maps blocks to the GPU.
You can define blocks which map threads to Stream Processors (the 128 Cuda Cores per SM).
One warp is always formed by 32 threads and all threads of a warp are executed simulaneously.
To use the full possible power of a GPU you need much more threads per SM than the SM has SPs. For each Compute Capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for a SM to have the resources (number of SPs free), then it is loaded. The SM starts to execute Warps. Since one Warp only has 32 Threads and a SM has for example 128 SPs a SM can execute 4 Warps at a given time. The thing is if the threads do memory access the thread will block until its memory request is satisfied. In numbers: An arithmetic calculation on the SP has a latency of 18-22 cycles while a non-cached global memory access can take up to 300-400 cycles. This means if the threads of one warp are waiting for data only a subset of the 128 SPs would work. Therefor the scheduler switches to execute another warp if available. And if this warp blocks it executes the next and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (from how many warps the SM can choose to execute). If the occupancy is high it is more unlikely that there is no work for the SPs.
Your statement that each cuda core will execute one block at a time is wrong. If you talk about Streaming Multiprocessors they can execute warps from all thread which reside in the SM. If one block has a size of 256 threads and your GPU allowes 2048 threads to resident per SM each SM would have 8 blocks residing from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
You find numbers for the different Compute Capabilities and GPU Architectures here: https://en.wikipedia.org/wiki/CUDA#Limitations
You can download a occupancy calculation sheet from Nvidia Occupancy Calculation sheet (by Nvidia).