Given that only one block can be executed on each SM at a time,
This statement is fundamentally incorrect. Provided the kernel's grid contains enough threadblocks, and the per-block resource usage (registers, shared memory, threads) permits it, an SM will generally have multiple threadblocks assigned to it at the same time.
The basic unit of execution is the warp. A warp consists of 32 threads, executed together in lockstep by an SM, on an instruction-cycle by instruction-cycle basis.
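Any threadblock with more than 32 threads comprises multiple warps, so even with a single threadblock an SM will generally have more than one warp "in flight". This is essential for good performance, because it lets the machine hide latency: when one warp stalls, another warp can be issued.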
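As a small illustration of the warp arithmetic, here is a minimal sketch that counts the warps in one threadblock. The names `kWarpSize` and `warps_per_block` are made up for this example and are not part of any CUDA API.

```cpp
// Threads per warp, as stated above.
constexpr int kWarpSize = 32;

// Number of warps needed to hold `threads_per_block` threads.
// The division rounds up: a partially filled last warp still
// occupies a full warp slot on the SM.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

With these definitions, `warps_per_block(512)` is 16, and `warps_per_block(100)` is 4 (the last warp has only 4 active threads).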
There is no conceptual difference between choosing warps from the same threadblock to execute, or warps from different threadblocks. SMs can have multiple threadblocks resident on them (i.e. with resources such as registers and shared memory assigned to each resident threadblock), and the warp scheduler will choose from amongst all the warps in all the resident threadblocks, to select the next warp for execution on any given instruction cycle.
Therefore, an SM can have more threads "resident" than the per-block maximum, because it can support more than one block at a time, even if each block is configured with the maximum number of threads (512, in this case). The per-SM thread count exceeds the threadblock limit simply because multiple threadblocks are resident at once.
You may also want to research the idea of occupancy in GPU programs.
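To make the resident-block reasoning and the occupancy idea concrete, here is a rough host-side sketch of the arithmetic. Everything in it is an assumption for illustration: the struct and function names are invented, the limits are placeholders to fill in for your GPU, and it ignores the allocation granularity that real hardware applies to registers and shared memory.

```cpp
#include <algorithm>

// Hypothetical per-SM limits. Real values depend on the GPU's compute
// capability; take them from the device properties or the programming
// guide rather than from this sketch.
struct SmLimits {
    int max_threads;   // maximum resident threads per SM
    int max_blocks;    // maximum resident threadblocks per SM
    int registers;     // registers available per SM
    int shared_bytes;  // shared memory available per SM, in bytes
};

// Per-threadblock resource usage of a kernel launch (illustrative).
struct BlockUsage {
    int threads;          // threads per block
    int regs_per_thread;  // registers used by each thread
    int shared_bytes;     // shared memory used by each block, in bytes
};

// Resident threadblocks per SM: the tightest of the four limits wins.
// Simplification: allocation granularity is ignored.
int resident_blocks(const SmLimits& sm, const BlockUsage& b) {
    int by_threads = sm.max_threads / b.threads;
    int by_regs = b.regs_per_thread > 0
                      ? sm.registers / (b.regs_per_thread * b.threads)
                      : sm.max_blocks;
    int by_smem = b.shared_bytes > 0 ? sm.shared_bytes / b.shared_bytes
                                     : sm.max_blocks;
    return std::min({sm.max_blocks, by_threads, by_regs, by_smem});
}

// Occupancy: resident warps divided by the maximum warps the SM can hold.
double occupancy(const SmLimits& sm, const BlockUsage& b) {
    const int warp_size = 32;
    int warps_per_block = (b.threads + warp_size - 1) / warp_size;
    int max_warps = sm.max_threads / warp_size;
    return static_cast<double>(resident_blocks(sm, b) * warps_per_block) /
           max_warps;
}
```

For example, with made-up limits `SmLimits{1536, 8, 32768, 49152}` and a block described by `BlockUsage{512, 20, 4096}`, `resident_blocks` gives 3, so 1536 threads are resident even though each block holds only 512. Raising `regs_per_thread` to 32 drops the result to 2 blocks. For real numbers on real hardware, the CUDA runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the resident-block count for a given kernel and block size.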