I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) be run in parallel, to compare with 2 or 4 multi-core
I realize this is a bit late, but I figured I'd help out anyway. From page 10 of the CUDA Fermi architecture whitepaper:
Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
To me, this means that each SM can have 2*32=64 threads running concurrently (two warps of 32 threads each). The GTX 580 has 16 SMs, so the total would be 16*64=1024, but I don't know whether the GPU can actually have that many threads running concurrently.
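
For what it's worth, here is the arithmetic above as a small Python sketch. The constant and function names are made up for illustration, and the figures (32 threads per warp, 2 warp schedulers per SM, 16 SMs) are assumptions taken from the quote and from the GTX 580's specs. It only counts threads whose warps can be issued at the same time; it does not model how those warps map onto the execution units, so treat the result as an optimistic upper bound rather than a measurement.

```python
# Rough upper-bound estimate, not a measurement.
# All figures are assumptions for a GeForce GTX 580 (Fermi).
WARP_SIZE = 32               # threads per warp
WARP_SCHEDULERS_PER_SM = 2   # "two warp schedulers" per the whitepaper quote
SM_COUNT = 16                # assumed number of SMs on the GTX 580


def threads_issued_per_sm(schedulers=WARP_SCHEDULERS_PER_SM, warp_size=WARP_SIZE):
    """Threads belonging to the warps one SM can issue concurrently."""
    return schedulers * warp_size


def threads_issued_per_gpu(sm_count=SM_COUNT):
    """Same count summed over every SM on the chip."""
    return sm_count * threads_issued_per_sm()


if __name__ == "__main__":
    print("per SM: ", threads_issued_per_sm())
    print("per GPU:", threads_issued_per_gpu())
```

To try a different card, pass its SM count to `threads_issued_per_gpu`, or change the constants at the top.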