Does anyone have information about the L2 cache in Fermi? I have heard that it is as slow as global memory, and that the only purpose of L2 is to increase the effective memory bandwidth. Bu
It is not as slow as global memory. I don't have a source that says so explicitly, but the CUDA programming guide says "A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise." For that sentence to make any sense, the two throughputs have to be different. And why would NVIDIA add a cache that is no faster than global memory? It would be worse on average, because every cache miss would pay for the cache lookup on top of the global memory access.
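To make the last point concrete, here is a rough back-of-the-envelope model, written in terms of per-access cost rather than throughput. The cycle counts and the hit rate are made-up placeholders, not measured Fermi numbers, and the function name is only for illustration. It assumes a miss pays for the L2 lookup first and then for the trip to device memory.

```python
def average_access_cycles(hit_rate, l2_cycles, dram_cycles):
    """Expected cost of one access that goes through L2.

    Assumption: a hit costs l2_cycles, a miss costs l2_cycles + dram_cycles
    (the lookup is paid first, then the device memory access).
    """
    miss_rate = 1.0 - hit_rate
    return hit_rate * l2_cycles + miss_rate * (l2_cycles + dram_cycles)


# Placeholder numbers, NOT measured Fermi latencies.
DRAM_CYCLES = 400
HIT_RATE = 0.75

# Hypothetical L2 that is exactly as slow as global memory.
slow_l2 = average_access_cycles(HIT_RATE, l2_cycles=DRAM_CYCLES, dram_cycles=DRAM_CYCLES)

# Hypothetical L2 that is faster than global memory.
fast_l2 = average_access_cycles(HIT_RATE, l2_cycles=100, dram_cycles=DRAM_CYCLES)

print("no cache at all:             ", DRAM_CYCLES)
print("L2 as slow as global memory: ", slow_l2)
print("L2 faster than global memory:", fast_l2)
```

With these placeholder values the "slow" L2 comes out more expensive than having no cache at all, while the faster L2 comes out cheaper. The exact numbers mean nothing; only the ordering matters.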
I don't know about the latency. The size of the L2 cache is 768KB and the line size is 128 bytes. Section F.4 of the CUDA programming guide has a few more details, especially sections F.4.1 and F.4.2. The guide is available here: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
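For a sense of scale, the two figures above can be combined with a bit of arithmetic. This is only a sketch: the constant names are mine, and it assumes that 768KB means 768 * 1024 bytes and that each thread of a 32-thread warp reads one 4-byte word from consecutive, 128-byte-aligned addresses.

```python
L2_SIZE_BYTES = 768 * 1024   # 768KB, assuming binary kilobytes
LINE_SIZE_BYTES = 128        # cache line size quoted above
WARP_SIZE = 32               # threads per warp
WORD_BYTES = 4               # one float or int per thread (assumption)

# How many cache lines fit in the whole L2.
lines_in_l2 = L2_SIZE_BYTES // LINE_SIZE_BYTES

# Bytes requested by one warp when every thread loads one word.
bytes_per_warp_load = WARP_SIZE * WORD_BYTES

# Cache lines touched by that load if it is contiguous and aligned
# (ceiling division, so a partial line still counts as one line).
lines_per_warp_load = -(-bytes_per_warp_load // LINE_SIZE_BYTES)

print("cache lines in L2:", lines_in_l2)
print("bytes per warp load:", bytes_per_warp_load)
print("cache lines per warp load:", lines_per_warp_load)
```

Under those assumptions the L2 holds 6144 lines, and a contiguous, aligned 4-byte load from one warp covers exactly one of them.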