Slower SSE performance on large array sizes

Submitted by 馋奶兔 on 2019-12-04 15:23:01

For large input, the data is outside the cache, and the code is memory-bound.
For small input, the data fits inside the cache (i.e., L1/L2/L3 cache), and the code is compute-bound.
I assume you didn't flush the cache before the performance measurement.

The cache memory is inside the CPU, and the bandwidth between the cache memory and the ALU (or SSE) units is very high (high bandwidth means less time spent transferring data).
Your highest-level cache (i.e., L3) is about 4MB to 8MB (depending on your CPU model).
Larger amounts of data must be located in DDR SDRAM, which is external RAM (outside the CPU).
The CPU is connected to the DDR SDRAM by a memory bus, which has much lower bandwidth than the cache memory.

Example:
Assume your external RAM type is Dual Channel DDR3 SDRAM 1600. The maximum theoretical bandwidth between external RAM and CPU is about 25GB/Sec.

Reading 100MBytes of data (at 25GB/Sec) from RAM to the CPU takes about 100e6 / 25e9 = 4msec.
From my experience, the utilized bandwidth is about half of the theoretical bandwidth, so the reading time is about 8msec.

The computation time is shorter:
Assume each iteration of your loop takes about 2 CPU clocks (just an example).
Each iteration processes 16 bytes of data.
Processing 100MB therefore takes about (100e6 / 16)*2 = 12,500,000 clocks.
Assume the CPU frequency is 3GHz.
Total SSE processing time is about 12,500,000 / 3e9 ≈ 4.2msec.

As you can see, reading the data from external RAM takes about twice as long as the SSE computation.

Since the data transfer and computation occur in parallel, the total time is the maximum of 4.2msec and 8msec (i.e., 8msec).

Let's assume a loop without SSE takes twice as much computation time, so without SSE the computation time is about 8.4msec.

In the above example, the total improvement from using SSE is only about 0.4msec (8.4msec vs 8msec).

Note: The selected numbers are just for example purposes.


Benchmarks:
I did some benchmarks on my system.
I am using Windows 10 and Visual Studio 2010.
Benchmark test: Summing 100MBytes of data (summing 25*1024^2 32-bit integers).

CPU

  • Intel Core i5 3550 (Ivy Bridge).
  • CPU Base frequency is 3.3GHz.
  • Actual Core Speed during the test: 3.6GHz (Turbo boost is enabled).
  • L1 data cache size: 32KBytes.
  • L2 cache size: 256KBytes (single core L2 cache size).
  • L3 cache size: 6MBytes.

Memory:

  • 8GB DDR3 Dual channel.
  • RAM frequency: 666MHz (1333MT/s effective, since DDR transfers on both clock edges).
  • Memory theoretical maximum bandwidth: (128*1333/8) / 1024 = 20.8GBytes/Sec.

  1. Sum 100MB as large chunk with SSE (data in external RAM).
    Processing time: 6.22msec
  2. Sum a 1KB buffer repeatedly (100MB in total) with SSE (data inside cache).
    Processing time: 3.86msec
  3. Sum 100MB as large chunk without SSE (data in external RAM).
    Processing time: 8.1msec
  4. Sum a 1KB buffer repeatedly (100MB in total) without SSE (data inside cache).
    Processing time: 4.73msec

Utilized memory bandwidth: 100/6.22 = 16GB/Sec (dividing data size by time).
Average clocks per iteration with SSE (data in cache): (3.6e9*3.86e-3)/(25/4*1024^2) = 2.1 clks/iteration (dividing total CPU clocks by number of iterations).
