The dot product of two arrays
for(int i=0; i<n; i++) sum += a[i]*b[i];
does not reuse data, so it should be a memory-bound operation.
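To make the memory-bound claim concrete, here is a minimal single-threaded sketch of how a bandwidth figure can be derived from a dot product. It is not the code from my benchmark: the function names (`dot`, `dot_bytes`, `bandwidth_gbs`, `measure_dot_bandwidth`) are hypothetical, the arrays are assumed to hold `float`, and 1 GB is assumed to mean 1e9 bytes. It times a single pass; a real benchmark would repeat the pass several times and use arrays much larger than the last-level cache.

```c
#include <stddef.h>
#include <time.h>

/* Plain dot product: every element of a and b is read once and never reused. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Bytes read by one pass of dot(): two arrays of n floats (assumed element type). */
double dot_bytes(size_t n)
{
    return 2.0 * (double)n * (double)sizeof(float);
}

/* Effective bandwidth in GB/s, assuming 1 GB = 1e9 bytes. */
double bandwidth_gbs(double bytes, double seconds)
{
    return bytes / seconds / 1e9;
}

/* Time one pass of dot() with the C11 wall clock and return GB/s.
   The sum is handed back through sum_out so the compiler cannot drop the loop. */
double measure_dot_bandwidth(const float *a, const float *b, size_t n, float *sum_out)
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    *sum_out = dot(a, b, n);
    timespec_get(&t1, TIME_UTC);
    double seconds = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return bandwidth_gbs(dot_bytes(n), seconds);
}
```

Calling `measure_dot_bandwidth(a, b, n, &sum)` on two large, already-initialized arrays gives a number that can be compared with the copy or triad rates from a STREAM-style benchmark. For very small `n` the elapsed time can fall below the timer resolution, so the result is only meaningful for large arrays.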
I wrote my own memory benchmark code: https://github.com/zboson/bandwidth
Here are the current results for eight threads:
write: 0.5 GB, time 2.96e-01 s, 18.11 GB/s
copy: 1 GB, time 4.50e-01 s, 23.85 GB/s
scale: 1 GB, time 4.50e-01 s, 23.85 GB/s
add: 1.5 GB, time 6.59e-01 s, 24.45 GB/s
mul: 1.5 GB, time 6.56e-01 s, 24.57 GB/s
triad: 1.5 GB, time 6.61e-01 s, 24.37 GB/s
vsum: 0.5 GB, time 1.49e-01 s, 36.09 GB/s, sum -8.986818e+03
vmul: 0.5 GB, time 9.00e-05 s, 59635.10 GB/s, sum 0.000000e+00
vmul_sum: 1 GB, time 3.25e-01 s, 33.06 GB/s, sum 1.910421e+04
Here are the current results for one thread:
write: 0.5 GB, time 4.65e-01 s, 11.54 GB/s
copy: 1 GB, time 7.51e-01 s, 14.30 GB/s
scale: 1 GB, time 7.45e-01 s, 14.41 GB/s
add: 1.5 GB, time 1.02e+00 s, 15.80 GB/s
mul: 1.5 GB, time 1.07e+00 s, 15.08 GB/s
triad: 1.5 GB, time 1.02e+00 s, 15.76 GB/s
vsum: 0.5 GB, time 2.78e-01 s, 19.29 GB/s, sum -8.990941e+03
vmul: 0.5 GB, time 1.15e-05 s, 468719.08 GB/s, sum 0.000000e+00
vmul_sum: 1 GB, time 5.72e-01 s, 18.78 GB/s, sum 1.910549e+04
The kernels are:
write: comparable to memset
mul: a(i) = b(i) * c(i)
vsum: sum += a(i)
vmul: sum *= a(i)
vmul_sum: sum += a(i)*b(i) // the dot product

My results are consistent with STREAM. I get the highest bandwidth for vsum. The vmul method does not work currently (once the value is zero it finishes early), so its numbers above are not meaningful. I can get slightly better results (by about 10%) using intrinsics and unrolling the loop, which I will add later.
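As an illustration of the unrolling idea only, here is a plain C sketch of a dot product unrolled by four with independent partial sums. It is not the version from my benchmark and uses no intrinsics; the name `dot_unrolled4` and the `float` element type are assumptions. The independent accumulators break the dependency chain on a single `sum`, which gives the compiler and the CPU more freedom to overlap loads and adds. Because the additions happen in a different order, the result can differ from the simple loop in the last bits.

```c
#include <stddef.h>

/* Dot product unrolled by four, with four independent partial sums.
   Illustrative sketch; the unroll factor of four is an arbitrary choice. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums, then handle the remaining 0 to 3 elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Whether this helps, and by how much, depends on the compiler, its flags, and the hardware, so it should be measured rather than assumed.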