I am investigating the effect of vectorization on the performance of a program. To that end, I have written the following code:
#include
#in
EDIT: I have modified this answer substantially. Please disregard most of what I wrote earlier about Mystical's answer not being entirely correct. However, I still do not agree that the code is bottlenecked by memory: despite running a very wide variety of tests, I could not see any sign of the original code being bound by memory speed, while it kept showing clear signs of being CPU-bound.
There can be many reasons, and since the reason[s] can be very hardware-dependent, I decided I should not speculate based on guesses. I will just outline what I encountered during later testing, in which I used a much more accurate and reliable method of measuring CPU time and repeated the loop 1000 times. I believe this information could be of help, but please take it with a grain of salt, as it is hardware-dependent.
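For readers who want to try a similar measurement, here is a minimal sketch of one way to time a loop by CPU time and average over many passes. It is not the harness used for the numbers in this answer: `cpu_seconds`, `sum_array`, and `time_kernel` are hypothetical names, the summation kernel is only a stand-in for the loop under test, and a POSIX system providing `clock_gettime` with `CLOCK_PROCESS_CPUTIME_ID` is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* CPU time consumed by this process so far, in seconds (POSIX only). */
static double cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in kernel; replace it with the loop being studied. */
static float sum_array(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Runs the kernel `repeats` times and returns the average CPU seconds per
   pass. The result of the last pass is stored in *last_sum. */
static double time_kernel(const float *data, size_t n, int repeats,
                          float *last_sum)
{
    assert(repeats > 0);
    /* Calling through a volatile pointer forces a real call on every pass,
       so the optimizer cannot compute the sum once and reuse it. */
    float (*volatile kernel)(const float *, size_t) = sum_array;
    float sum = 0.0f;
    double start = cpu_seconds();
    for (int r = 0; r < repeats; ++r)
        sum = kernel(data, n);
    double elapsed = cpu_seconds() - start;
    *last_sum = sum;
    return elapsed / repeats;
}
```

Calling `time_kernel(data, n, 1000, &s)` averages over 1000 passes, which smooths out timer granularity and one-off effects such as the first pass warming the caches. The indirect call keeps the kernel from being inlined, so for very short loops the call overhead becomes part of the measurement.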
Regarding Mystical's example of running nearly 1 iteration per clock: I did not expect the CPU scheduler to be that efficient and had assumed 1 iteration every 1.5-2 clock ticks. To my surprise, that is not the case; I was wrong, sorry about that. My own CPU ran it even more efficiently, at 1.048 cycles/iteration. So I can confirm that this part of Mystical's answer is definitely right.
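For reference, a cycles-per-iteration figure like this can be derived from a CPU-time measurement with simple arithmetic. The sketch below shows that conversion; `cycles_per_iteration` is a hypothetical helper, and it assumes the core runs at a known, fixed clock frequency (no turbo boost or frequency scaling), which is often not true on real machines.

```c
#include <assert.h>

/* cycles per iteration = CPU seconds * clock frequency (Hz) / iterations.
   Assumes a fixed, known core clock; this is an assumption, not a given. */
static double cycles_per_iteration(double cpu_secs, double clock_hz,
                                   double total_iterations)
{
    assert(total_iterations > 0.0);
    return cpu_secs * clock_hz / total_iterations;
}
```

As an illustration with made-up numbers (not my measurements): 0.5 s of CPU time on a 2.0 GHz core for 10^9 iterations works out to 1.0 cycles per iteration.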