I am trying to understand the benefit of using SIMD vectorization and wrote a simple demonstrator code to see what the speed gain of an algorithm leveraging vectorization would be.
The biggest problem here is that you benchmarked with optimization disabled. GCC's default is -O0, a debug mode that keeps all variables in memory between C statements. That's generally useless for benchmarking and massively distorts your results, by introducing a store/reload into the dependency chain from the output of one iteration to the input of the next.
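The fix is to always benchmark an optimized build, e.g. something like this (the file name `simd_bench.c` is just a placeholder for your source file):

```shell
# Never benchmark -O0 builds. Enable optimization, and let the compiler
# target the SIMD instruction sets your own CPU supports:
gcc -O3 -march=native simd_bench.c -o simd_bench

# Or a more portable baseline, with GCC's auto-vectorizer enabled
# (-ftree-vectorize is on by default at -O3):
gcc -O2 -ftree-vectorize simd_bench.c -o simd_bench
```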
Using vector operations exploits SIMD parallelism in your program. But it does not speed up the sequential parts of your program, like the time it takes to load your program or to print to the screen. This limits the maximum speedup your program can attain. This is Amdahl's law.
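To put a number on that, here is Amdahl's law as a tiny helper (the function name is mine, not from the question): if a fraction `p` of the runtime is sped up by a factor `s`, the rest stays serial.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's law: if a fraction p of the total runtime is accelerated by a
 * factor s, the overall speedup is bounded by 1 / ((1 - p) + p / s).
 * Even with s -> infinity, the speedup can never exceed 1 / (1 - p). */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

So 4-wide SIMD applied to only half your runtime gives a 1.6x overall speedup, not 4x.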
In addition, your x86 processor exploits parallelism even in non-SIMD code: instruction-level parallelism (ILP). Intel's Haswell has four scalar-integer ALUs, so it can do 4 adds per clock if 4 add instructions have their inputs ready that cycle.
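This is why multiple independent accumulators matter even for scalar code. A minimal sketch (my own illustration, not the question's code): summing with four accumulators gives the out-of-order core four independent add chains to run in parallel, instead of serializing every add behind the previous one.

```c
#include <assert.h>
#include <stddef.h>

/* Four independent accumulators = four independent dependency chains,
 * so an out-of-order CPU with four scalar ALUs can overlap them. */
long sum4(const int *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];      /* each chain only depends on itself, */
        s1 += a[i + 1];  /* not on the other three            */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```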
Two of Haswell's execution ports have SIMD-integer execution units that can run paddd. But your loop only has one dependency chain for paddd, vs. four independent ones for add.
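For reference, `paddd` is what the SSE2 intrinsic `_mm_add_epi32` compiles to. A hedged sketch of such a loop (my own example; it assumes `n` is a multiple of 4): note that a single vector accumulator is still just one dependency chain, so it leaves one of the two SIMD add ports idle every cycle.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_add_epi32 compiles to paddd */
#include <stddef.h>

/* Sums n ints (n assumed to be a multiple of 4) with one paddd per four
 * elements. The single accumulator vsum forms one serial dependency
 * chain; unrolling with two or more vector accumulators would be needed
 * to keep both of Haswell's SIMD add ports busy. */
int sum_sse2(const int *a, size_t n) {
    __m128i vsum = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        vsum = _mm_add_epi32(vsum, v);  /* paddd: depends on previous vsum */
    }
    int tmp[4];
    _mm_storeu_si128((__m128i *)tmp, vsum);  /* horizontal sum at the end */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```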
Instruction-throughput bottlenecks are also a factor: the front-end can only supply up to 4 uops per clock. All the store/reload mov instructions mean the scalar version may be bumping into that bottleneck. With 2x mov-load + add + mov-store per statement, the front-end can only supply one group of 4 instructions (including just 1 add) per clock cycle. But the store-forwarding bottleneck lengthens the dependency chain from 1 cycle for add on its own to about 5 or 6 cycles for add + store/reload, so those dependency chains can still overlap.
So you are comparing execution time not for a sequential execution vs. a parallel execution, but for two parallel executions: one with scalar ILP and one with SIMD.
Anti-optimized debug-mode code is a huge bottleneck for your SIMD version, too. Really it's a bigger bottleneck there, because there's less other work to hide the latency the store/reload introduces. SIMD store/reload also has about a cycle higher latency than scalar-integer.
See https://stackoverflow.com/tags/x86/info and https://agner.org/optimize/ for more details. Also David Kanter's Haswell microarchitecture deep dive for some block diagrams of the CPU along with explanations.