Here is a C++ code:
#define ARR_SIZE_TEST ( 8 * 1024 * 1024 )
void cpp_tst_add( unsigned* x, unsigned* y )
{
for ( register int i = 0; i < ARR_SIZE_T
Although you're limited by latency to main-memory in this case it's not exactly obvious that the NEON version would be slower than the ASM version.
Using the cycle calculator here:
http://pulsar.webshaker.net/ccc/result.php?lng=en
Your code should take 7 cycles before the cache miss penalties. It's slower than you may expect because you're using unaligned loads and because of latency between the add and the store.
Meanwhile, the compiler generated loop takes 6 cycles (it's not very well scheduled or optimized in general either). But it's doing one fourth as much work.
The cycle counts from the script might not be perfect, but I don't see anything that looks blatantly wrong with it so I think they'd at least be close. There's potential for taking an extra cycle on the branch if you max out fetch bandwidth (also if the loops aren't 64-bit aligned), but in this case there are plenty of stalls to hide that.
The answer isn't that integer on Cortex-A8 has more opportunities to hide latency. In fact, it normally has less, because of NEON's staggered pipeline and issue queue. Of course, this is only true on Cortex-A8 - on Cortex-A9 the situation may well be reversed (NEON is dispatched in-order and in parallel with integer, while integer has out-of-order capabilities). Since you tagged this Cortex-A8 I'm assuming that's what you're using.
This begs more investigation. Here are some ideas why this could be happening:
You asked what good NEON is in cases like this - in reality, NEON is especially good for these cases where you're streaming to/from memory. The trick is that you need to use preloading in order to hide the main memory latency as much as possible. Preload will get memory into L2 (not L1) cache ahead of time. Here NEON has a big advantage over integer because it can hide a lot of the L2 cache latency, due to its staggered pipeline and issue queue but also because it has a direct path to it. I expect you see effective L2 latency down to 0-6 cycles and less if you have less dependencies and don't exhaust the load queue, while on integer you can be stuck with a good ~16 cycles that you can't avoid (probably depends on the Cortex-A8 though).
So I would recommend that you align your arrays to cache-line size (64 bytes), unroll your loops to do at least one cache-line at a time, use aligned loads/stores (put :128 after the address) and add a pld instruction that loads several cache-lines away. As for how many lines away: start small and keep increasing it until you no longer see any benefit.