I did a test with this
for (i32 i = 0; i < 0x800000; ++i)
{
// Hopefully this can disable hardware prefetch
i32 k = (i * 997 & 0x7
If your computation chain is very short and if you're reading memory sequentially then the CPU will prefetch well on its own and actually work faster since its decoder has less work to do.
Streaming loads and stores are good only if you don't plan to access this memory in the near future. They are mainly aimed at uncached write back (WB) memory that's usually found when dealing with graphic surfaces. Explicit prefecthing may work well on one architecture (CPU model) and have a negative effect on other models so use them as a last resort option when optimizing.