I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use the intrinsic instruction
GCC's generic tuning splits unaligned 256-bit loads to help older processors. (Subsequent changes avoid splitting loads in generic tuning, I believe.)
You can tune for more recent Intel CPUs using something like -mtune=intel or -mtune=skylake, and you will get a single instruction, as intended.
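As a rough sketch of how you could check this yourself, here is a small function built around `_mm256_loadu_pd`, the unaligned 4-double load intrinsic. The function name `sum_unaligned` and the `target("avx")` attribute are my own additions for illustration (the attribute just lets the snippet build without `-mavx` on GCC/Clang); they are not from the question.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sums n doubles starting at p.
 * p does not need to be 32-byte aligned. */
__attribute__((target("avx")))
double sum_unaligned(const double *p, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;

    /* Main loop: one unaligned 256-bit load (4 doubles) per iteration. */
    for (; i + 4 <= n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(p + i));

    /* Spill the four lanes and add them up. */
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    double total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        total += p[i];

    return total;
}
```

Compiling with something like `gcc -O2 -mavx -mtune=skylake -S` and reading the generated assembly should show whether the load in the loop is a single 256-bit `vmovupd` or is split into two 128-bit halves. The exact output depends on your GCC version, so treat this as a way to inspect it rather than a guarantee.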