Why do GCC and Clang not optimise a reciprocal to a single instruction when using fast-math?
Does anyone know why GCC/Clang will not optimise the function test1 in the code sample below to use just the RCPPS instruction when the fast-math option is enabled? Is there another compiler flag that would generate this code?

    typedef float float4 __attribute__((vector_size(16)));

    float4 test1(float4 v)
    {
        return 1.0f / v;
    }

You can see the compiled output here: https://goo.gl/jXsqat

Because the precision of RCPPS is a lot lower than that of float division: RCPPS is only an approximation with a maximum relative error of 1.5 * 2^-12 (roughly 12 bits), whereas a division result is correctly rounded to the full 24-bit significand. An option to enable that optimization would not be appropriate as part of -ffast-math.

The x86 target options of the gcc manual says there in fact