Is this code
float a = ...; __m128 b = _mm_broadcast_ss(&a);
always faster than this code?
float a = ...; __m128 b = _mm_set1_ps(a);
If you target the AVX instruction set, GCC will use VBROADCASTSS to implement the _mm_set1_ps intrinsic. Clang, however, will use two instructions (VMOVSS + VPSHUFD).
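For anyone who wants to reproduce this, here is a minimal sketch of the two variants as separate functions, so the generated assembly can be compared side by side (for example with `-O2 -S`). The helper names `splat_broadcast` and `splat_set1` are made up for illustration, and the `target("avx")` attribute is a GCC/Clang extension assumed here so the file builds without passing `-mavx` globally. Note that `_mm_broadcast_ss` takes a pointer, so it always starts from a value in memory, while `_mm_set1_ps` takes the float by value.

```c
#include <immintrin.h>

/* Hypothetical helper: broadcast one float from memory to all four lanes.
   Requires AVX at run time; the attribute enables AVX for this function only. */
__attribute__((target("avx")))
void splat_broadcast(const float *src, float out[4])
{
    __m128 v = _mm_broadcast_ss(src);  /* load-and-broadcast intrinsic */
    _mm_storeu_ps(out, v);             /* unaligned store of the 4 lanes */
}

/* Hypothetical helper: same result via _mm_set1_ps, which takes the
   value itself and leaves the instruction choice to the compiler. */
__attribute__((target("avx")))
void splat_set1(float a, float out[4])
{
    __m128 v = _mm_set1_ps(a);
    _mm_storeu_ps(out, v);
}
```

Both functions should fill `out` with four copies of the input; any difference would be in the instructions each compiler emits, not in the result.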