Is this code
float a = ...;
__m128 b = _mm_broadcast_ss(&a);
always faster than this code?
float a = ...;
__m128 b = _mm_set1_ps(a);
_mm_broadcast_ss is likely to be faster than _mm_set1_ps. The former translates into a single instruction (VBROADCASTSS), while the latter, when compiled for plain SSE, is emulated using multiple instructions (probably a MOVSS followed by a shuffle). However, _mm_broadcast_ss requires the AVX instruction set, while only SSE is required for _mm_set1_ps.
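
If you need the code to build both with and without AVX, one option is to pick the intrinsic at compile time. The following is only a sketch: splat4 is a made-up helper name, and it assumes a GCC/Clang/ICC-style compiler that defines __AVX__ when AVX code generation is enabled (for example with -mavx).

```c
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* Hypothetical helper: copy the float at *p into all four lanes of an __m128. */
static inline __m128 splat4(const float *p)
{
#ifdef __AVX__
    /* AVX build: a single VBROADCASTSS with a memory operand. */
    return _mm_broadcast_ss(p);
#else
    /* SSE-only build: the compiler chooses the sequence,
       typically a MOVSS load followed by a shuffle. */
    return _mm_set1_ps(*p);
#endif
}
```

Both branches return the same value, so callers do not need to care which one was compiled in. It is worth checking the generated assembly before relying on any speed difference, since the compiler is free to choose its own instruction sequence for _mm_set1_ps.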