I would suggest using 128-bit AVX instructions whenever possible. It avoids the latency of one cross-lane shuffle (2 cycles on Intel Sandy/Ivy Bridge) and improves efficiency on CPUs that execute AVX instructions on 128-bit execution units (currently AMD Bulldozer, Piledriver, Steamroller, and Jaguar):
static inline float _mm256_reduce_add_ps(__m256 x) {
    /* ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 x128 = _mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x));
    /* ( -, -, x1+x3+x5+x7, x0+x2+x4+x6 ) */
    const __m128 x64 = _mm_add_ps(x128, _mm_movehl_ps(x128, x128));
    /* ( -, -, -, x0+x1+x2+x3+x4+x5+x6+x7 ) */
    const __m128 x32 = _mm_add_ss(x64, _mm_shuffle_ps(x64, x64, 0x55));
    /* Conversion to float is a no-op on x86-64 */
    return _mm_cvtss_f32(x32);
}