In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I need to load 4 contiguous doubles into 4 YMM registers, broadcasting each double across one register.
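The straightforward way is four scalar broadcasts, one per double; a minimal sketch of that starting point (variable names are assumed, not from the original code):
__m256d b0000 = _mm256_broadcast_sd(&b[4*k+0]);   // { b0, b0, b0, b0 }
__m256d b1111 = _mm256_broadcast_sd(&b[4*k+1]);   // { b1, b1, b1, b1 }
__m256d b2222 = _mm256_broadcast_sd(&b[4*k+2]);   // { b2, b2, b2, b2 }
__m256d b3333 = _mm256_broadcast_sd(&b[4*k+3]);   // { b3, b3, b3, b3 }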
Here is a variant built upon Z Boson's original answer (before edit), using two 128-bit loads instead of one 256-bit load.
__m256d b01   = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+0])); // low lane = { b0, b1 }, upper lane undefined
__m256d b23   = _mm256_castpd128_pd256(_mm_load_pd(&b[4*k+2])); // low lane = { b2, b3 }, upper lane undefined
__m256d b0101 = _mm256_permute2f128_pd(b01, b01, 0);            // { b0, b1, b0, b1 }
__m256d b2323 = _mm256_permute2f128_pd(b23, b23, 0);            // { b2, b3, b2, b3 }
__m256d b0000 = _mm256_permute_pd(b0101, 0);                    // { b0, b0, b0, b0 }
__m256d b1111 = _mm256_permute_pd(b0101, 0xf);                  // { b1, b1, b1, b1 }
__m256d b2222 = _mm256_permute_pd(b2323, 0);                    // { b2, b2, b2, b2 }
__m256d b3333 = _mm256_permute_pd(b2323, 0xf);                  // { b3, b3, b3, b3 }
In my case this is slightly faster than using one 256-bit load, possibly because the first permute can start before the second 128-bit load completes.
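For comparison, the single 256-bit-load version (roughly what the original answer this builds on does; shown here only as a sketch) is:
__m256d b0123 = _mm256_load_pd(&b[4*k]);                    // { b0, b1, b2, b3 }
__m256d b0101 = _mm256_permute2f128_pd(b0123, b0123, 0x00); // duplicate low lane: { b0, b1, b0, b1 }
__m256d b2323 = _mm256_permute2f128_pd(b0123, b0123, 0x11); // duplicate high lane: { b2, b3, b2, b3 }
followed by the same four _mm256_permute_pd() calls as above.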
Edit: gcc compiles the two 128-bit loads and the first two permutes into
vmovapd (%rdi),%xmm8
vmovapd 0x10(%rdi),%xmm4
vperm2f128 $0x0,%ymm8,%ymm8,%ymm1
vperm2f128 $0x0,%ymm4,%ymm4,%ymm2
Paul R's suggestion of using _mm256_broadcast_pd() can be written as:
__m256d b0101 = _mm256_broadcast_pd((__m128d*)&b[4*k+0]);   // { b0, b1, b0, b1 }
__m256d b2323 = _mm256_broadcast_pd((__m128d*)&b[4*k+2]);   // { b2, b3, b2, b3 }
which compiles into
vbroadcastf128 (%rdi),%ymm6
vbroadcastf128 0x10(%rdi),%ymm11
and is faster than the two vmovapd + vperm2f128 pairs shown above (tested).
In my code, which is bound by the vector execution ports rather than by L1 cache accesses, this is still slightly slower than 4 _mm256_broadcast_sd() calls, but I imagine that code constrained by L1 bandwidth can benefit greatly from this.
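Putting the pieces together, the broadcast-based variant is just the two 128-bit broadcasts followed by the same in-lane permutes as in the first snippet. A small self-contained helper along these lines (the function name and signature are mine, not from the original code) illustrates it:
#include <immintrin.h>

// Broadcast 4 contiguous doubles b[0], b[1], b[2], b[3] into four YMM
// registers using two vbroadcastf128 loads plus in-lane permutes.
static inline void broadcast4_pd(const double* b,
                                 __m256d& b0000, __m256d& b1111,
                                 __m256d& b2222, __m256d& b3333)
{
    __m256d b0101 = _mm256_broadcast_pd((const __m128d*)&b[0]); // { b0, b1, b0, b1 }
    __m256d b2323 = _mm256_broadcast_pd((const __m128d*)&b[2]); // { b2, b3, b2, b3 }
    b0000 = _mm256_permute_pd(b0101, 0);    // { b0, b0, b0, b0 }
    b1111 = _mm256_permute_pd(b0101, 0xf);  // { b1, b1, b1, b1 }
    b2222 = _mm256_permute_pd(b2323, 0);    // { b2, b2, b2, b2 }
    b3333 = _mm256_permute_pd(b2323, 0xf);  // { b3, b3, b3, b3 }
}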