Load address calculation when using AVX2 gather instructions

Looking at the AVX2 intrinsics documentation there are gathered load instructions such as VPGATHERDD:

__m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale);

What isn't clear to me from the documentation is whether the calculated load address is an element address or a byte address, i.e. is the load address for element i:

load_addr = base + index[i] * scale;               // (1) element addressing ?

or:

load_addr = (char *)base + index[i] * scale;       // (2) byte addressing ?

From the Intel docs it looks like it might be (2), but this doesn't make much sense given that the smallest element size for gathered loads is 32 bits - why would you want to load from misaligned addresses (i.e. use scale < 4) ?

Gather instructions do not have any alignment requirements. So it would be too restrictive not to allow byte addressing.

Another reason is consistency. With SIB addressing we obviously have a byte address:

MOV eax, [rcx + rdx * 2]

Since VPGATHERDD is just a vectorized variant of this MOV instruction, we should not expect anything different with VSIB addressing:

VPGATHERDD ymm0, [rcx + ymm2 * 2], ymm3

As for a real-life use of byte addressing, consider a 24-bit color image where each pixel starts at a multiple of 3 bytes. We could load 8 pixels with a single VPGATHERDD instruction, but only if the "scale" field in VSIB is 1 and VPGATHERDD uses byte addressing.
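To make the 24-bit pixel case concrete, here is a rough scalar model in plain C of what each of the 8 lanes would do with scale = 1 and indices that are byte offsets (3 * pixel number). It is only a sketch, not the intrinsic itself: `load8_rgb24` is a made-up helper name, and it assumes a little-endian host (as on x86) plus at least one readable byte after the last pixel, since every lane reads 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scalar model of an 8-lane dword gather over packed
   RGB data. Assumes little-endian and 1 byte of padding after the image. */
static void load8_rgb24(uint32_t out[8], const uint8_t *image, int32_t first)
{
    assert(first >= 0);
    for (int lane = 0; lane < 8; lane++) {
        int32_t index = (first + lane) * 3;   /* byte offset of the pixel */
        uint32_t dword;
        /* address = image + index * 1, i.e. byte addressing with scale 1 */
        memcpy(&dword, image + index, sizeof dword);
        out[lane] = dword & 0x00FFFFFFu;      /* drop the 4th byte (next pixel) */
    }
}
```

With element addressing, or with a scale fixed at 4, none of the pixels after the first could be reached this way.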

Judging by the description in Intel's AVX programming reference document, it looks like the gather instructions use byte addressing. Specifically, see the following quotes from the description of the VPGATHERDD instruction (on page 389):

DISP: optional 1, 2, 4 byte displacement;
DATA_ADDR = BASE_ADDR + (SignExtend(VINDEX[i+31:i])*SCALE) + DISP;

Since you can use 1/2/4 byte displacements, I would assume that the overall memory address is a byte address. While it may not be a common application, there could be cases where you would want to read a 32- or 64-bit value from a misaligned address. That's one of the more flexible things about the x86 architecture when compared to something like ARM; you have the flexibility to perform misaligned accesses if you want, instead of triggering a CPU exception as some others do.
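If it helps, the address formula can be written out as a scalar model in C. This is only a sketch under my reading of the docs (interpretation (2), byte addressing), not Intel's definition; `gather_lane_epi32` and `gather4_epi32` are names I made up, and the displacement is left out because the intrinsic does not expose it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one lane: address = (char *)base + index * scale. */
static int32_t gather_lane_epi32(const void *base, int32_t index, int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *addr =
        (const unsigned char *)base + (ptrdiff_t)index * scale;
    int32_t value;
    memcpy(&value, addr, sizeof value);  /* unaligned read, well-defined in C */
    return value;
}

/* Model of a 4-lane gather, the same shape as _mm_i32gather_epi32. */
static void gather4_epi32(int32_t out[4], const void *base,
                          const int32_t index[4], int scale)
{
    for (int i = 0; i < 4; i++)
        out[i] = gather_lane_epi32(base, index[i], scale);
}
```

In this model, scale = 4 with indices {0, 2, 4, 6} picks every other int, and scale = 1 with indices {0, 4, 8, 12} picks the first four ints, which is what you would expect if the scale simply converts the index to a byte offset.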

"why would you want to load from misaligned addresses (i.e. use scale < 4)?"

Misaligned loads are not the only use-case for scale < element size. You might just have indices that are pre-scaled byte offsets. Or consider vectorizing a loop over an array of pointers to structs: you could gather with a base "address" of zero or a small integer offset into the struct.
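As an illustration of pre-scaled byte offsets, here is a scalar sketch in C that pulls one field out of several 12-byte records. Since 12 is not an encodable scale factor, the indices are multiplied by the struct size up front and the gather would use scale = 1. `struct particle` and `gather4_count` are hypothetical names, and the loop only models what each gather lane would compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record; 12 bytes on typical ABIs. */
struct particle { int32_t id; float x; int32_t count; };

/* Scalar model: gather the `count` field of 4 records chosen by `which`. */
static void gather4_count(int32_t out[4], const struct particle *arr,
                          const int32_t which[4])
{
    /* The base points at the field inside record 0, not at the array start. */
    const unsigned char *base =
        (const unsigned char *)arr + offsetof(struct particle, count);
    for (int lane = 0; lane < 4; lane++) {
        assert(which[lane] >= 0);
        /* Pre-scaled index: already a byte offset, so scale = 1. */
        int32_t byte_offset = which[lane] * (int32_t)sizeof(struct particle);
        memcpy(&out[lane], base + byte_offset, sizeof out[lane]);
    }
}
```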

Supporting these use-cases is one reason for Intel to design the asm instruction this way, because gathers are supposed to help compilers auto-vectorize more code. It also fits naturally for the VSIB byte to be very close to SIB in the machine-code encoding, but they could easily have pre-biased the scale factor to give you a choice of scale = 4,8,16,32 (or 8,16,32,64 for qword gathers) with the 2-bit scale field.

Scale factors larger than the element size are not obviously useful in many cases either, and are easy to emulate with a single left-shift instruction before the gather. But it would be impossible to work around a baked-in scale factor, so allowing unscaled indices is clearly the more flexible design choice.

Other use-cases: gathering 16-bit elements. Use a 32-bit gather and mask off the top half of each element after the gather (or just leave it holding garbage). That would result in misaligned loads if any of your indices are odd (for a scale-factor of 2), so it could be slow if they cross 4k boundaries (unlike a true 16-bit gather).
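A scalar sketch of that 16-bit trick in C, modelling each lane of a dword gather with scale = 2 followed by a mask. `gather8_u16` is a made-up name; the sketch assumes a little-endian host and a table with at least one extra element of padding, because every lane reads 4 bytes even though only 2 are kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model: 8 lanes of "32-bit gather, scale 2, then keep the low half". */
static void gather8_u16(uint16_t out[8], const uint16_t *table,
                        const int32_t index[8])
{
    for (int lane = 0; lane < 8; lane++) {
        assert(index[lane] >= 0);
        /* address = (char *)table + index * 2; misaligned when index is odd */
        const unsigned char *addr =
            (const unsigned char *)table + (ptrdiff_t)index[lane] * 2;
        uint32_t dword;
        memcpy(&dword, addr, sizeof dword);      /* reads 2 bytes past the element */
        out[lane] = (uint16_t)(dword & 0xFFFFu); /* little-endian: low half is table[index] */
    }
}
```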

You could also imagine using a gather as part of a decompression function, where after some decoding you have a vector of offsets into a buffer, and you want arbitrary 4-byte or 8-byte windows of data.

Source: https://stackoverflow.com/questions/16193434/load-address-calculation-when-using-avx2-gather-instructions

Tags: x86, sse, simd, avx2