> I want to calculate y = ax + b, where x and y are pixel values [i.e., bytes with value range 0~255], while `a` and `b` are floats.
`a` and `b` are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern, or unless you can generate them in vectors.
Is there any way you can efficiently generate `a` and `b` in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16-bit integers, might be worse than just scalar ops.
If `a` and `b` can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math (i.e. integers that represent value * 2^scale). SSE/AVX don't have an 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32-bit. This means you can process twice as much data per instruction.
There's a `_mm_maddubs_epi16` instruction which might be useful if `b` and `a` change infrequently enough, or if you can easily generate a vector with alternating `a`*2^4 and `b`*2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an `a` and `b` vector.
    float a, b;
    const int logascale = 4, logbscale = 1;
    const int ascale = 1<<logascale;  // fixed-point scale for a: 2^4
    const int bscale = 1<<logbscale;  // fixed-point scale for b: 2^1
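A sketch of how the rest of that loop could look (my own illustration, not drop-in code: it assumes one `a`/`b` pair applies to the whole buffer, so `_mm_set1_epi16` builds the alternating-byte vector, and that `n` is a multiple of 16 with the scalar cleanup loop left out; per-pixel `a`/`b` would be generated some other way and interleaved with `punpck`):

    // SSSE3 for _mm_maddubs_epi16
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    static void ax_plus_b_u8(uint8_t *buf, size_t n, float a, float b)
    {
        const int logascale = 4, logbscale = 1;
        int8_t a_fx = (int8_t)(a * (1 << logascale));   // a as signed byte, 4 fraction bits
        int8_t b_fx = (int8_t)(b * (1 << logbscale));   // b as signed byte, 1 fraction bit
        // alternating signed bytes {a*2^4, b*2^1, ...}; double-check integer promotion /
        // sign-extension if you build the 16-bit value this way
        __m128i abvec = _mm_set1_epi16((int16_t)(((uint8_t)b_fx << 8) | (uint8_t)a_fx));
        // interleave pixels with the constant 2^(logascale-logbscale) = 8,
        // so the b term also comes out scaled by 2^4
        __m128i brescale = _mm_set1_epi8(1 << (logascale - logbscale));

        for (size_t i = 0; i < n; i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)&buf[i]);
            __m128i lo = _mm_unpacklo_epi8(v, brescale);   // {v0, 8, v1, 8, ...}
            __m128i hi = _mm_unpackhi_epi8(v, brescale);
            lo = _mm_maddubs_epi16(lo, abvec);  // v*(a*16) + 8*(b*2) = 16*(a*v + b), saturating to int16
            hi = _mm_maddubs_epi16(hi, abvec);  // (extreme a with bright pixels can clip here)
            lo = _mm_srai_epi16(lo, logascale); // back to integer pixel values
            hi = _mm_srai_epi16(hi, logascale);
            _mm_storeu_si128((__m128i *)&buf[i], _mm_packuswb(lo, hi));  // clamp to [0..255]
        }
    }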
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of `a`, and 4 fraction bits. So it effectively rounds `a` to the nearest 16th, and overflows if it falls outside [-8 .. +7 and 15/16ths]. 2^6 would give more fractional bits, and allow `a` from -2 to +1 and 63/64ths.
Since `b` is being added, not multiplied, its useful range is much larger, and its fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.
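If you want to sanity-check those ranges, the arithmetic is just this (throwaway snippet, nothing SSE-specific): a signed 8-bit value scaled by 2^s covers [-128/2^s .. 127/2^s] in steps of 1/2^s.

    #include <stdio.h>

    int main(void)
    {
        for (int s = 1; s <= 6; s++)
            printf("2^%d: [%g .. %g], step %g\n",
                   s, -128.0 / (1 << s), 127.0 / (1 << s), 1.0 / (1 << s));
        // 2^4 -> [-8 .. 7.9375], 2^1 -> [-64 .. 63.5], 2^6 -> [-2 .. 1.984375]
        return 0;
    }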
For more precision, 16b fixed-point is a good choice. You can scale `a` and `b` up by 2^7 or something, to have 7 bits of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:
- `_mm_mulhi_epu16`: unsigned 16b*16b -> high16 (bits [31:16]). Useful if `a` can't be negative.
- `_mm_mulhi_epi16`: signed 16b*16b -> high16 (bits [31:16]).
- `_mm_mulhrs_epi16`: signed 16b*16b -> bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for `a`, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
- `_mm_mullo_epi16`: signed 16b*16b -> low16 (bits [15:0]). This only allows 8 significant bits for `a` before the low16 result overflows, so I think all you gain over the `_mm_maddubs_epi16` 8-bit solution is more precision for `b`.

To use these, you'd get scaled 16b vectors of `a` and `b` values, then:
- unpack your bytes to 16b words (`pmovzx` byte->word), to get signed words still in the [0..255] range
- optionally left-shift those words, and scale `a` and `b` up to match, to get more fractional precision for `a`
- multiply by your `a` vector of 16b words, taking the upper half of each 16*16->32 result (e.g. `mulhi` or `mulhrs`)
- add your `b` vector to that.

With a good choice of fixed-point scale, this should be able to handle a wider range of `a` and `b`, as well as more fractional precision, than 8-bit fixed point.
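Here's a sketch of a 16-bit version along those lines. The exact scale factors are my choice for illustration (pixels left-shifted by 7, `a` scaled by 2^12 so that `_mm_mulhrs_epi16`'s take-bits-[30:15]-with-rounding lands on a product with 4 fraction bits, `b` scaled by 2^4), and it again assumes one `a`/`b` pair for the whole buffer and `n` a multiple of 16:

    #include <immintrin.h>   // SSSE3 for _mm_mulhrs_epi16
    #include <stdint.h>
    #include <stddef.h>

    static void ax_plus_b_u8_16bit(uint8_t *buf, size_t n, float a, float b)
    {
        __m128i avec = _mm_set1_epi16((int16_t)(a * (1 << 12)));  // |a| < 8, 12 fraction bits
        __m128i bvec = _mm_set1_epi16((int16_t)(b * (1 << 4)));   // |b| < 2048, 4 fraction bits
        const __m128i zero = _mm_setzero_si128();

        for (size_t i = 0; i < n; i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)&buf[i]);
            __m128i lo = _mm_unpacklo_epi8(v, zero);   // zero-extend bytes to words
            __m128i hi = _mm_unpackhi_epi8(v, zero);
            lo = _mm_slli_epi16(lo, 7);           // x << 7, so the high-half multiply keeps precision
            hi = _mm_slli_epi16(hi, 7);
            lo = _mm_mulhrs_epi16(lo, avec);      // round((x*2^7 * a*2^12) / 2^15) = round(16*a*x)
            hi = _mm_mulhrs_epi16(hi, avec);
            lo = _mm_adds_epi16(lo, bvec);        // + 16*b, with signed saturation
            hi = _mm_adds_epi16(hi, bvec);
            lo = _mm_srai_epi16(lo, 4);           // drop the 4 fraction bits
            hi = _mm_srai_epi16(hi, 4);
            _mm_storeu_si128((__m128i *)&buf[i], _mm_packuswb(lo, hi));  // clamp to [0..255]
        }
    }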
If you don't left-shift your bytes after unpacking them to words, `a` has to be full-range just to get 8 bits set in the high16 of the result. This would mean a very limited range of `a` that you could support without truncating your temporary to less than 8 bits during the multiply. Even `_mm_mulhrs_epi16` doesn't leave much room, since it starts at bit 30.
If you can't efficiently generate fixed-point `a` and `b` values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating `a` and `b` with fixed point.
For packed-float to work, you still have to efficiently build a vector of `a` values for 4 adjacent pixels.
This is a good use-case for `pmovzx` (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 `punpck[l/h]bw`/`punpck[l/h]wd` with multiple steps, or SSSE3 `pshufb` to emulate `pmovzx`. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)
    char *buf;
    // const __m128i zero = _mm_setzero_si128();
    for (i = 0; i < n; i += 16) {
        // unpack 16 pixels to four vectors of 4 floats, do y = a*x + b,
        // convert back to int, and re-pack (see the sketch below)
    }
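Filled out, that loop body could look something like this (my sketch, assuming SSE4.1 and FMA3; the uniform `a`/`b` and the helper name are mine, and per-pixel `a`/`b` would load packed floats instead of using `_mm_set1_ps`). Without FMA, replace `_mm_fmadd_ps` with `_mm_mul_ps` + `_mm_add_ps`:

    #include <immintrin.h>   // SSE4.1 pmovzx, FMA3
    #include <stdint.h>
    #include <stddef.h>

    static inline __m128i axb_dwords(__m128i x, __m128 av, __m128 bv)
    {
        __m128 f = _mm_cvtepi32_ps(x);          // int32 -> float
        f = _mm_fmadd_ps(f, av, bv);            // y = a*x + b
        return _mm_cvtps_epi32(f);              // round back to int32
    }

    static void ax_plus_b_u8_float(uint8_t *buf, size_t n, float a, float b)
    {
        const __m128 av = _mm_set1_ps(a), bv = _mm_set1_ps(b);
        for (size_t i = 0; i < n; i += 16) {
            __m128i block = _mm_loadu_si128((const __m128i *)&buf[i]);
            // pmovzxbd each group of 4 bytes up to dwords (one load, shifted 4 ways)
            __m128i d0 = axb_dwords(_mm_cvtepu8_epi32(block), av, bv);
            __m128i d1 = axb_dwords(_mm_cvtepu8_epi32(_mm_srli_si128(block, 4)), av, bv);
            __m128i d2 = axb_dwords(_mm_cvtepu8_epi32(_mm_srli_si128(block, 8)), av, bv);
            __m128i d3 = axb_dwords(_mm_cvtepu8_epi32(_mm_srli_si128(block, 12)), av, bv);
            // stay in the signed domain until the final pack:
            __m128i w01 = _mm_packs_epi32(d0, d1);   // packssdw: dword -> word, signed saturation
            __m128i w23 = _mm_packs_epi32(d2, d3);
            _mm_storeu_si128((__m128i *)&buf[i], _mm_packus_epi16(w01, w23));  // -> [0..255]
        }
    }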
The previous version of this answer went from float->uint8 vectors with `packusdw`/`packuswb`, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. `packusdw` is only useful if your final goal is `uint16_t`, rather than further packing.
The last CPUs without SSE4.1 were Intel Conroe/Merom (first-gen Core 2, from before late 2007), and AMD pre-Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2 and a version for SSE4.1, or SSSE3 (with 4x `pshufb` to emulate `pmovzxbd` of the four 32b elements of a register). `pshufb` is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version for them. Actually, Conroe/Merom also has slow xmm `punpcklbw` and so on (except for q->dq), so 4x slow `pshufb` should still beat 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale CPUs anyway, because of the slow shuffles for unpacking and repacking. The fixed-point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
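For reference, the 4x `pshufb` emulation of `pmovzxbd` looks something like this (my sketch): a control byte with its high bit set makes `pshufb` write a zero, which gives you the zero-extension.

    #include <immintrin.h>   // SSSE3 _mm_shuffle_epi8

    // Split one 16-byte vector of pixels into four vectors of zero-extended 32-bit ints.
    static void u8x16_to_4x_i32x4(__m128i block, __m128i out[4])
    {
        const __m128i shuf0 = _mm_setr_epi8( 0,-1,-1,-1,  1,-1,-1,-1,  2,-1,-1,-1,  3,-1,-1,-1);
        const __m128i shuf1 = _mm_setr_epi8( 4,-1,-1,-1,  5,-1,-1,-1,  6,-1,-1,-1,  7,-1,-1,-1);
        const __m128i shuf2 = _mm_setr_epi8( 8,-1,-1,-1,  9,-1,-1,-1, 10,-1,-1,-1, 11,-1,-1,-1);
        const __m128i shuf3 = _mm_setr_epi8(12,-1,-1,-1, 13,-1,-1,-1, 14,-1,-1,-1, 15,-1,-1,-1);
        out[0] = _mm_shuffle_epi8(block, shuf0);   // dwords { b0, b1, b2, b3 }
        out[1] = _mm_shuffle_epi8(block, shuf1);   // dwords { b4, b5, b6, b7 }
        out[2] = _mm_shuffle_epi8(block, shuf2);
        out[3] = _mm_shuffle_epi8(block, shuf3);
    }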
See the edit history for an unfinished attempt at using `punpck` before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.