Question
I have a problem converting image data stored in a byte[] array to grayscale. I want to use SIMD vector operations because in the future I need to write ASM and C++ DLL files to measure operation times.
When I read about SIMD I found that SSE instructions operate on 128-bit registers, so there is a problem: I need to convert my byte[] array into several Vector<T> values stored in a List<T>.
The image is a four-channel RGBA JPEG, so I also need to know how to create vectors with the R, G and B data from a single 128-bit Vector<T>. After that, I can use the grayscale formula:
fY(R, G, B) = R x 0.29891 + G x 0.58661 + B x 0.11448
All in all, the questions are:
- How do I load chunks of the byte[] array into 128-bit Vector<T> registers?
- How do I separate the R, G and B values of one Vector<T>, multiply them, and copy the result back to the source vector?
Answer 1:
It requires System.Runtime.Intrinsics.Experimental.dll and unsafe code, but it's relatively straightforward, and probably fast enough for many practical applications.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

/// <summary>Load 4 pixels of RGBA</summary>
static unsafe Vector128<int> load4( byte* src )
{
    return Sse2.LoadVector128( (int*)src );
}

/// <summary>Pack red channel of 8 pixels into ushort values in [ 0xFF00 .. 0 ] interval</summary>
static Vector128<ushort> packRed( Vector128<int> a, Vector128<int> b )
{
    Vector128<int> mask = Vector128.Create( 0xFF );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    // The packed values have zero high bytes, so a 1-byte shift moves each value into its high byte
    return Sse2.ShiftLeftLogical128BitLane( Sse41.PackUnsignedSaturate( a, b ), 1 );
}

/// <summary>Pack green channel of 8 pixels into ushort values in [ 0xFF00 .. 0 ] interval</summary>
static Vector128<ushort> packGreen( Vector128<int> a, Vector128<int> b )
{
    Vector128<int> mask = Vector128.Create( 0xFF00 );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Pack blue channel of 8 pixels into ushort values in [ 0xFF00 .. 0 ] interval</summary>
static Vector128<ushort> packBlue( Vector128<int> a, Vector128<int> b )
{
    // Shift by 1 byte so that blue lands where green was, then reuse the green mask
    a = Sse2.ShiftRightLogical128BitLane( a, 1 );
    b = Sse2.ShiftRightLogical128BitLane( b, 1 );
    Vector128<int> mask = Vector128.Create( 0xFF00 );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Load 8 pixels, split into RGB channels.</summary>
static unsafe void loadRgb( byte* src, out Vector128<ushort> red, out Vector128<ushort> green, out Vector128<ushort> blue )
{
    var a = load4( src );
    var b = load4( src + 16 );
    red = packRed( a, b );
    green = packGreen( a, b );
    blue = packBlue( a, b );
}

// Grayscale weights as 16-bit fixed-point fractions
const ushort mulRed = (ushort)( 0.29891 * 0x10000 );
const ushort mulGreen = (ushort)( 0.58661 * 0x10000 );
const ushort mulBlue = (ushort)( 0.11448 * 0x10000 );

/// <summary>Compute brightness of 8 pixels</summary>
static Vector128<short> brightness( Vector128<ushort> r, Vector128<ushort> g, Vector128<ushort> b )
{
    r = Sse2.MultiplyHigh( r, Vector128.Create( mulRed ) );
    g = Sse2.MultiplyHigh( g, Vector128.Create( mulGreen ) );
    b = Sse2.MultiplyHigh( b, Vector128.Create( mulBlue ) );
    var result = Sse2.AddSaturate( Sse2.AddSaturate( r, g ), b );
    return Vector128.AsInt16( Sse2.ShiftRightLogical( result, 8 ) );
}

/// <summary>Convert buffer from RGBA to grayscale.</summary>
/// <remarks>
/// <para>If your image has line paddings, you'll want to call this once per line, not for the complete image.</para>
/// <para>If the width of the image is not a multiple of 16 pixels, you'll need to do more work to handle the last few pixels of every line.</para>
/// </remarks>
static unsafe void convertToGrayscale( byte* src, byte* dst, int count )
{
    byte* srcEnd = src + count * 4;
    while( src < srcEnd )
    {
        // Each iteration consumes 16 RGBA pixels (64 bytes) and writes 16 grayscale bytes
        loadRgb( src, out var r, out var g, out var b );
        var low = brightness( r, g, b );
        loadRgb( src + 32, out r, out g, out b );
        var hi = brightness( r, g, b );
        var bytes = Sse2.PackUnsignedSaturate( low, hi );
        Sse2.Store( dst, bytes );
        src += 64;
        dst += 16;
    }
}
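For the leftover pixels that the remarks mention, or just to check the SIMD output, a scalar routine that mirrors the same fixed-point arithmetic can help. This is a minimal sketch and not part of the original code: the class `ScalarGrayscale` and its methods `MulHigh`, `Pixel` and `ConvertRange` are hypothetical names, and it assumes tightly packed RGBA bytes with the alpha channel ignored.

```csharp
using System;

static class ScalarGrayscale
{
    // Same 16-bit fixed-point multipliers as in the SIMD code
    const ushort mulRed = (ushort)( 0.29891 * 0x10000 );
    const ushort mulGreen = (ushort)( 0.58661 * 0x10000 );
    const ushort mulBlue = (ushort)( 0.11448 * 0x10000 );

    // Scalar stand-in for a 16-bit unsigned "multiply high": upper 16 bits of the 32-bit product.
    // The product is computed in 64 bits because 0xFF00 * 38444 does not fit into a signed int.
    static int MulHigh( int x, int mul )
    {
        return (int)( ( (long)x * mul ) >> 16 );
    }

    // Brightness of one pixel, computed step by step the way each SIMD lane is
    public static byte Pixel( byte r, byte g, byte b )
    {
        // Move every channel into the high byte, i.e. into the [ 0 .. 0xFF00 ] range, then scale it
        int rr = MulHigh( r << 8, mulRed );
        int gg = MulHigh( g << 8, mulGreen );
        int bb = MulHigh( b << 8, mulBlue );
        // Saturating 16-bit addition, then drop the low byte
        int sum = Math.Min( rr + gg + bb, 0xFFFF );
        return (byte)( sum >> 8 );
    }

    // Convert `count` RGBA pixels, starting at pixel index `first`
    public static void ConvertRange( byte[] rgba, byte[] gray, int first, int count )
    {
        for( int i = first; i < first + count; i++ )
            gray[ i ] = Pixel( rgba[ i * 4 ], rgba[ i * 4 + 1 ], rgba[ i * 4 + 2 ] );
    }
}
```

Because every step truncates, pure white comes out as 254 rather than 255 here, and the vector code above does the same. To call convertToGrayscale on managed arrays, pin them first with `fixed( byte* p = pixels )`, run the vector loop over the largest multiple of 16 pixels, and hand the remainder to a routine like this one.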
However, an equivalent C++ implementation would be faster. C# did a decent job inlining these functions, i.e. convertToGrayscale contains no function calls.
But the code of that function is far from optimal. .NET failed to propagate constants: for the magic numbers it emitted code like this inside the loop:
mov r8d,962Ch
vmovd xmm1,r8d
vpbroadcastw xmm1,xmm1
The generated code only uses 6 out of 16 registers. There are enough available registers for all the magic numbers involved.
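One possible workaround, offered as a sketch whose speed has not been measured rather than as a verified fix, is to build the broadcast multipliers once before the loop and pass them in, so they are not rebuilt on every iteration even if the JIT does nothing clever. The class `HoistedBrightness`, its methods `Weighted` and `Run`, and the planar `Vector128<ushort>` inputs are hypothetical and exist only to keep the example self-contained. Whether the JIT then keeps the three vectors in registers depends on the runtime version, so it is worth checking the disassembly.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class HoistedBrightness
{
    const ushort mulRed = (ushort)( 0.29891 * 0x10000 );
    const ushort mulGreen = (ushort)( 0.58661 * 0x10000 );
    const ushort mulBlue = (ushort)( 0.11448 * 0x10000 );

    // Same arithmetic as brightness(), except the multipliers arrive already broadcast
    static Vector128<ushort> Weighted(
        Vector128<ushort> r, Vector128<ushort> g, Vector128<ushort> b,
        Vector128<ushort> kr, Vector128<ushort> kg, Vector128<ushort> kb )
    {
        r = Sse2.MultiplyHigh( r, kr );
        g = Sse2.MultiplyHigh( g, kg );
        b = Sse2.MultiplyHigh( b, kb );
        return Sse2.ShiftRightLogical( Sse2.AddSaturate( Sse2.AddSaturate( r, g ), b ), 8 );
    }

    // Hypothetical driver: every vector holds one channel of 8 pixels,
    // with values in the [ 0 .. 0xFF00 ] range as produced by packRed and friends
    public static void Run(
        ReadOnlySpan<Vector128<ushort>> red,
        ReadOnlySpan<Vector128<ushort>> green,
        ReadOnlySpan<Vector128<ushort>> blue,
        Span<Vector128<ushort>> dst )
    {
        if( !Sse2.IsSupported )
            throw new PlatformNotSupportedException();

        // Broadcast each multiplier exactly once, outside the loop
        Vector128<ushort> kr = Vector128.Create( mulRed );
        Vector128<ushort> kg = Vector128.Create( mulGreen );
        Vector128<ushort> kb = Vector128.Create( mulBlue );

        for( int i = 0; i < dst.Length; i++ )
            dst[ i ] = Weighted( red[ i ], green[ i ], blue[ i ], kr, kg, kb );
    }
}
```

The same idea would apply to the masks created inside packRed, packGreen and packBlue.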
Also, .NET emits many redundant instructions which just shuffle data around:
vmovaps xmm2, xmm0
vmovaps xmm3, xmm1
Source: https://stackoverflow.com/questions/58881359/c-sharp-how-to-convert-byte-array-of-image-pixels-data-to-grayscale-using-vect