C# - How to convert byte array of image pixels data to grayscale using vector SSE operation

Question

I have a problem converting image data stored in a byte[] array to grayscale. I want to use vector SIMD operations because in the future I need to write ASM and C++ DLLs to measure operation times.

While reading about SIMD I found that SSE instructions operate on 128-bit registers, so my problem is that I need to split my byte[] array into several Vector<T> values stored in a List<T>.

The image is a four-channel RGBA JPEG, so I also need to know how to create vectors with the R, G, B data from a single 128-bit Vector<T>. After that, I can apply the grayscale formula:

fY(R, G, B) = R × 0.29891 + G × 0.58661 + B × 0.11448

All in all, my questions are:

1. How do I load chunks of a byte[] array into 128-bit Vector<T> registers?
2. How do I separate the R, G, B values of one Vector<T>, multiply them, and copy the result back to the source vector?

Answer 1:

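This requires the System.Runtime.Intrinsics APIs (built into .NET Core 3.0 and later; earlier previews shipped them in System.Runtime.Intrinsics.Experimental.dll) and unsafe code, but it's relatively straightforward, and probably fast enough for many practical applications.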
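The code in this answer relies on SSE2 for most operations and on SSE4.1 for Sse41.PackUnsignedSaturate, so it seems prudent to check for both before taking the SIMD path. A minimal sketch of such a guard follows; the class and method names (SimdSupport, CanUseSimdGrayscale) are hypothetical, and only the IsSupported properties come from System.Runtime.Intrinsics.X86.

```csharp
using System.Runtime.Intrinsics.X86;

static class SimdSupport
{
    /// <summary>True when the CPU and runtime expose the instruction sets used in this answer.</summary>
    // Hypothetical helper name; only the IsSupported properties are framework APIs.
    public static bool CanUseSimdGrayscale()
    {
        // SSE2 covers the masks, shifts, multiplies and the final 16-bit to 8-bit pack;
        // SSE4.1 is needed for the 32-bit to 16-bit unsigned pack.
        return Sse2.IsSupported && Sse41.IsSupported;
    }
}
```

When the check returns false (for example on a non-x86 machine), a plain scalar loop is the obvious fallback.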

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

/// <summary>Load 4 pixels of RGBA (16 bytes)</summary>
static unsafe Vector128<int> load4( byte* src )
{
    return Sse2.LoadVector128( (int*)src );
}

/// <summary>Pack red channel of 8 pixels into ushort values in [ 0 .. 0xFF00 ] interval</summary>
static Vector128<ushort> packRed( Vector128<int> a, Vector128<int> b )
{
    Vector128<int> mask = Vector128.Create( 0xFF );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse2.ShiftLeftLogical128BitLane( Sse41.PackUnsignedSaturate( a, b ), 1 );
}

/// <summary>Pack green channel of 8 pixels into ushort values in [ 0 .. 0xFF00 ] interval</summary>
static Vector128<ushort> packGreen( Vector128<int> a, Vector128<int> b )
{
    Vector128<int> mask = Vector128.Create( 0xFF00 );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Pack blue channel of 8 pixels into ushort values in [ 0 .. 0xFF00 ] interval</summary>
static Vector128<ushort> packBlue( Vector128<int> a, Vector128<int> b )
{
    a = Sse2.ShiftRightLogical128BitLane( a, 1 );
    b = Sse2.ShiftRightLogical128BitLane( b, 1 );
    Vector128<int> mask = Vector128.Create( 0xFF00 );
    a = Sse2.And( a, mask );
    b = Sse2.And( b, mask );
    return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Load 8 pixels, split into RGB channels.</summary>
static unsafe void loadRgb( byte* src, out Vector128<ushort> red, out Vector128<ushort> green, out Vector128<ushort> blue )
{
    var a = load4( src );
    var b = load4( src + 16 );
    red = packRed( a, b );
    green = packGreen( a, b );
    blue = packBlue( a, b );
}

// Channel weights as 16-bit fixed-point fractions: weight * 65536, truncated.
const ushort mulRed = (ushort)( 0.29891 * 0x10000 );
const ushort mulGreen = (ushort)( 0.58661 * 0x10000 );
const ushort mulBlue = (ushort)( 0.11448 * 0x10000 );

/// <summary>Compute brightness of 8 pixels</summary>
static Vector128<short> brightness( Vector128<ushort> r, Vector128<ushort> g, Vector128<ushort> b )
{
    r = Sse2.MultiplyHigh( r, Vector128.Create( mulRed ) );
    g = Sse2.MultiplyHigh( g, Vector128.Create( mulGreen ) );
    b = Sse2.MultiplyHigh( b, Vector128.Create( mulBlue ) );
    var result = Sse2.AddSaturate( Sse2.AddSaturate( r, g ), b );
    return Vector128.AsInt16( Sse2.ShiftRightLogical( result, 8 ) );
}

/// <summary>Convert buffer from RGBA to grayscale.</summary>
/// <remarks>
/// <para>If your image has line padding, you'll want to call this once per line, not for the complete image.</para>
/// <para>If the width of the image is not a multiple of 16 pixels, you'll need to do more work to handle the last few pixels of every line.</para>
/// </remarks>
static unsafe void convertToGrayscale( byte* src, byte* dst, int count )
{
    byte* srcEnd = src + count * 4;
    while( src < srcEnd )
    {
        loadRgb( src, out var r, out var g, out var b );
        var low = brightness( r, g, b );
        loadRgb( src + 32, out r, out g, out b );
        var hi = brightness( r, g, b );

        var bytes = Sse2.PackUnsignedSaturate( low, hi );
        Sse2.Store( dst, bytes );

        src += 64;
        dst += 16;
    }
}
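For the last few pixels of a line, or for checking the SIMD output, a scalar routine that follows the same integer steps may be useful. The sketch below is not part of the original answer; the class and method names (GrayscaleScalar, Brightness, Convert) are hypothetical, and it assumes the same weights and the same truncating fixed-point arithmetic as the code above.

```csharp
using System;

static class GrayscaleScalar
{
    // Same fixed-point weights as mulRed / mulGreen / mulBlue in the answer.
    const uint MulRed = (uint)( 0.29891 * 0x10000 );
    const uint MulGreen = (uint)( 0.58661 * 0x10000 );
    const uint MulBlue = (uint)( 0.11448 * 0x10000 );

    /// <summary>Brightness of one pixel, mirroring the integer steps of the SIMD code.</summary>
    public static byte Brightness( byte r, byte g, byte b )
    {
        // Put the channel into the high byte of a 16-bit value, multiply,
        // and keep the high 16 bits of the product (what MultiplyHigh does).
        uint rr = ( ( (uint)r << 8 ) * MulRed ) >> 16;
        uint gg = ( ( (uint)g << 8 ) * MulGreen ) >> 16;
        uint bb = ( ( (uint)b << 8 ) * MulBlue ) >> 16;
        // AddSaturate on ushort lanes clamps at 0xFFFF.
        uint sum = Math.Min( rr + gg + bb, 0xFFFFu );
        // The final shift by 8 brings the result back to a byte.
        return (byte)( sum >> 8 );
    }

    /// <summary>Convert pixels [ first .. first + count ) of an RGBA buffer, one output byte per pixel.</summary>
    public static void Convert( ReadOnlySpan<byte> rgba, Span<byte> gray, int first, int count )
    {
        for( int i = first; i < first + count; i++ )
            gray[ i ] = Brightness( rgba[ i * 4 ], rgba[ i * 4 + 1 ], rgba[ i * 4 + 2 ] );
    }
}
```

One possible use: run convertToGrayscale over the first count - count % 16 pixels of a line, then call Convert for the rest. Because every step truncates, pure white comes out as 254 rather than 255 with these weights.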

However, an equivalent C++ implementation would be faster. C# did a decent job of inlining these functions, i.e. convertToGrayscale contains no function calls, but the code of that function is far from optimal. The .NET JIT failed to propagate constants; for the magic numbers it emitted code like this inside the loop:

mov         r8d,962Ch
vmovd       xmm1,r8d
vpbroadcastw xmm1,xmm1

The generated code only uses 6 out of 16 registers. There are enough available registers for all the magic numbers involved.
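One way to nudge the JIT is to broadcast the constants once before the loop and pass them along. The sketch below is a variant under that assumption, not the original author's code: the names GrayscaleHoisted, Convert and Brightness8 are hypothetical, it uses MemoryMarshal.Cast over spans instead of pointers, and it requires a pixel count that is a multiple of 16. Whether it actually produces better machine code depends on the runtime version, so it is worth measuring.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class GrayscaleHoisted
{
    const ushort MulRed = (ushort)( 0.29891 * 0x10000 );
    const ushort MulGreen = (ushort)( 0.58661 * 0x10000 );
    const ushort MulBlue = (ushort)( 0.11448 * 0x10000 );

    /// <summary>Convert RGBA to grayscale. Needs SSE4.1 and a pixel count that is a multiple of 16.</summary>
    public static void Convert( ReadOnlySpan<byte> rgba, Span<byte> gray )
    {
        if( gray.Length % 16 != 0 || rgba.Length != gray.Length * 4 )
            throw new ArgumentException( "Expected 4 input bytes per output byte, in blocks of 16 pixels." );

        // Reinterpret both buffers as sequences of 16-byte vectors.
        ReadOnlySpan<Vector128<int>> src = MemoryMarshal.Cast<byte, Vector128<int>>( rgba );
        Span<Vector128<byte>> dst = MemoryMarshal.Cast<byte, Vector128<byte>>( gray );

        // Broadcast every constant once, outside the loop.
        Vector128<int> maskLow = Vector128.Create( 0xFF );
        Vector128<int> maskHigh = Vector128.Create( 0xFF00 );
        Vector128<ushort> wr = Vector128.Create( MulRed );
        Vector128<ushort> wg = Vector128.Create( MulGreen );
        Vector128<ushort> wb = Vector128.Create( MulBlue );

        for( int i = 0; i < dst.Length; i++ )
        {
            // Each output vector covers 16 pixels, i.e. 4 input vectors.
            var low = Brightness8( src[ i * 4 ], src[ i * 4 + 1 ], maskLow, maskHigh, wr, wg, wb );
            var high = Brightness8( src[ i * 4 + 2 ], src[ i * 4 + 3 ], maskLow, maskHigh, wr, wg, wb );
            dst[ i ] = Sse2.PackUnsignedSaturate( low, high );
        }
    }

    /// <summary>Brightness of 8 pixels held in two vectors of 4 RGBA pixels each.</summary>
    static Vector128<short> Brightness8( Vector128<int> a, Vector128<int> b,
        Vector128<int> maskLow, Vector128<int> maskHigh,
        Vector128<ushort> wr, Vector128<ushort> wg, Vector128<ushort> wb )
    {
        // Same channel extraction as packRed / packGreen / packBlue in the answer.
        Vector128<ushort> red = Sse2.ShiftLeftLogical128BitLane(
            Sse41.PackUnsignedSaturate( Sse2.And( a, maskLow ), Sse2.And( b, maskLow ) ), 1 );
        Vector128<ushort> green = Sse41.PackUnsignedSaturate( Sse2.And( a, maskHigh ), Sse2.And( b, maskHigh ) );
        Vector128<ushort> blue = Sse41.PackUnsignedSaturate(
            Sse2.And( Sse2.ShiftRightLogical128BitLane( a, 1 ), maskHigh ),
            Sse2.And( Sse2.ShiftRightLogical128BitLane( b, 1 ), maskHigh ) );

        red = Sse2.MultiplyHigh( red, wr );
        green = Sse2.MultiplyHigh( green, wg );
        blue = Sse2.MultiplyHigh( blue, wb );
        Vector128<ushort> sum = Sse2.AddSaturate( Sse2.AddSaturate( red, green ), blue );
        return Sse2.ShiftRightLogical( sum, 8 ).AsInt16();
    }
}
```

Passing the vectors as parameters keeps the helper free of Vector128.Create calls, so after inlining there is nothing left in the loop body that has to rebuild a constant.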

Also, .NET emits many redundant instructions that just shuffle data around:

vmovaps xmm2, xmm0
vmovaps xmm3, xmm1

Source: https://stackoverflow.com/questions/58881359/c-sharp-how-to-convert-byte-array-of-image-pixels-data-to-grayscale-using-vect

Tags: image, vectorization, sse, simd