Fastest way to copy a blittable struct to an unmanaged memory location (IntPtr)

Submitted by 心不动则不痛 on 2019-12-04 15:38:06

One answer is to reimplement memcpy in C#, using the same optimization tricks that native memcpy implementations use. Microsoft does this in its own source; see the Buffer.cs file in the Microsoft Reference Source:

     // This is tricky to get right AND fast, so lets make it useful for the whole Fx.
     // E.g. System.Runtime.WindowsRuntime!WindowsRuntimeBufferExtensions.MemCopy uses it.
     internal unsafe static void Memcpy(byte* dest, byte* src, int len) {

        // This is portable version of memcpy. It mirrors what the hand optimized assembly versions of memcpy typically do.
        // Ideally, we would just use the cpblk IL instruction here. Unfortunately, cpblk IL instruction is not as efficient as
        // possible yet and so we have this implementation here for now.

        switch (len)
        {
        case 0:
            return;
        case 1:
            *dest = *src;
            return;
        case 2:
            *(short *)dest = *(short *)src;
            return;
        case 3:
            *(short *)dest = *(short *)src;
            *(dest + 2) = *(src + 2);
            return;
        case 4:
            *(int *)dest = *(int *)src;
            return;
        ...

It's interesting to note that they implement the copy in managed code for all sizes below 512 bytes; most of the sizes use pointer aliasing tricks to get the JIT to emit instructions that operate on differing sizes. Only at 512 bytes and above do they finally drop into invoking the native memcpy:

        // P/Invoke into the native version for large lengths
        if (len >= 512)
        {
            _Memcpy(dest, src, len);
            return;
        }

Presumably, native memcpy is even faster at those sizes, since it can be hand-optimized to use SSE/MMX instructions to perform the copy.
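As a rough illustration of the aliasing trick described above (not the framework's actual code), a copy routine can cast the byte pointers to a wider pointer type so that each assignment moves eight bytes. The class and method names here are made up for the example, and it assumes the buffers do not overlap and that the CPU tolerates unaligned 8-byte accesses, as x86/x64 does:

```csharp
using System;

public static unsafe class WideCopySketch
{
    // Hypothetical helper, not the framework's Memcpy: copies len bytes,
    // eight at a time where possible, then finishes byte by byte.
    public static void Copy(byte* dest, byte* src, int len)
    {
        int i = 0;

        // Reinterpret both pointers as long* so each store moves 8 bytes.
        for (; i + 8 <= len; i += 8)
        {
            *(long*)(dest + i) = *(long*)(src + i);
        }

        // Copy the remaining 0..7 bytes one at a time.
        for (; i < len; i++)
        {
            dest[i] = src[i];
        }
    }
}
```

The framework version goes further by special-casing each small length in a switch, which avoids loop overhead entirely for tiny copies.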

As per BenVoigt's suggestion, I tried a few options. For all these tests I compiled with the Any CPU architecture, on a standard VS2013 Release build, and ran the test outside of the IDE. Before each test was measured, the methods DoTestA() and DoTestB() were run multiple times to let the JIT warm up.


First, I compared Marshal.StructureToPtr to a byte-by-byte loop with various struct sizes. I've shown the code below using a SixtyFourByteStruct:

private unsafe static void DoTestA() {
    fixed (SixtyFourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(SixtyFourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private static void DoTestB() {
    Marshal.StructureToPtr(structToCopy, unmanagedTarget, false);
}

And the results:

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        82ns         0ns          22,000ns     21,917ns     ! 41.017ms
B        137ns        0ns          38,700ns     38,562ns     ! 68.834ms

As you can see, the manual loop is faster (as I suspected). The results are similar for a sixteen-byte and a four-byte struct, with the difference becoming more pronounced as the struct gets smaller.
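The test methods rely on a struct type and two fields, structToCopy and unmanagedTarget, whose definitions are not shown. A minimal sketch of what that setup might look like, assuming the target is allocated with Marshal.AllocHGlobal; the field layout of the struct and the CopyTestSetup class are guesses made for illustration only:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the 64-byte test struct: eight 8-byte fields.
[StructLayout(LayoutKind.Sequential)]
public struct SixtyFourByteStruct
{
    public long A, B, C, D, E, F, G, H;
}

public static class CopyTestSetup
{
    public static SixtyFourByteStruct structToCopy;
    public static IntPtr unmanagedTarget;

    public static void Allocate()
    {
        // Reserve unmanaged memory big enough to hold one struct.
        unmanagedTarget = Marshal.AllocHGlobal(Marshal.SizeOf(typeof(SixtyFourByteStruct)));
    }

    public static void Release()
    {
        // Unmanaged memory is not garbage collected, so free it explicitly.
        Marshal.FreeHGlobal(unmanagedTarget);
        unmanagedTarget = IntPtr.Zero;
    }
}
```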


So now, to try the manual copy vs using P/Invoke and memcpy:

private unsafe static void DoTestA() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        byte* structStart = (byte*) fixedStruct;
        byte* targetStart = (byte*) unmanagedTarget;
        for (byte* structPtr = structStart, targetPtr = targetStart; structPtr < structStart + sizeof(FourByteStruct); ++structPtr, ++targetPtr) {
            *targetPtr = *structPtr;
        }
    }
}

private unsafe static void DoTestB() {
    fixed (FourByteStruct* fixedStruct = &structToCopy) {
        memcpy(unmanagedTarget, (IntPtr) fixedStruct, new UIntPtr((uint) sizeof(FourByteStruct)));
    }
}
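The P/Invoke declaration behind the memcpy call is not shown. One common way to declare it, assuming Windows and the C runtime in msvcrt.dll (the NativeMethods class name is only a convention, and the declaration actually used in the test may have differed):

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeMethods
{
    // Assumption: Windows only; memcpy is imported from the C runtime msvcrt.dll.
    // The parameter types match the call in DoTestB: (IntPtr, IntPtr, UIntPtr).
    [DllImport("msvcrt.dll", EntryPoint = "memcpy",
        CallingConvention = CallingConvention.Cdecl, SetLastError = false)]
    public static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);
}
```

Each call crosses the managed/native boundary, which adds a fixed cost per call; that is consistent with memcpy only pulling ahead for larger structs.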

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        61ns         0ns          28,000ns     27,938ns     ! 30.736ms
B        84ns         0ns          45,900ns     45,815ns     ! 42.216ms

So, it seems that the manual copy is still better in my case. Like before, the results were pretty similar for 4/16/64 byte structs (though the gap was <10ns for 64-byte size).


It occurred to me that I was only testing structures that fit within a single cache line (I have a standard x86_64 CPU). So I tried a 128-byte structure, and it swung the balance in favour of memcpy:

>>> 500000 repetitions >>> IN NANOSECONDS (1000ns = 0.001ms)
Method   Avg.         Min.         Max.         Jitter       Total
A        104ns        0ns          48,300ns     48,195ns     ! 52.150ms
B        84ns         0ns          38,400ns     38,315ns     ! 42.284ms

Anyway, the conclusion to all that is that the byte-by-byte copy seems the fastest for any struct of size <=64 bytes on an x86_64 CPU on my machine. Take it as you will (and maybe someone will spot an inefficiency in my code anyway).

FYI, I'm posting how I used the accepted answer, for others' benefit, because there's a twist when accessing the method via reflection: it's overloaded.

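using System.Linq;
using System.Reflection;

// Marked unsafe because typeof(byte*) in the static constructor needs an unsafe context.
public static unsafe class Buffer
{
    public unsafe delegate void MemcpyDelegate(byte* dest, byte* src, int len);

    public static readonly MemcpyDelegate Memcpy;
    static Buffer()
    {
        // System.Buffer.Memcpy is internal and overloaded, so first collect every overload by name...
        var methods = typeof (System.Buffer).GetMethods(BindingFlags.Static | BindingFlags.NonPublic).Where(m=>m.Name == "Memcpy");
        // ...then pick the (byte*, byte*, int) overload by its exact parameter types.
        // As an internal API, this overload may not exist on other runtime versions.
        var memcpy = methods.First(mi => mi.GetParameters().Select(p => p.ParameterType).SequenceEqual(new[] {typeof (byte*), typeof (byte*), typeof (int)}));
        Memcpy = (MemcpyDelegate) memcpy.CreateDelegate(typeof (MemcpyDelegate));
    }
}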

Usage:

public static unsafe void MemcpyExample()
{
     int src = 12345;
     int dst = 0;
     Buffer.Memcpy((byte*) &dst, (byte*) &src, sizeof (int));
     System.Diagnostics.Debug.Assert(dst==12345);
}
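
On .NET Framework 4.6 and later, the public System.Buffer.MemoryCopy method covers the same ground without reflection. A minimal sketch, with a hypothetical class and method name; note that it takes the source first and the destination second, the opposite of Memcpy:

```csharp
using System;

public static class MemoryCopyExample
{
    // Copies an int through Buffer.MemoryCopy and returns the copied value.
    public static unsafe int CopyInt(int value)
    {
        int src = value;
        int dst = 0;

        // Arguments: source, destination, destination capacity in bytes, bytes to copy.
        System.Buffer.MemoryCopy(&src, &dst, sizeof(int), sizeof(int));
        return dst;
    }
}
```

The third argument is the destination capacity; the method throws if the number of bytes to copy exceeds it, which gives a basic bounds check that a raw memcpy lacks.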

    public void SetVariable<T>(T newValue) where T : struct

You cannot use generics to accomplish this the fast way. The compiler doesn't take your pretty blue eyes as a guarantee that T is actually blittable; the struct constraint isn't good enough. You should use overloads:

    public unsafe void SetVariable(int newValue) {
        *(int*)varPtr = newValue;
    }
    public unsafe void SetVariable(double newValue) {
        *(double*)varPtr = newValue;
    }
    public unsafe void SetVariable(Point newValue) {
        *(Point*)varPtr = newValue;
    }
    // etc...

This might be inconvenient, but it is blindingly fast. In Release mode it compiles to a single MOV instruction with no method call overhead, which is the fastest it could be.

And here is the fallback case; the profiler will tell you when you need to add an overload:

    public unsafe void SetVariable<T>(T newValue) {
        Marshal.StructureToPtr(newValue, (IntPtr)varPtr, false);
    }
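
For readers on C# 7.3 or later, which postdates this answer, the unmanaged constraint lets a generic method take a pointer to T directly. A minimal sketch, assuming varPtr points at unmanaged memory owned by the class; the VariableSlot type and its GetVariable method are hypothetical. Note that unmanaged means "contains no references", which is close to but not identical to blittable (bool and char qualify, for example):

```csharp
using System;
using System.Runtime.InteropServices;

public sealed unsafe class VariableSlot : IDisposable
{
    private void* varPtr;

    public VariableSlot(int sizeInBytes)
    {
        // Hypothetical setup: the slot owns a block of unmanaged memory.
        varPtr = (void*)Marshal.AllocHGlobal(sizeInBytes);
    }

    // The unmanaged constraint makes T* a legal pointer type.
    public void SetVariable<T>(T newValue) where T : unmanaged
    {
        *(T*)varPtr = newValue;
    }

    public T GetVariable<T>() where T : unmanaged
    {
        return *(T*)varPtr;
    }

    public void Dispose()
    {
        if (varPtr != null)
        {
            Marshal.FreeHGlobal((IntPtr)varPtr);
            varPtr = null;
        }
    }
}
```

The caller is still responsible for making sure the slot is at least sizeof(T) bytes; the constraint checks the type, not the buffer size.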