How to increase performance of memcpy

春和景丽 2020-12-04 07:11

Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full d

8 Answers
  • 2020-12-04 08:09

    You have a few barriers to obtaining the required memory performance:

    1. Bandwidth - there is a limit to how quickly data can move from memory to the CPU and back again. According to this Wikipedia article, 266MHz DDR3 RAM has an upper limit of around 17GB/s. With a memcpy you need to halve this to get your maximum transfer rate, since the data is read and then written. From your benchmark results, it looks like you're not running the fastest possible RAM in your system. If you can afford it, upgrade the motherboard / RAM (it won't be cheap: Overclockers in the UK currently have 3x4GB PC16000 at £400).

    2. The OS - Windows is a preemptive multitasking OS, so every so often your process will be suspended to let other processes run. This will clobber your caches and stall your transfer. In the worst case your entire process could be paged out to disk!

    3. The CPU - the data being moved has a long way to go: RAM -> L2 Cache -> L1 Cache -> CPU -> L1 -> L2 -> RAM. There may even be an L3 cache. If you want to involve the CPU you really want to be loading L2 whilst copying L1. Unfortunately, modern CPUs can run through an L1 cache block quicker than the time taken to load the L1. The CPU has a memory controller that helps a lot in these cases, where you're streaming data into the CPU sequentially, but you're still going to have problems.
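
    To see how far a system is from the bandwidth limit in point 1, one can time a plain memcpy and compare the result with half the theoretical peak. The following is a minimal sketch in portable C, not the benchmark from the question. The function names are made up for illustration, and it relies on clock(), which on some platforms measures CPU time rather than wall-clock time.

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Rule of thumb from the text: a copy reads and then writes every byte,
       so the copy rate is at most half of the peak memory bandwidth. */
    double memcpy_ceiling_gb_per_s(double peak_bandwidth_gb_per_s)
    {
        return peak_bandwidth_gb_per_s / 2.0;
    }

    /* Copies one block `iterations` times and returns the rate in MB/s
       (1 MB = 2^20 bytes), or -1.0 if allocation, timing or the copy failed. */
    double memcpy_throughput_mb_per_s(size_t block_bytes, int iterations)
    {
        assert(block_bytes > 0 && iterations > 0);

        unsigned char *src = malloc(block_bytes);
        unsigned char *dst = malloc(block_bytes);
        if (src == NULL || dst == NULL) {
            free(src);
            free(dst);
            return -1.0;
        }
        memset(src, 0xA5, block_bytes);  /* touch the pages before timing */
        memset(dst, 0, block_bytes);

        clock_t start = clock();
        for (int i = 0; i < iterations; i++) {
            src[0] = (unsigned char)i;   /* vary the input between passes */
            memcpy(dst, src, block_bytes);
        }
        clock_t stop = clock();

        /* Compare after timing so the copies cannot be treated as unused. */
        int copied_ok = (memcmp(dst, src, block_bytes) == 0);
        free(src);
        free(dst);

        double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
        if (!copied_ok || seconds <= 0.0)
            return -1.0;
        return ((double)block_bytes * iterations / (1024.0 * 1024.0)) / seconds;
    }
    ```

    For example, memcpy_ceiling_gb_per_s(17.0) gives 8.5 for the DDR3 figure quoted in point 1, and memcpy_throughput_mb_per_s(4 * 1024 * 1024, 1000) times a thousand copies of a 4MB block.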

    Of course, the fastest way to do something is to not do it. Can the captured data be written anywhere in RAM, or is the buffer at a fixed location? If you can write it anywhere, then you don't need the memcpy at all. If it's fixed, could you process the data in place and use a double-buffer type of system? That is, start capturing data and, when the buffer is half full, start processing the first half of the data. When the buffer is full, start writing captured data to the start again and process the second half. This requires that the algorithm can process the data faster than the capture card produces it. It also assumes that the data is discarded after processing. Effectively, this is a memcpy with a transformation as part of the copy process, so you've got:

    load -> transform -> save
    \--/                 \--/
     capture card        RAM
       buffer
    

    instead of:

    load -> save -> load -> transform -> save
    \-----------/
    memcpy from
    capture card
    buffer to RAM
    

    Or get faster RAM!
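
    The difference between the two diagrams above can be sketched in C as follows. This is only an illustration: transform_sample is a hypothetical stand-in for whatever processing the application really does (here it just inverts each byte), and both function names are invented for the example.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical placeholder for the real per-sample processing. */
    static unsigned char transform_sample(unsigned char s)
    {
        return (unsigned char)(255 - s);
    }

    /* load -> save -> load -> transform -> save:
       memcpy out of the capture buffer first, then process the copy. */
    void copy_then_transform(unsigned char *dst, const unsigned char *capture, size_t n)
    {
        assert(dst != NULL && capture != NULL);
        memcpy(dst, capture, n);
        for (size_t i = 0; i < n; i++)
            dst[i] = transform_sample(dst[i]);
    }

    /* load -> transform -> save:
       read straight from the capture buffer and store only the result. */
    void transform_while_copying(unsigned char *dst, const unsigned char *capture, size_t n)
    {
        assert(dst != NULL && capture != NULL);
        for (size_t i = 0; i < n; i++)
            dst[i] = transform_sample(capture[i]);
    }
    ```

    Both functions produce the same output, but the second one writes each destination byte once and never reads it back, so the separate memcpy pass disappears.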

    EDIT: Another option is to process the data between the data source and the PC - could you put a DSP / FPGA in there at all? Custom hardware will always be faster than a general purpose CPU.

    Another thought: It's been a while since I've done any high performance graphics stuff, but could you DMA the data into the graphics card and then DMA it out again? You could even take advantage of CUDA to do some of the processing. This would take the CPU out of the memory transfer loop altogether.

  • 2020-12-04 08:12

    I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

    Performance (10000x 4MB block memcpy):
    
     1 thread :  1826 MB/sec
     2 threads:  3118 MB/sec
     3 threads:  4121 MB/sec
     4 threads: 10020 MB/sec
     5 threads: 12848 MB/sec
     6 threads: 14340 MB/sec
     8 threads: 17892 MB/sec
    10 threads: 21781 MB/sec
    12 threads: 25721 MB/sec
    14 threads: 25318 MB/sec
    16 threads: 19965 MB/sec
    24 threads: 13158 MB/sec
    32 threads: 12497 MB/sec
    

    I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?
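
    To put numbers on that jump, the rates in the table can be divided by the single-thread rate. The helper names below are hypothetical; the sketch only does the arithmetic and says nothing about the cause.

    ```c
    #include <assert.h>

    /* Ratio of a measured rate to the single-thread rate. */
    double speedup(double rate_mb_per_s, double single_thread_mb_per_s)
    {
        assert(single_thread_mb_per_s > 0.0);
        return rate_mb_per_s / single_thread_mb_per_s;
    }

    /* Speedup per thread; 1.0 would be perfectly linear scaling. */
    double parallel_efficiency(double rate_mb_per_s, double single_thread_mb_per_s, int threads)
    {
        assert(threads > 0);
        return speedup(rate_mb_per_s, single_thread_mb_per_s) / threads;
    }
    ```

    With the figures above, speedup(4121, 1826) is about 2.26 for 3 threads, while speedup(10020, 1826) is about 5.49 for 4 threads, which is more than 4 times the single-thread rate.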

    I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

    #include <windows.h>
    #include <string.h>

    #define NUM_CPY_THREADS 4   // must not exceed MAXIMUM_WAIT_OBJECTS (64)

    HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
    volatile LONG copyThreadsQuit = 0;   // set to 1 to ask the workers to exit

    typedef struct
    {
        int ct;             // index of the worker that owns this slot
        const void * src;
        void * dest;
        size_t size;
    } mt_cpy_t;

    mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

    // worker: wait for a start signal, copy the assigned chunk, signal completion
    DWORD WINAPI thread_copy_proc(LPVOID param)
    {
        mt_cpy_t * p = (mt_cpy_t * ) param;

        while(1)
        {
            WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
            if(copyThreadsQuit)
                break;
            memcpy(p->dest, p->src, p->size);
            ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
        }

        return 0;
    }

    // create the semaphores and worker threads; call once before mt_memcpy
    int startCopyThreads(void)
    {
        copyThreadsQuit = 0;
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            mtParamters[ctr].ct = ctr;
            hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL);
        }

        return 0;
    }

    // split the copy into NUM_CPY_THREADS chunks and block until all are done
    void * mt_memcpy(void * dest, const void * src, size_t bytes)
    {
        //set up parameters
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].src = (const char *) src + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
        }

        //release semaphores to start computation
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

        //wait for all threads to finish
        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

        return dest;
    }

    // ask the workers to exit (instead of TerminateThread), wait for them,
    // then release every handle, including the thread handles
    int stopCopyThreads(void)
    {
        copyThreadsQuit = 1;
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyThreads, TRUE, INFINITE);

        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            CloseHandle(hCopyThreads[ctr]);
            CloseHandle(hCopyStartSemaphores[ctr]);
            CloseHandle(hCopyStopSemaphores[ctr]);
        }
        return 0;
    }
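
    The chunk arithmetic in mt_memcpy is easy to get wrong when the size is not a multiple of the thread count, so here it is isolated as a small portable sketch. chunk_t and chunk_for_worker are names introduced for this illustration and are not part of the code above.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Byte range handled by one worker: [offset, offset + size). */
    typedef struct {
        size_t offset;
        size_t size;
    } chunk_t;

    /* Same arithmetic as mt_memcpy: worker i starts at i * bytes / n and
       stops where worker i + 1 starts, so the integer-division remainder
       is spread across the workers instead of being dropped.
       As in the original, worker * bytes can overflow for very large copies. */
    chunk_t chunk_for_worker(size_t bytes, size_t worker, size_t num_workers)
    {
        assert(num_workers > 0 && worker < num_workers);
        chunk_t c;
        c.offset = worker * bytes / num_workers;
        c.size = (worker + 1) * bytes / num_workers - c.offset;
        return c;
    }
    ```

    With 10 bytes and 4 workers, the chunks start at offsets 0, 2, 5 and 7 and have sizes 2, 3, 2 and 3, which add up to the full 10 bytes with no gaps or overlap.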
    