How to increase performance of memcpy

Backend · unresolved · 8 answers · 2112 views
春和景丽 · 2020-12-04 07:11

Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?
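A minimal sketch of the kind of single-threaded bandwidth test described above (illustrative only; the 4MB block size, iteration count, and QueryPerformanceCounter timing are assumptions, not necessarily the code used for the original measurements):

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(void)
    {
        const size_t block = 4 * 1024 * 1024;   // 4MB block
        const int    iters = 1000;              // illustrative iteration count
        char *src  = (char *) malloc(block);
        char *dest = (char *) malloc(block);
        memset(src, 1, block);                  // touch the buffers so the pages are committed
        memset(dest, 0, block);
    
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        for (int i = 0; i < iters; i++)
            memcpy(dest, src, block);
        QueryPerformanceCounter(&t1);
    
        // Bandwidth counts the copied bytes once (not read + write).
        double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        double mb   = (double)block * iters / (1024.0 * 1024.0);
        printf("%.0f MB/sec\n", mb / secs);
    
        free(src);
        free(dest);
        return 0;
    }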

Full details:

8 Answers
  •  北荒 (OP) · 2020-12-04 08:12

    I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the region to be copied between threads. Here are some performance scaling numbers for a fixed block size, using the same timing code as found above. I had no idea that performance, especially for such a small block size, would keep scaling to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

    Performance (10000x 4MB block memcpy):
    
     1 thread :  1826 MB/sec
     2 threads:  3118 MB/sec
     3 threads:  4121 MB/sec
     4 threads: 10020 MB/sec
     5 threads: 12848 MB/sec
     6 threads: 14340 MB/sec
     8 threads: 17892 MB/sec
    10 threads: 21781 MB/sec
    12 threads: 25721 MB/sec
    14 threads: 25318 MB/sec
    16 threads: 19965 MB/sec
    24 threads: 13158 MB/sec
    32 threads: 12497 MB/sec
    

    I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

    I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; it may need to be added for your application.

    #include <windows.h>    // Win32 threads, semaphores, and wait functions
    #include <string.h>     // memcpy
    
    #define NUM_CPY_THREADS 4
    
    HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
    HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
    typedef struct
    {
        int ct;
        void * src, * dest;
        size_t size;
    } mt_cpy_t;
    
    mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};
    
    // Each worker loops forever: wait for its start semaphore, copy its assigned
    // slice, then signal the corresponding stop semaphore.
    DWORD WINAPI thread_copy_proc(LPVOID param)
    {
        mt_cpy_t * p = (mt_cpy_t * ) param;
    
        while(1)
        {
            WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
            memcpy(p->dest, p->src, p->size);
            ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
        }
    
        return 0;
    }
    
    // Create one start/stop semaphore pair and one persistent worker thread per slice.
    int startCopyThreads()
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
            mtParamters[ctr].ct = ctr;
            hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL); 
        }
    
        return 0;
    }
    
    void * mt_memcpy(void * dest, void * src, size_t bytes)
    {
        // Set up per-thread slice parameters. The boundary arithmetic ensures the
        // slices exactly cover 'bytes' even when it is not evenly divisible.
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
            mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
        }
    
        //release semaphores to start computation
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
            ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);
    
        //wait for all threads to finish
        WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);
    
        return dest;
    }
    
    int stopCopyThreads()
    {
        for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        {
            // The workers never leave their wait loop, so they are terminated directly here.
            TerminateThread(hCopyThreads[ctr], 0);
            CloseHandle(hCopyStartSemaphores[ctr]);
            CloseHandle(hCopyStopSemaphores[ctr]);
        }
        return 0;
    }
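    
    To use this, the call sequence would look roughly like the sketch below. The driver program is illustrative only (the buffer size and iteration count simply mirror the measurements above) and is not part of the original code.
    
    static char src_buf[4 * 1024 * 1024];    // 4MB buffers, matching the block size measured above
    static char dest_buf[4 * 1024 * 1024];
    
    int main(void)
    {
        memset(src_buf, 1, sizeof(src_buf));       // touch the source so its pages are committed
    
        startCopyThreads();                        // create the worker threads and semaphores once
    
        for(int i = 0; i < 10000; i++)             // 10000 copies of a 4MB block, as in the table above
            mt_memcpy(dest_buf, src_buf, sizeof(src_buf));
    
        stopCopyThreads();                         // terminate the workers and close the handles
        return 0;
    }
    
    Note that the worker threads are created once and reused across calls, so the per-copy overhead is limited to the semaphore signalling.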
    
